From: Devesh Sharma <devesh.sharma@avagotech.com>
To: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	linux-rdma@vger.kernel.org,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()
Date: Fri, 8 May 2015 20:54:07 +0530
Message-ID: <CANjDDBiGLcaAofGwz6OGEXUUE_b2rcZepv0ebvTc-XNVEBq5Mw@mail.gmail.com>
In-Reply-To: <554B3EEB.7070302@dev.mellanox.co.il>

On Thu, May 7, 2015 at 4:01 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>
>> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
>> pool lock is expensive, and unnecessary because most modern adapters
>> can transfer 100s of KBs of payload using just a single MR.
>>
>> Instead, acquire MRs one-at-a-time as chunks are registered, and
>> return them to rb_mws immediately during deregistration.
>>
>> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
>> registration fails") is reverted: There is now a valid case where
>> registration can fail (with -ENOMEM) but the QP is still in RTS.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>   net/sunrpc/xprtrdma/frwr_ops.c |  120 ++++++++++++++++++++++++++++------------
>>   net/sunrpc/xprtrdma/rpc_rdma.c |    3 -
>>   net/sunrpc/xprtrdma/verbs.c    |   21 -------
>>   3 files changed, 86 insertions(+), 58 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index a06d9a3..6f93a89 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -11,6 +11,62 @@
>>    * but most complex memory registration mode.
>>    */
>>
>> +/* Normal operation
>> + *
>> + * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
>> + * Work Request (frmr_op_map). When the RDMA operation is finished, this
>> + * Memory Region is invalidated using a LOCAL_INV Work Request
>> + * (frmr_op_unmap).
>> + *
>> + * Typically these Work Requests are not signaled, and neither are RDMA
>> + * SEND Work Requests (with the exception of signaling occasionally to
>> + * prevent provider work queue overflows). This greatly reduces HCA
>> + * interrupt workload.
>> + *
>> + * As an optimization, frwr_op_unmap marks MRs INVALID before the
>> + * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
>> + * rb_mws immediately so that no work (like managing a linked list
>> + * under a spinlock) is needed in the completion upcall.
>> + *
>> + * But this means that frwr_op_map() can occasionally encounter an MR
>> + * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
>> + * ordering prevents a subsequent FAST_REG WR from executing against
>> + * that MR while it is still being invalidated.
>> + */
>> +
>> +/* Transport recovery
>> + *
>> + * ->op_map and the transport connect worker cannot run at the same
>> + * time, but ->op_unmap can fire while the transport connect worker
>> + * is running. Thus MR recovery is handled in ->op_map, to guarantee
>> + * that recovered MRs are owned by a sending RPC, and not one where
>> + * ->op_unmap could fire at the same time transport reconnect is
>> + * being done.
>> + *
>> + * When the underlying transport disconnects, MRs are left in one of
>> + * three states:
>> + *
>> + * INVALID:    The MR was not in use before the QP entered ERROR state.
>> + *             (Or, the LOCAL_INV WR has not completed or flushed yet).
>> + *
>> + * STALE:      The MR was being registered or unregistered when the QP
>> + *             entered ERROR state, and the pending WR was flushed.
>> + *
>> + * VALID:      The MR was registered before the QP entered ERROR state.
>> + *
>> + * When frwr_op_map encounters STALE and VALID MRs, they are recovered
>> + * with ib_dereg_mr and then are re-initialized. Because MR recovery
>> + * allocates fresh resources, it is deferred to a workqueue, and the
>> + * recovered MRs are placed back on the rb_mws list when recovery is
>> + * complete. frwr_op_map allocates another MR for the current RPC while
>> + * the broken MR is reset.
>> + *
>> + * To ensure that frwr_op_map doesn't encounter an MR that is marked
>> + * INVALID but that is about to be flushed due to a previous transport
>> + * disconnect, the transport connect worker attempts to drain all
>> + * pending send queue WRs before the transport is reconnected.
>> + */
>> +
>>   #include "xprt_rdma.h"
>>
>>   #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>> @@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>         struct ib_device *device = ia->ri_device;
>>         enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>         struct rpcrdma_mr_seg *seg1 = seg;
>> -       struct rpcrdma_mw *mw = seg1->rl_mw;
>> -       struct rpcrdma_frmr *frmr = &mw->r.frmr;
>> -       struct ib_mr *mr = frmr->fr_mr;
>> +       struct rpcrdma_mw *mw;
>> +       struct rpcrdma_frmr *frmr;
>> +       struct ib_mr *mr;
>>         struct ib_send_wr fastreg_wr, *bad_wr;
>>         u8 key;
>>         int len, pageoff;
>> @@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>         u64 pa;
>>         int page_no;
>>
>> +       mw = seg1->rl_mw;
>> +       seg1->rl_mw = NULL;
>> +       do {
>> +               if (mw)
>> +                       __frwr_queue_recovery(mw);
>> +               mw = rpcrdma_get_mw(r_xprt);
>> +               if (!mw)
>> +                       return -ENOMEM;
>> +       } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
>> +       frmr = &mw->r.frmr;
>> +       frmr->fr_state = FRMR_IS_VALID;
>> +
>>         pageoff = offset_in_page(seg1->mr_offset);
>>         seg1->mr_offset -= pageoff;     /* start of page */
>>         seg1->mr_len += pageoff;
>>         len = -pageoff;
>>         if (nsegs > ia->ri_max_frmr_depth)
>>                 nsegs = ia->ri_max_frmr_depth;
>> +
>>         for (page_no = i = 0; i < nsegs;) {
>>                 rpcrdma_map_one(device, seg, direction);
>>                 pa = seg->mr_dma;
>> @@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>         dprintk("RPC:       %s: Using frmr %p to map %d segments (%d bytes)\n",
>>                 __func__, mw, i, len);
>>
>> -       frmr->fr_state = FRMR_IS_VALID;
>> -
>>         memset(&fastreg_wr, 0, sizeof(fastreg_wr));
>>         fastreg_wr.wr_id = (unsigned long)(void *)mw;
>>         fastreg_wr.opcode = IB_WR_FAST_REG_MR;
>> @@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>         fastreg_wr.wr.fast_reg.access_flags = writing ?
>>                                 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
>>                                 IB_ACCESS_REMOTE_READ;
>> +       mr = frmr->fr_mr;
>>         key = (u8)(mr->rkey & 0x000000FF);
>>         ib_update_fast_reg_key(mr, ++key);
>>         fastreg_wr.wr.fast_reg.rkey = mr->rkey;
>> @@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>         if (rc)
>>                 goto out_senderr;
>>
>> +       seg1->rl_mw = mw;
>>         seg1->mr_rkey = mr->rkey;
>>         seg1->mr_base = seg1->mr_dma + pageoff;
>>         seg1->mr_nsegs = i;
>> @@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>
>>   out_senderr:
>>         dprintk("RPC:       %s: ib_post_send status %i\n", __func__, rc);
>> -       ib_update_fast_reg_key(mr, --key);
>> -       frmr->fr_state = FRMR_IS_INVALID;
>>         while (i--)
>>                 rpcrdma_unmap_one(device, --seg);
>> +       __frwr_queue_recovery(mw);
>>         return rc;
>>   }
>>
>> @@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>   {
>>         struct rpcrdma_mr_seg *seg1 = seg;
>>         struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +       struct rpcrdma_mw *mw = seg1->rl_mw;
>>         struct ib_send_wr invalidate_wr, *bad_wr;
>>         int rc, nsegs = seg->mr_nsegs;
>>
>> -       seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> +       dprintk("RPC:       %s: FRMR %p\n", __func__, mw);
>> +
>> +       seg1->rl_mw = NULL;
>> +       mw->r.frmr.fr_state = FRMR_IS_INVALID;
>>
>>         memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> -       invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>> +       invalidate_wr.wr_id = (unsigned long)(void *)mw;
>>         invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> -       invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
>> +       invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
>>         DECR_CQCOUNT(&r_xprt->rx_ep);
>>
>>         while (seg1->mr_nsegs--)
>> @@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>         read_unlock(&ia->ri_qplock);
>>         if (rc)
>>                 goto out_err;
>> +
>> +       rpcrdma_put_mw(r_xprt, mw);
>>         return nsegs;
>>
>>   out_err:
>> -       /* Force rpcrdma_buffer_get() to retry */
>> -       seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
>>         dprintk("RPC:       %s: ib_post_send status %i\n", __func__, rc);
>> +       __frwr_queue_recovery(mw);
>>         return nsegs;
>>   }
>>
>> @@ -370,29 +443,6 @@ out_err:
>>   static void
>>   frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>>   {
>> -       struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> -       struct ib_device *device = r_xprt->rx_ia.ri_device;
>> -       unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
>> -       struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
>> -       struct rpcrdma_mw *r;
>> -       int rc;
>> -
>> -       list_for_each_entry(r, &buf->rb_all, mw_all) {
>> -               if (r->r.frmr.fr_state == FRMR_IS_INVALID)
>> -                       continue;
>> -
>> -               __frwr_release(r);
>> -               rc = __frwr_init(r, pd, device, depth);
>> -               if (rc) {
>> -                       dprintk("RPC:       %s: mw %p left %s\n",
>> -                               __func__, r,
>> -                               (r->r.frmr.fr_state == FRMR_IS_STALE ?
>> -                                       "stale" : "valid"));
>> -                       continue;
>> -               }
>> -
>> -               r->r.frmr.fr_state = FRMR_IS_INVALID;
>> -       }
>>   }
>>
>>   static void
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 98a3b95..35ead0b 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
>>         return (unsigned char *)iptr - (unsigned char *)headerp;
>>
>>   out:
>> -       if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
>> -               return n;
>> -
>>         for (pos = 0; nchunks--;)
>>                 pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
>>                                                       &req->rl_segments[pos]);
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 8a43c7ef..5226161 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
>>   struct rpcrdma_req *
>>   rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>   {
>> -       struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>> -       struct list_head stale;
>>         struct rpcrdma_req *req;
>>         unsigned long flags;
>>
>>         spin_lock_irqsave(&buffers->rb_lock, flags);
>> +
>>         if (buffers->rb_send_index == buffers->rb_max_requests) {
>>                 spin_unlock_irqrestore(&buffers->rb_lock, flags);
>>                 dprintk("RPC:       %s: out of request buffers\n", __func__);
>> @@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>         }
>>         buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
>>
>> -       INIT_LIST_HEAD(&stale);
>> -       switch (ia->ri_memreg_strategy) {
>> -       case RPCRDMA_FRMR:
>> -               req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
>> -               break;
>> -       default:
>> -               break;
>> -       }
>>         spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> -       if (!list_empty(&stale))
>> -               rpcrdma_retry_flushed_linv(&stale, buffers);
>>         return req;
>>   }
>>
>> @@ -1389,18 +1378,10 @@ void
>>   rpcrdma_buffer_put(struct rpcrdma_req *req)
>>   {
>>         struct rpcrdma_buffer *buffers = req->rl_buffer;
>> -       struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>>         unsigned long flags;
>>
>>         spin_lock_irqsave(&buffers->rb_lock, flags);
>>         rpcrdma_buffer_put_sendbuf(req, buffers);
>> -       switch (ia->ri_memreg_strategy) {
>> -       case RPCRDMA_FRMR:
>> -               rpcrdma_buffer_put_mrs(req, buffers);
>> -               break;
>> -       default:
>> -               break;
>> -       }
>>         spin_unlock_irqrestore(&buffers->rb_lock, flags);
>>   }
>>
>>
>
> Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
> about to destroy the endpoint (and the buffers and the MRs...)?

I agree with Sagi here: in xprt_rdma_destroy(), flushing the recovery
workqueue and cancelling any pending work before calling
rpcrdma_buffer_destroy() seems required.
Also, with this optimization, is it possible that the server starts
traffic on an rkey whose FAST_REG WR has not yet completed? That would
cause an access-error async event on the client side, and the server
would see flush errors.
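
To make the window concrete: the first post below is the real call from
frwr_op_map above, while the second is only illustrative of the send
path (send_wr is a stand-in, not copied code):

    rc = ib_post_send(ia->ri_id->qp, &fastreg_wr, &bad_wr);
    /* unsignaled FAST_REG WR: no completion is awaited */
    ...
    rc = ib_post_send(ia->ri_id->qp, &send_wr, &bad_wr);
    /* The RPC call in send_wr advertises mr->rkey. If the server
     * issues an RDMA operation against that rkey before the
     * FAST_REG WR has executed, the client HCA raises an
     * access-error asynchronous event and the server's RDMA ops
     * complete with flush errors.
     */
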

-- 
-Regards
Devesh
