From: Steve Wise <swise@opengridcomputing.com>
To: Sagi Grimberg <sagig@dev.mellanox.co.il>,
Chuck Lever <chuck.lever@oracle.com>,
linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
Date: Wed, 16 Apr 2014 08:30:48 -0500
Message-ID: <534E8608.8030801@opengridcomputing.com>
In-Reply-To: <534E7C1C.5070407@dev.mellanox.co.il>
On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>> The current CQ handler uses the ib_wc.opcode field to distinguish
>> between event types. However, the contents of that field are not
>> reliable if the completion status is not IB_WC_SUCCESS.
>>
>> When an error completion occurs on a send event, the CQ handler
>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>> This is never correct behavior, and sometimes it results in a panic.
>>
>> To resolve this issue, split the completion queue into a send CQ and
>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>> wr_id's, and the receive CQ handler now handles only struct
>> rpcrdma_rep wr_id's.
>
> Hey Chuck,
>
> So 2 suggestions related (although not directly) to this one.
>
> 1. I recommend suppressing Fastreg completions - no one cares that
> they succeeded.
>
Not true. The nfsrdma client uses FRMRs across reconnects for the same
mount and needs to know, at any point in time, whether an FRMR is
registered or invalid. So completions of both fastreg and invalidate
need to be signaled. See:

commit 5c635e09cec0feeeb310968e51dad01040244851
Author: Tom Tucker <tom@ogc.us>
Date:   Wed Feb 9 19:45:34 2011 +0000

    RPCRDMA: Fix FRMR registration/invalidate handling.
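
The reason the client has to track that state is that the registration
path consults it before re-using an FRMR. Roughly, the idea is something
like the sketch below (illustration only, not the exact xprtrdma code;
the helper name and the fr_mr field name are approximations):

/* Sketch: if a prior registration is still valid (e.g. it survived a
 * reconnect), the FRMR must be invalidated before it can be registered
 * again.  The signaled FASTREG/LOCAL_INV completions are what keep
 * mw->r.frmr.state accurate.  Assumes the xprt_rdma.h definitions.
 */
static int frmr_maybe_invalidate(struct rpcrdma_mw *mw, struct ib_qp *qp)
{
	struct ib_send_wr invalidate_wr, *bad_wr;

	if (mw->r.frmr.state != FRMR_IS_VALID)
		return 0;		/* already invalid, safe to register */

	memset(&invalidate_wr, 0, sizeof(invalidate_wr));
	invalidate_wr.wr_id = (unsigned long)mw;
	invalidate_wr.opcode = IB_WR_LOCAL_INV;
	invalidate_wr.send_flags = IB_SEND_SIGNALED;	/* we need this completion */
	invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
	return ib_post_send(qp, &invalidate_wr, &bad_wr);
}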
> 2. If I understood correctly, I see that the CQs are loop-polled in
> interrupt context. This may become problematic under a stress workload
> where the CQ simply never drains (imagine some studious task keeps
> posting WRs as soon as IOs complete). This will cause the CQ poll loop
> to go on forever, and may starve other CQs that share the same EQ
> (the CPU is hogged by the endless poller).
> This may be solved in 2 ways:
> - Use blk-iopoll to maintain fairness - the downside is moving from
>   interrupt context to soft-irq (slower).
> - Set some configurable budget for the CQ poller - after finishing
>   the budget, notify the CQ and bail. If there is another CQ waiting
>   to get in, it's only fair that it gets a chance; otherwise another
>   interrupt on that CQ will come immediately.
>
> I noticed this with iSER, which polls from tasklet context
> (under heavy workload).
>
> Sagi.
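
FWIW, a budget-based poller along the lines suggested above might look
roughly like the untested sketch below. It reuses the handler names from
this patch, but the budget value and the rescheduling detail are made up
for illustration:

/* Untested sketch: poll at most a fixed number of completions per
 * upcall so one busy CQ cannot starve other CQs sharing the same
 * event queue/CPU.  RPCRDMA_CQ_BUDGET is an invented value.
 */
#define RPCRDMA_CQ_BUDGET	64

static void
rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
{
	struct ib_wc wc;
	int polled = 0;

	while (polled < RPCRDMA_CQ_BUDGET && ib_poll_cq(cq, 1, &wc) == 1) {
		rpcrdma_send_event_process(&wc);
		polled++;
	}

	if (polled == RPCRDMA_CQ_BUDGET) {
		/* Budget exhausted: re-arm and bail so another CQ sharing
		 * the vector gets a chance.  Completions already queued do
		 * not trigger a new interrupt by themselves, so a real
		 * implementation would check IB_CQ_REPORT_MISSED_EVENTS
		 * and reschedule itself (tasklet/workqueue/blk-iopoll).
		 */
		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
		return;
	}

	/* Drained within budget: re-arm, then poll once more to close the
	 * race between the final poll and the re-arm, as this patch does.
	 */
	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
	while (ib_poll_cq(cq, 1, &wc) == 1)
		rpcrdma_send_event_process(&wc);
}
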
>
>> Fix suggested by Shirley Ma <shirley.ma@oracle.com>
>>
>> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
>> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
>> Tested-by: Klemens Senn <klemens.senn@ims.co.at>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>
>> net/sunrpc/xprtrdma/verbs.c | 228 +++++++++++++++++----------------
>> net/sunrpc/xprtrdma/xprt_rdma.h | 1
>> 2 files changed, 135 insertions(+), 94 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 0f8c22c..35f5ab6 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
>> }
>> }
>> -static inline
>> -void rpcrdma_event_process(struct ib_wc *wc)
>> +static void
>> +rpcrdma_send_event_process(struct ib_wc *wc)
>> {
>> - struct rpcrdma_mw *frmr;
>> - struct rpcrdma_rep *rep =
>> - (struct rpcrdma_rep *)(unsigned long) wc->wr_id;
>> + struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>> - dprintk("RPC: %s: event rep %p status %X opcode %X
>> length %u\n",
>> - __func__, rep, wc->status, wc->opcode, wc->byte_len);
>> + dprintk("RPC: %s: frmr %p status %X opcode %d\n",
>> + __func__, frmr, wc->status, wc->opcode);
>> - if (!rep) /* send completion that we don't care about */
>> + if (wc->wr_id == 0ULL)
>> return;
>> -
>> - if (IB_WC_SUCCESS != wc->status) {
>> - dprintk("RPC: %s: WC opcode %d status %X, connection
>> lost\n",
>> - __func__, wc->opcode, wc->status);
>> - rep->rr_len = ~0U;
>> - if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>> - rpcrdma_schedule_tasklet(rep);
>> + if (wc->status != IB_WC_SUCCESS)
>> return;
>> - }
>> - switch (wc->opcode) {
>> - case IB_WC_FAST_REG_MR:
>> - frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>> + if (wc->opcode == IB_WC_FAST_REG_MR)
>> frmr->r.frmr.state = FRMR_IS_VALID;
>> - break;
>> - case IB_WC_LOCAL_INV:
>> - frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>> + else if (wc->opcode == IB_WC_LOCAL_INV)
>> frmr->r.frmr.state = FRMR_IS_INVALID;
>> - break;
>> - case IB_WC_RECV:
>> - rep->rr_len = wc->byte_len;
>> - ib_dma_sync_single_for_cpu(
>> - rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>> - rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>> - /* Keep (only) the most recent credits, after check validity */
>> - if (rep->rr_len >= 16) {
>> - struct rpcrdma_msg *p =
>> - (struct rpcrdma_msg *) rep->rr_base;
>> - unsigned int credits = ntohl(p->rm_credit);
>> - if (credits == 0) {
>> - dprintk("RPC: %s: server"
>> - " dropped credits to 0!\n", __func__);
>> - /* don't deadlock */
>> - credits = 1;
>> - } else if (credits > rep->rr_buffer->rb_max_requests) {
>> - dprintk("RPC: %s: server"
>> - " over-crediting: %d (%d)\n",
>> - __func__, credits,
>> - rep->rr_buffer->rb_max_requests);
>> - credits = rep->rr_buffer->rb_max_requests;
>> - }
>> - atomic_set(&rep->rr_buffer->rb_credits, credits);
>> - }
>> - rpcrdma_schedule_tasklet(rep);
>> - break;
>> - default:
>> - dprintk("RPC: %s: unexpected WC event %X\n",
>> - __func__, wc->opcode);
>> - break;
>> - }
>> }
>> -static inline int
>> -rpcrdma_cq_poll(struct ib_cq *cq)
>> +static int
>> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>> {
>> struct ib_wc wc;
>> int rc;
>> - for (;;) {
>> - rc = ib_poll_cq(cq, 1, &wc);
>> - if (rc < 0) {
>> - dprintk("RPC: %s: ib_poll_cq failed %i\n",
>> - __func__, rc);
>> - return rc;
>> - }
>> - if (rc == 0)
>> - break;
>> + while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>> + rpcrdma_send_event_process(&wc);
>> + return rc;
>> +}
>> - rpcrdma_event_process(&wc);
>> +/*
>> + * Handle send, fast_reg_mr, and local_inv completions.
>> + *
>> + * Send events are typically suppressed and thus do not result
>> + * in an upcall. Occasionally one is signaled, however. This
>> + * prevents the provider's completion queue from wrapping and
>> + * losing a completion.
>> + */
>> +static void
>> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>> +{
>> + int rc;
>> +
>> + rc = rpcrdma_sendcq_poll(cq);
>> + if (rc) {
>> + dprintk("RPC: %s: ib_poll_cq failed: %i\n",
>> + __func__, rc);
>> + return;
>> }
>> + rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>> + if (rc) {
>> + dprintk("RPC: %s: ib_req_notify_cq failed: %i\n",
>> + __func__, rc);
>> + return;
>> + }
>> + rpcrdma_sendcq_poll(cq);
>> +}
>> - return 0;
>> +static void
>> +rpcrdma_recv_event_process(struct ib_wc *wc)
>> +{
>> + struct rpcrdma_rep *rep =
>> + (struct rpcrdma_rep *)(unsigned long)wc->wr_id;
>> +
>> + dprintk("RPC: %s: rep %p status %X opcode %X length %u\n",
>> + __func__, rep, wc->status, wc->opcode, wc->byte_len);
>> +
>> + if (wc->status != IB_WC_SUCCESS) {
>> + rep->rr_len = ~0U;
>> + goto out_schedule;
>> + }
>> + if (wc->opcode != IB_WC_RECV)
>> + return;
>> +
>> + rep->rr_len = wc->byte_len;
>> + ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>> + rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>> +
>> + if (rep->rr_len >= 16) {
>> + struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
>> + unsigned int credits = ntohl(p->rm_credit);
>> +
>> + if (credits == 0)
>> + credits = 1; /* don't deadlock */
>> + else if (credits > rep->rr_buffer->rb_max_requests)
>> + credits = rep->rr_buffer->rb_max_requests;
>> + atomic_set(&rep->rr_buffer->rb_credits, credits);
>> + }
>> +
>> +out_schedule:
>> + rpcrdma_schedule_tasklet(rep);
>> +}
>> +
>> +static int
>> +rpcrdma_recvcq_poll(struct ib_cq *cq)
>> +{
>> + struct ib_wc wc;
>> + int rc;
>> +
>> + while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>> + rpcrdma_recv_event_process(&wc);
>> + return rc;
>> }
>> /*
>> - * rpcrdma_cq_event_upcall
>> + * Handle receive completions.
>> *
>> - * This upcall handles recv and send events.
>> * It is reentrant but processes single events in order to maintain
>> * ordering of receives to keep server credits.
>> *
>> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>> * connection shutdown. That is, the structures required for
>> * the completion of the reply handler must remain intact until
>> * all memory has been reclaimed.
>> - *
>> - * Note that send events are suppressed and do not result in an upcall.
>> */
>> static void
>> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
>> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>> {
>> int rc;
>> - rc = rpcrdma_cq_poll(cq);
>> - if (rc)
>> + rc = rpcrdma_recvcq_poll(cq);
>> + if (rc) {
>> + dprintk("RPC: %s: ib_poll_cq failed: %i\n",
>> + __func__, rc);
>> return;
>> -
>> + }
>> rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>> if (rc) {
>> - dprintk("RPC: %s: ib_req_notify_cq failed %i\n",
>> + dprintk("RPC: %s: ib_req_notify_cq failed: %i\n",
>> __func__, rc);
>> return;
>> }
>> -
>> - rpcrdma_cq_poll(cq);
>> + rpcrdma_recvcq_poll(cq);
>> }
>> #ifdef RPC_DEBUG
>> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>> struct rpcrdma_create_data_internal *cdata)
>> {
>> struct ib_device_attr devattr;
>> + struct ib_cq *sendcq, *recvcq;
>> int rc, err;
>> rc = ib_query_device(ia->ri_id->device, &devattr);
>> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>> ep->rep_attr.cap.max_recv_sge);
>> /* set trigger for requesting send completion */
>> - ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
>> + ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>> if (ep->rep_cqinit <= 2)
>> ep->rep_cqinit = 0;
>> INIT_CQCOUNT(ep);
>> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>> init_waitqueue_head(&ep->rep_connect_wait);
>> INIT_DELAYED_WORK(&ep->rep_connect_worker,
>> rpcrdma_connect_worker);
>> - ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
>> + sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>> rpcrdma_cq_async_error_upcall, NULL,
>> - ep->rep_attr.cap.max_recv_wr +
>> ep->rep_attr.cap.max_send_wr + 1, 0);
>> - if (IS_ERR(ep->rep_cq)) {
>> - rc = PTR_ERR(ep->rep_cq);
>> - dprintk("RPC: %s: ib_create_cq failed: %i\n",
>> + if (IS_ERR(sendcq)) {
>> + rc = PTR_ERR(sendcq);
>> + dprintk("RPC: %s: failed to create send CQ: %i\n",
>> __func__, rc);
>> goto out1;
>> }
>> - rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
>> + rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>> if (rc) {
>> dprintk("RPC: %s: ib_req_notify_cq failed: %i\n",
>> __func__, rc);
>> goto out2;
>> }
>> - ep->rep_attr.send_cq = ep->rep_cq;
>> - ep->rep_attr.recv_cq = ep->rep_cq;
>> + recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
>> + rpcrdma_cq_async_error_upcall, NULL,
>> + ep->rep_attr.cap.max_recv_wr + 1, 0);
>> + if (IS_ERR(recvcq)) {
>> + rc = PTR_ERR(recvcq);
>> + dprintk("RPC: %s: failed to create recv CQ: %i\n",
>> + __func__, rc);
>> + goto out2;
>> + }
>> +
>> + rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
>> + if (rc) {
>> + dprintk("RPC: %s: ib_req_notify_cq failed: %i\n",
>> + __func__, rc);
>> + ib_destroy_cq(recvcq);
>> + goto out2;
>> + }
>> +
>> + ep->rep_attr.send_cq = sendcq;
>> + ep->rep_attr.recv_cq = recvcq;
>> /* Initialize cma parameters */
>> @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>> return 0;
>> out2:
>> - err = ib_destroy_cq(ep->rep_cq);
>> + err = ib_destroy_cq(sendcq);
>> if (err)
>> dprintk("RPC: %s: ib_destroy_cq returned %i\n",
>> __func__, err);
>> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>> ep->rep_pad_mr = NULL;
>> }
>> - rpcrdma_clean_cq(ep->rep_cq);
>> - rc = ib_destroy_cq(ep->rep_cq);
>> + rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> + rc = ib_destroy_cq(ep->rep_attr.recv_cq);
>> + if (rc)
>> + dprintk("RPC: %s: ib_destroy_cq returned %i\n",
>> + __func__, rc);
>> +
>> + rpcrdma_clean_cq(ep->rep_attr.send_cq);
>> + rc = ib_destroy_cq(ep->rep_attr.send_cq);
>> if (rc)
>> dprintk("RPC: %s: ib_destroy_cq returned %i\n",
>> __func__, rc);
>> @@ -801,7 +841,9 @@ retry:
>> if (rc && rc != -ENOTCONN)
>> dprintk("RPC: %s: rpcrdma_ep_disconnect"
>> " status %i\n", __func__, rc);
>> - rpcrdma_clean_cq(ep->rep_cq);
>> +
>> + rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> + rpcrdma_clean_cq(ep->rep_attr.send_cq);
>> xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>> id = rpcrdma_create_id(xprt, ia,
>> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>> {
>> int rc;
>> - rpcrdma_clean_cq(ep->rep_cq);
>> + rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> + rpcrdma_clean_cq(ep->rep_attr.send_cq);
>> rc = rdma_disconnect(ia->ri_id);
>> if (!rc) {
>> /* returns without wait if not connected */
>> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>> ib_dma_sync_single_for_cpu(ia->ri_id->device,
>> rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>> - DECR_CQCOUNT(ep);
>> rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>> if (rc)
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 362a19d..334ab6e 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>> int rep_cqinit;
>> int rep_connected;
>> struct rpcrdma_ia *rep_ia;
>> - struct ib_cq *rep_cq;
>> struct ib_qp_init_attr rep_attr;
>> wait_queue_head_t rep_connect_wait;
>> struct ib_sge rep_pad; /* holds zeroed pad */
>>