From: Devesh Sharma <devesh.sharma@avagotech.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-rdma@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets
Date: Mon, 21 Sep 2015 14:21:21 +0530 [thread overview]
Message-ID: <CANjDDBhm=iXdRkLPWCX6gix=GRTeDb-Apb6zdaJp3NTGTyecBw@mail.gmail.com> (raw)
In-Reply-To: <A3B1A836-81AF-40E6-B551-587B7173B72D@oracle.com>
On Fri, Sep 18, 2015 at 7:49 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Devesh-
>
>
> On Sep 18, 2015, at 2:52 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
>
>> On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
>>> handler") was supposed to prevent xprtrdma's upcall handlers from
>>> starving other softIRQ work by letting them return to the provider
>>> before all CQEs have been polled.
>>>
>>> The logic assumes the provider will call the upcall handler again
>>> immediately if the CQ is re-armed while there are still queued CQEs.
>>>
>>> This assumption is invalid. The IBTA spec says that after a CQ is
>>> armed, the hardware must interrupt only when a new CQE is inserted.
>>> xprtrdma can't rely on the provider calling again, even though some
>>> providers do.
>>>
>>> Therefore, leaving CQEs on queue makes sense only when there is
>>> another mechanism that ensures all remaining CQEs are consumed in a
>>> timely fashion. xprtrdma does not have such a mechanism. If a CQE
>>> remains queued, the transport can wait forever to send the next RPC.
>>>
>>> Finally, move the wcs array back onto the stack to ensure that the
>>> poll array is always local to the CPU where the completion upcall is
>>> running.
>>>
>>> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c | 100 ++++++++++++++++++---------------------
>>> net/sunrpc/xprtrdma/xprt_rdma.h | 5 --
>>> 2 files changed, 45 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 8a477e2..f2e3863 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>>> }
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * The common case is a single completion is ready. By asking
>>> + * for two entries, a return code of 1 means there is exactly
>>> + * one completion and no more. We don't have to poll again to
>>> + * know that the CQ is now empty.
>>> + */
>>> +static void
>>> rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> - struct ib_wc *wcs;
>>> - int budget, count, rc;
>>> + struct ib_wc *pos, wcs[2];
>>> + int count, rc;
>>>
>>> - budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>> do {
>>> - wcs = ep->rep_send_wcs;
>>> + pos = wcs;
>>>
>>> - rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> - if (rc <= 0)
>>> - return rc;
>>> + rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> + if (rc < 0)
>>> + goto out_warn;
>>>
>>> count = rc;
>>> while (count-- > 0)
>>> - rpcrdma_sendcq_process_wc(wcs++);
>>> - } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> - return 0;
>>> + rpcrdma_sendcq_process_wc(pos++);
>>> + } while (rc == ARRAY_SIZE(wcs));
>>
>> I think I have missed something and not able to understand the reason
>> for polling 2 CQEs in one poll?
>
> See the block comment above.
>
> When ib_poll_cq() returns the same number of WCs as the
> consumer requested, there may still be CQEs waiting to
> be polled. Another call to ib_poll_cq() is needed to find
> out if that's the case.
True...
>
> When ib_poll_cq() returns fewer WCs than the consumer
> requested, the consumer doesn't have to call again to
> know that the CQ is empty.
Agree, the while loop will terminate here. But what if, immediately after
vendor_poll_cq() decides to report only 1 CQE and terminate the
polling loop, another CQE is added? That new CQE will be polled only
after T usec (where T is the interrupt latency).
>
> The common case, by far, is that one CQE is ready. By
> requesting two, the number returned is less than the
> number requested, and the consumer can tell immediately
> that the CQE is drained. The extra ib_poll_cq call is
> avoided.
>
> Note that the existing logic also relies on polling
> multiple WCs at once, and stopping the loop when the
> number of returned WCs is less than the size of the
> array.
There is also logic to perform extra polling if arm-cq reports missed
CQEs. If we change the while-loop, that arm-cq missed-CQE logic may
become removable.
>
>
>> It is possible that in a given poll_cq
>> call you end up getting on 1 completion, the other completion is
>> delayed due to some reason.
>
> If a CQE is allowed to be delayed, how does polling
> again guarantee that the consumer can retrieve it?
It's possible that, at the moment vendor_poll_cq() looked at the CQ
memory buffer and decided to report 1 CQE, another CQE was still in
flight, but poll_cq had already decided not to report 2.
>
> What happens if a signal occurs, there is only one CQE,
> but it is delayed? ib_poll_cq would return 0 in that
> case, and the consumer would never call again, thinking
> the CQ is empty. There's no way the consumer can know
> for sure when a CQ is drained.
Hardware usually guarantees to signal only after the CQE has been DMA'd.
>
> If the delayed CQE happens only when there is more
> than one CQE, how can polling multiple WCs ever work
> reliably?
>
> Maybe I don't understand what is meant by delayed.
>
>
>> Would it be better to poll for 1 in every
>> poll call Or
>> otherwise have this
>> while ( rc <= ARRAY_SIZE(wcs) && rc);
>>
>>> + return;
>>> +
>>> +out_warn:
>>> + pr_warn("RPC: %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> }
>>>
>>> -/*
>>> - * Handle send, fast_reg_mr, and local_inv completions.
>>> - *
>>> - * Send events are typically suppressed and thus do not result
>>> - * in an upcall. Occasionally one is signaled, however. This
>>> - * prevents the provider's completion queue from wrapping and
>>> - * losing a completion.
>>> +/* Handle provider send completion upcalls.
>>> */
>>> static void
>>> rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>> int rc;
>>>
>>> - rc = rpcrdma_sendcq_poll(cq, ep);
>>> - if (rc) {
>>> - dprintk("RPC: %s: ib_poll_cq failed: %i\n",
>>> - __func__, rc);
>>> - return;
>>> - }
>>> + rpcrdma_sendcq_poll(cq, ep);
>>>
>>> rc = ib_req_notify_cq(cq,
>>> IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> @@ -247,44 +245,41 @@ out_fail:
>>> goto out_schedule;
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * struct ib_wc is 64 bytes, making the poll array potentially
>>> + * large. But this is at the bottom of the call chain. Further
>>> + * substantial work is done in another thread.
>>> + */
>>> +static void
>>> rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> - struct list_head sched_list;
>>> - struct ib_wc *wcs;
>>> - int budget, count, rc;
>>> + struct ib_wc *pos, wcs[4];
>>> + LIST_HEAD(sched_list);
>>> + int count, rc;
>>>
>>> - INIT_LIST_HEAD(&sched_list);
>>> - budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>> do {
>>> - wcs = ep->rep_recv_wcs;
>>> + pos = wcs;
>>>
>>> - rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> - if (rc <= 0)
>>> - goto out_schedule;
>>> + rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> + if (rc < 0)
>>> + goto out_warn;
>>>
>>> count = rc;
>>> while (count-- > 0)
>>> - rpcrdma_recvcq_process_wc(wcs++, &sched_list);
>>> - } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> - rc = 0;
>>> + rpcrdma_recvcq_process_wc(pos++, &sched_list);
>>> + } while (rc == ARRAY_SIZE(wcs));
>>>
>>> out_schedule:
>>> rpcrdma_schedule_tasklet(&sched_list);
>>> - return rc;
>>> + return;
>>> +
>>> +out_warn:
>>> + pr_warn("RPC: %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> + goto out_schedule;
>>> }
>>>
>>> -/*
>>> - * Handle receive completions.
>>> - *
>>> - * It is reentrant but processes single events in order to maintain
>>> - * ordering of receives to keep server credits.
>>> - *
>>> - * It is the responsibility of the scheduled tasklet to return
>>> - * recv buffers to the pool. NOTE: this affects synchronization of
>>> - * connection shutdown. That is, the structures required for
>>> - * the completion of the reply handler must remain intact until
>>> - * all memory has been reclaimed.
>>> +/* Handle provider receive completion upcalls.
>>> */
>>> static void
>>> rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -292,12 +287,7 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>> struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>> int rc;
>>>
>>> - rc = rpcrdma_recvcq_poll(cq, ep);
>>> - if (rc) {
>>> - dprintk("RPC: %s: ib_poll_cq failed: %i\n",
>>> - __func__, rc);
>>> - return;
>>> - }
>>> + rpcrdma_recvcq_poll(cq, ep);
>>>
>>> rc = ib_req_notify_cq(cq,
>>> IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index c09414e..42c8d44 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -77,9 +77,6 @@ struct rpcrdma_ia {
>>> * RDMA Endpoint -- one per transport instance
>>> */
>>>
>>> -#define RPCRDMA_WC_BUDGET (128)
>>> -#define RPCRDMA_POLLSIZE (16)
>>> -
>>> struct rpcrdma_ep {
>>> atomic_t rep_cqcount;
>>> int rep_cqinit;
>>> @@ -89,8 +86,6 @@ struct rpcrdma_ep {
>>> struct rdma_conn_param rep_remote_cma;
>>> struct sockaddr_storage rep_remote_addr;
>>> struct delayed_work rep_connect_worker;
>>> - struct ib_wc rep_send_wcs[RPCRDMA_POLLSIZE];
>>> - struct ib_wc rep_recv_wcs[RPCRDMA_POLLSIZE];
>>> };
>>>
>>> /*
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>
Thread overview: 42+ messages in thread
2015-09-17 20:44 [PATCH v1 00/18] RFC NFS/RDMA patches for merging into v4.4 Chuck Lever
2015-09-17 20:44 ` [PATCH v1 01/18] xprtrdma: Enable swap-on-NFS/RDMA Chuck Lever
2015-09-21 8:58 ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 02/18] xprtrdma: Replace global lkey with lkey local to PD Chuck Lever
2015-09-21 8:59 ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 03/18] xprtrdma: Remove completion polling budgets Chuck Lever
2015-09-18 6:52 ` Devesh Sharma
2015-09-18 14:19 ` Chuck Lever
2015-09-20 10:35 ` Sagi Grimberg
2015-09-21 8:51 ` Devesh Sharma
2015-09-21 15:45 ` Chuck Lever
2015-09-22 17:32 ` Devesh Sharma
2015-10-01 16:37 ` Chuck Lever
2015-10-01 17:13 ` Jason Gunthorpe
2015-10-01 17:36 ` Chuck Lever
2015-10-01 18:15 ` Jason Gunthorpe
2015-10-01 18:31 ` Chuck Lever
2015-10-01 18:38 ` Jason Gunthorpe
2015-09-21 8:51 ` Devesh Sharma [this message]
2015-09-17 20:44 ` [PATCH v1 04/18] xprtrdma: Refactor reply handler error handling Chuck Lever
2015-09-21 8:59 ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 05/18] xprtrdma: Replace send and receive arrays Chuck Lever
2015-09-20 10:52 ` Sagi Grimberg
2015-09-21 23:04 ` Chuck Lever
2015-09-17 20:45 ` [PATCH v1 06/18] SUNRPC: Abstract backchannel operations Chuck Lever
2015-09-17 20:45 ` [PATCH v1 07/18] xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers Chuck Lever
2015-09-21 10:28 ` Devesh Sharma
2015-09-17 20:45 ` [PATCH v1 08/18] xprtrdma: Pre-allocate Work Requests for backchannel Chuck Lever
2015-09-21 10:33 ` Devesh Sharma
2015-09-21 22:41 ` Chuck Lever
2015-09-17 20:45 ` [PATCH v1 09/18] xprtrdma: Add support for sending backward direction RPC replies Chuck Lever
2015-09-17 20:45 ` [PATCH v1 10/18] xprtrdma: Handle incoming backward direction RPC calls Chuck Lever
2015-09-17 20:45 ` [PATCH v1 11/18] svcrdma: Add backward direction service for RPC/RDMA transport Chuck Lever
2015-09-17 20:45 ` [PATCH v1 12/18] SUNRPC: Remove the TCP-only restriction in bc_svc_process() Chuck Lever
2015-09-17 20:45 ` [PATCH v1 13/18] NFS: Enable client side NFSv4.1 backchannel to use other transports Chuck Lever
2015-09-17 20:46 ` [PATCH v1 14/18] svcrdma: Define maximum number of backchannel requests Chuck Lever
2015-09-17 20:46 ` [PATCH v1 15/18] svcrdma: Add svc_rdma_get_context() API that is allowed to fail Chuck Lever
2015-09-20 12:40 ` Sagi Grimberg
2015-09-21 22:34 ` Chuck Lever
2015-09-17 20:46 ` [PATCH v1 16/18] svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls Chuck Lever
2015-09-17 20:46 ` [PATCH v1 17/18] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies Chuck Lever
2015-09-17 20:46 ` [PATCH v1 18/18] xprtrdma: Add class for RDMA backwards direction transport Chuck Lever