From: Devesh Sharma <devesh.sharma@avagotech.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-rdma@vger.kernel.org,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets
Date: Mon, 21 Sep 2015 14:21:21 +0530
Message-ID: <CANjDDBhm=iXdRkLPWCX6gix=GRTeDb-Apb6zdaJp3NTGTyecBw@mail.gmail.com>
In-Reply-To: <A3B1A836-81AF-40E6-B551-587B7173B72D@oracle.com>

On Fri, Sep 18, 2015 at 7:49 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Devesh-
>
>
> On Sep 18, 2015, at 2:52 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
>
>> On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
>>> handler") was supposed to prevent xprtrdma's upcall handlers from
>>> starving other softIRQ work by letting them return to the provider
>>> before all CQEs have been polled.
>>>
>>> The logic assumes the provider will call the upcall handler again
>>> immediately if the CQ is re-armed while there are still queued CQEs.
>>>
>>> This assumption is invalid. The IBTA spec says that after a CQ is
>>> armed, the hardware must interrupt only when a new CQE is inserted.
>>> xprtrdma can't rely on the provider calling again, even though some
>>> providers do.
>>>
>>> Therefore, leaving CQEs on queue makes sense only when there is
>>> another mechanism that ensures all remaining CQEs are consumed in a
>>> timely fashion. xprtrdma does not have such a mechanism. If a CQE
>>> remains queued, the transport can wait forever to send the next RPC.
>>>
>>> Finally, move the wcs array back onto the stack to ensure that the
>>> poll array is always local to the CPU where the completion upcall is
>>> running.
>>>
>>> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c     |  100 ++++++++++++++++++---------------------
>>> net/sunrpc/xprtrdma/xprt_rdma.h |    5 --
>>> 2 files changed, 45 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 8a477e2..f2e3863 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>>>        }
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * The common case is a single completion is ready. By asking
>>> + * for two entries, a return code of 1 means there is exactly
>>> + * one completion and no more. We don't have to poll again to
>>> + * know that the CQ is now empty.
>>> + */
>>> +static void
>>> rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> -       struct ib_wc *wcs;
>>> -       int budget, count, rc;
>>> +       struct ib_wc *pos, wcs[2];
>>> +       int count, rc;
>>>
>>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>>        do {
>>> -               wcs = ep->rep_send_wcs;
>>> +               pos = wcs;
>>>
>>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> -               if (rc <= 0)
>>> -                       return rc;
>>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> +               if (rc < 0)
>>> +                       goto out_warn;
>>>
>>>                count = rc;
>>>                while (count-- > 0)
>>> -                       rpcrdma_sendcq_process_wc(wcs++);
>>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> -       return 0;
>>> +                       rpcrdma_sendcq_process_wc(pos++);
>>> +       } while (rc == ARRAY_SIZE(wcs));
>>
>> I think I have missed something and am not able to understand the
>> reason for polling 2 CQEs in one poll?
>
> See the block comment above.
>
> When ib_poll_cq() returns the same number of WCs as the
> consumer requested, there may still be CQEs waiting to
> be polled. Another call to ib_poll_cq() is needed to find
> out if that's the case.

True...
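
For reference, the pattern under discussion reduces to roughly this
sketch (handle_wc() is a hypothetical consumer here, not a function
from the patch):

	struct ib_wc wcs[2];
	int i, rc;

	do {
		rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), wcs);
		if (rc < 0)
			break;			/* provider error */
		for (i = 0; i < rc; i++)
			handle_wc(&wcs[i]);	/* consume each WC */
	} while (rc == ARRAY_SIZE(wcs));	/* full batch: poll again */

A return of 1 means the CQ was empty at the moment ib_poll_cq()
looked, so the loop exits without a confirming extra call.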

>
> When ib_poll_cq() returns fewer WCs than the consumer
> requested, the consumer doesn't have to call again to
> know that the CQ is empty.

Agreed, the while loop will terminate here. But what if, immediately
after vendor_poll_cq() decides to report only 1 CQE and terminates the
polling loop, another CQE is added? That new CQE will be polled only
after T usec (where T is the interrupt latency).
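
To spell out the interleaving I mean (illustrative only, not code from
the patch):

	/*
	 * consumer: rc = ib_poll_cq(cq, 2, wcs); - provider sees 1 CQE
	 * hardware: DMAs a second CQE            - concurrently
	 * provider: returns rc = 1               - decision already made
	 * consumer: rc < 2, so the loop exits with one CQE still queued,
	 *           which is polled only after the next interrupt
	 */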

>
> The common case, by far, is that one CQE is ready. By
> requesting two, the number returned is less than the
> number requested, and the consumer can tell immediately
> that the CQ is drained. The extra ib_poll_cq call is
> avoided.
>
> Note that the existing logic also relies on polling
> multiple WCs at once, and stopping the loop when the
> number of returned WCs is less than the size of the
> array.

There is also logic that performs extra polling if arming the CQ
reports missed CQEs. If we change the while-loop, that arm-CQ
missed-CQE logic may become removable.
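
Concretely, that re-arm step (a sketch, assuming the upcall keeps the
IB_CQ_REPORT_MISSED_EVENTS flag it already passes) is what closes the
window above:

	rc = ib_req_notify_cq(cq,
			IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
	if (rc > 0)
		/* a CQE slipped in before the CQ was re-armed:
		 * poll once more instead of waiting for an interrupt
		 */
		rpcrdma_sendcq_poll(cq, ep);
	else if (rc < 0)
		pr_warn("RPC: %s: ib_req_notify_cq() failed %i\n",
			__func__, rc);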

>
>
>> It is possible that in a given poll_cq
>> call you end up getting only 1 completion, the other completion
>> being delayed for some reason.
>
> If a CQE is allowed to be delayed, how does polling
> again guarantee that the consumer can retrieve it?

It's possible that at the moment vendor_poll_cq() looked at the CQ
memory buffer and decided to report 1 CQE, another CQE was in flight,
but poll_cq had already decided not to report 2.

>
> What happens if a signal occurs, there is only one CQE,
> but it is delayed? ib_poll_cq would return 0 in that
> case, and the consumer would never call again, thinking
> the CQ is empty. There's no way the consumer can know
> for sure when a CQ is drained.

Hardware usually guarantees to signal only after the CQE has been DMA'd.

>
> If the delayed CQE happens only when there is more
> than one CQE, how can polling multiple WCs ever work
> reliably?
>
> Maybe I don't understand what is meant by delayed.
>
>
>> Would it be better to poll for 1 in every poll call? Or otherwise
>> have this:
>> while (rc <= ARRAY_SIZE(wcs) && rc);
>>
>>> +       return;
>>> +
>>> +out_warn:
>>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> }
>>>
>>> -/*
>>> - * Handle send, fast_reg_mr, and local_inv completions.
>>> - *
>>> - * Send events are typically suppressed and thus do not result
>>> - * in an upcall. Occasionally one is signaled, however. This
>>> - * prevents the provider's completion queue from wrapping and
>>> - * losing a completion.
>>> +/* Handle provider send completion upcalls.
>>>  */
>>> static void
>>> rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>>        int rc;
>>>
>>> -       rc = rpcrdma_sendcq_poll(cq, ep);
>>> -       if (rc) {
>>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> -                       __func__, rc);
>>> -               return;
>>> -       }
>>> +       rpcrdma_sendcq_poll(cq, ep);
>>>
>>>        rc = ib_req_notify_cq(cq,
>>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> @@ -247,44 +245,41 @@ out_fail:
>>>        goto out_schedule;
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * struct ib_wc is 64 bytes, making the poll array potentially
>>> + * large. But this is at the bottom of the call chain. Further
>>> + * substantial work is done in another thread.
>>> + */
>>> +static void
>>> rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> -       struct list_head sched_list;
>>> -       struct ib_wc *wcs;
>>> -       int budget, count, rc;
>>> +       struct ib_wc *pos, wcs[4];
>>> +       LIST_HEAD(sched_list);
>>> +       int count, rc;
>>>
>>> -       INIT_LIST_HEAD(&sched_list);
>>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>>        do {
>>> -               wcs = ep->rep_recv_wcs;
>>> +               pos = wcs;
>>>
>>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> -               if (rc <= 0)
>>> -                       goto out_schedule;
>>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> +               if (rc < 0)
>>> +                       goto out_warn;
>>>
>>>                count = rc;
>>>                while (count-- > 0)
>>> -                       rpcrdma_recvcq_process_wc(wcs++, &sched_list);
>>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> -       rc = 0;
>>> +                       rpcrdma_recvcq_process_wc(pos++, &sched_list);
>>> +       } while (rc == ARRAY_SIZE(wcs));
>>>
>>> out_schedule:
>>>        rpcrdma_schedule_tasklet(&sched_list);
>>> -       return rc;
>>> +       return;
>>> +
>>> +out_warn:
>>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> +       goto out_schedule;
>>> }
>>>
>>> -/*
>>> - * Handle receive completions.
>>> - *
>>> - * It is reentrant but processes single events in order to maintain
>>> - * ordering of receives to keep server credits.
>>> - *
>>> - * It is the responsibility of the scheduled tasklet to return
>>> - * recv buffers to the pool. NOTE: this affects synchronization of
>>> - * connection shutdown. That is, the structures required for
>>> - * the completion of the reply handler must remain intact until
>>> - * all memory has been reclaimed.
>>> +/* Handle provider receive completion upcalls.
>>>  */
>>> static void
>>> rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -292,12 +287,7 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>>        int rc;
>>>
>>> -       rc = rpcrdma_recvcq_poll(cq, ep);
>>> -       if (rc) {
>>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> -                       __func__, rc);
>>> -               return;
>>> -       }
>>> +       rpcrdma_recvcq_poll(cq, ep);
>>>
>>>        rc = ib_req_notify_cq(cq,
>>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index c09414e..42c8d44 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -77,9 +77,6 @@ struct rpcrdma_ia {
>>>  * RDMA Endpoint -- one per transport instance
>>>  */
>>>
>>> -#define RPCRDMA_WC_BUDGET      (128)
>>> -#define RPCRDMA_POLLSIZE       (16)
>>> -
>>> struct rpcrdma_ep {
>>>        atomic_t                rep_cqcount;
>>>        int                     rep_cqinit;
>>> @@ -89,8 +86,6 @@ struct rpcrdma_ep {
>>>        struct rdma_conn_param  rep_remote_cma;
>>>        struct sockaddr_storage rep_remote_addr;
>>>        struct delayed_work     rep_connect_worker;
>>> -       struct ib_wc            rep_send_wcs[RPCRDMA_POLLSIZE];
>>> -       struct ib_wc            rep_recv_wcs[RPCRDMA_POLLSIZE];
>>> };
>>>
>>> /*
>>>
>
> --
> Chuck Lever
>
>
>

Thread overview: 42+ messages
2015-09-17 20:44 [PATCH v1 00/18] RFC NFS/RDMA patches for merging into v4.4 Chuck Lever
2015-09-17 20:44 ` [PATCH v1 01/18] xprtrdma: Enable swap-on-NFS/RDMA Chuck Lever
2015-09-21  8:58   ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 02/18] xprtrdma: Replace global lkey with lkey local to PD Chuck Lever
2015-09-21  8:59   ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 03/18] xprtrdma: Remove completion polling budgets Chuck Lever
2015-09-18  6:52   ` Devesh Sharma
2015-09-18 14:19     ` Chuck Lever
2015-09-20 10:35       ` Sagi Grimberg
2015-09-21  8:51         ` Devesh Sharma
2015-09-21 15:45           ` Chuck Lever
2015-09-22 17:32             ` Devesh Sharma
2015-10-01 16:37               ` Chuck Lever
2015-10-01 17:13                 ` Jason Gunthorpe
2015-10-01 17:36                   ` Chuck Lever
2015-10-01 18:15                     ` Jason Gunthorpe
2015-10-01 18:31                       ` Chuck Lever
2015-10-01 18:38                         ` Jason Gunthorpe
2015-09-21  8:51       ` Devesh Sharma [this message]
2015-09-17 20:44 ` [PATCH v1 04/18] xprtrdma: Refactor reply handler error handling Chuck Lever
2015-09-21  8:59   ` Devesh Sharma
2015-09-17 20:44 ` [PATCH v1 05/18] xprtrdma: Replace send and receive arrays Chuck Lever
2015-09-20 10:52   ` Sagi Grimberg
2015-09-21 23:04     ` Chuck Lever
2015-09-17 20:45 ` [PATCH v1 06/18] SUNRPC: Abstract backchannel operations Chuck Lever
2015-09-17 20:45 ` [PATCH v1 07/18] xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers Chuck Lever
2015-09-21 10:28   ` Devesh Sharma
2015-09-17 20:45 ` [PATCH v1 08/18] xprtrdma: Pre-allocate Work Requests for backchannel Chuck Lever
2015-09-21 10:33   ` Devesh Sharma
2015-09-21 22:41     ` Chuck Lever
2015-09-17 20:45 ` [PATCH v1 09/18] xprtrdma: Add support for sending backward direction RPC replies Chuck Lever
2015-09-17 20:45 ` [PATCH v1 10/18] xprtrdma: Handle incoming backward direction RPC calls Chuck Lever
2015-09-17 20:45 ` [PATCH v1 11/18] svcrdma: Add backward direction service for RPC/RDMA transport Chuck Lever
2015-09-17 20:45 ` [PATCH v1 12/18] SUNRPC: Remove the TCP-only restriction in bc_svc_process() Chuck Lever
2015-09-17 20:45 ` [PATCH v1 13/18] NFS: Enable client side NFSv4.1 backchannel to use other transports Chuck Lever
2015-09-17 20:46 ` [PATCH v1 14/18] svcrdma: Define maximum number of backchannel requests Chuck Lever
2015-09-17 20:46 ` [PATCH v1 15/18] svcrdma: Add svc_rdma_get_context() API that is allowed to fail Chuck Lever
2015-09-20 12:40   ` Sagi Grimberg
2015-09-21 22:34     ` Chuck Lever
2015-09-17 20:46 ` [PATCH v1 16/18] svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls Chuck Lever
2015-09-17 20:46 ` [PATCH v1 17/18] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies Chuck Lever
2015-09-17 20:46 ` [PATCH v1 18/18] xprtrdma: Add class for RDMA backwards direction transport Chuck Lever
