Re: [PATCH 7/8] xprtrdma: Split the completion queue

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sagi Grimberg <sagig@dev.mellanox.co.il>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Steve Wise <swise@opengridcomputing.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
Date: Sun, 20 Apr 2014 15:42:55 +0300	[thread overview]
Message-ID: <5353C0CF.9080806@dev.mellanox.co.il> (raw)
In-Reply-To: <593D9BFA-714E-417F-ACA0-05594290C4D1@oracle.com>

On 4/19/2014 7:31 PM, Chuck Lever wrote:
> Hi Sagi-
>
> On Apr 17, 2014, at 3:11 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> On 4/17/2014 5:34 PM, Steve Wise wrote:
>>
>> <SNIP>
>>> You could use a small array combined with a loop and a budget count.  So the code would
>>> grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
>>> desired budget is reached...
>> Bingo... couldn't agree more.
>>
>> Poll Arrays are a nice optimization,
> Typically, a provider's poll_cq implementation takes the CQ lock
> using spin_lock_irqsave().  My goal of using a poll array is to
> reduce the number of times the completion handler invokes
> spin_lock_irqsave / spin_unlock_irqsave pairs when draining a
> large queue.

Yes, hence the optimization.

>
>> but large arrays will just burden the stack (and might even make things worse in high workloads...)
>
> My prototype moves the poll array off the stack and into allocated
> storage.  Making that array as large as a single page would be
> sufficient for 50 or more ib_wc structures on a platform with 4KB
> pages and 64-bit addresses.

You assume here the worst-case workload. In the sparse case you are 
carrying around a redundant storage space...
I would recommend using say 16 wc array or so.

> The xprtrdma completion handler polls twice:
>
>    1.  Drain the CQ completely
>
>    2.  Re-arm
>
>    3.  Drain the CQ completely again
>
> So between steps 1. and 3. a single notification could handle over
> 100 WCs, if we were to budget by draining just a single array's worth
> during each step. (Btw, I'm not opposed to looping while polling
> arrays. This is just an example for discussion).
>
> As for budgeting itself, I wonder if there is a possibility of losing
> notifications.  The purpose of re-arming and then draining again is to
> ensure that any items queued after step 1. and before step 2. are
> captured, as by themselves they would never generate an upcall
> notification, IIUC.

I don't think there is a possibility for implicit loss of completions. 
HCAs that may miss completions
should respond to ib_req_notify flag IB_CQ_REPORT_MISSED_EVENTS.
/**
  * ib_req_notify_cq - Request completion notification on a CQ.
  * @cq: The CQ to generate an event for.
  * @flags:
  *   Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP
  *   to request an event on the next solicited event or next work
  *   completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS
  *   may also be |ed in to request a hint about missed events, as
  *   described below.
  *
  * Return Value:
  *    < 0 means an error occurred while requesting notification
  *   == 0 means notification was requested successfully, and if
  *        IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events
  *        were missed and it is safe to wait for another event.  In
  *        this case is it guaranteed that any work completions added
  *        to the CQ since the last CQ poll will trigger a completion
  *        notification event.
  *    > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed
  *        in.  It means that the consumer must poll the CQ again to
  *        make sure it is empty to avoid missing an event because of a
  *        race between requesting notification and an entry being
  *        added to the CQ.  This return value means it is possible
  *        (but not guaranteed) that a work completion has been added
  *        to the CQ since the last poll without triggering a
  *        completion notification event.
  */

Other than that, if one stops polling and requests notify - he should be 
invoked again from the
correct producer index (i.e. no missed events).

Hope this helps,

Sagi.

WARNING: multiple messages have this Message-ID (diff)

From: Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Steve Wise
	<swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>,
	Linux NFS Mailing List
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
Date: Sun, 20 Apr 2014 15:42:55 +0300	[thread overview]
Message-ID: <5353C0CF.9080806@dev.mellanox.co.il> (raw)
In-Reply-To: <593D9BFA-714E-417F-ACA0-05594290C4D1-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On 4/19/2014 7:31 PM, Chuck Lever wrote:
> Hi Sagi-
>
> On Apr 17, 2014, at 3:11 PM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
>> On 4/17/2014 5:34 PM, Steve Wise wrote:
>>
>> <SNIP>
>>> You could use a small array combined with a loop and a budget count.  So the code would
>>> grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
>>> desired budget is reached...
>> Bingo... couldn't agree more.
>>
>> Poll Arrays are a nice optimization,
> Typically, a provider's poll_cq implementation takes the CQ lock
> using spin_lock_irqsave().  My goal of using a poll array is to
> reduce the number of times the completion handler invokes
> spin_lock_irqsave / spin_unlock_irqsave pairs when draining a
> large queue.

Yes, hence the optimization.

>
>> but large arrays will just burden the stack (and might even make things worse in high workloads...)
>
> My prototype moves the poll array off the stack and into allocated
> storage.  Making that array as large as a single page would be
> sufficient for 50 or more ib_wc structures on a platform with 4KB
> pages and 64-bit addresses.

You assume here the worst-case workload. In the sparse case you are 
carrying around a redundant storage space...
I would recommend using say 16 wc array or so.

> The xprtrdma completion handler polls twice:
>
>    1.  Drain the CQ completely
>
>    2.  Re-arm
>
>    3.  Drain the CQ completely again
>
> So between steps 1. and 3. a single notification could handle over
> 100 WCs, if we were to budget by draining just a single array's worth
> during each step. (Btw, I'm not opposed to looping while polling
> arrays. This is just an example for discussion).
>
> As for budgeting itself, I wonder if there is a possibility of losing
> notifications.  The purpose of re-arming and then draining again is to
> ensure that any items queued after step 1. and before step 2. are
> captured, as by themselves they would never generate an upcall
> notification, IIUC.

I don't think there is a possibility for implicit loss of completions. 
HCAs that may miss completions
should respond to ib_req_notify flag IB_CQ_REPORT_MISSED_EVENTS.
/**
  * ib_req_notify_cq - Request completion notification on a CQ.
  * @cq: The CQ to generate an event for.
  * @flags:
  *   Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP
  *   to request an event on the next solicited event or next work
  *   completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS
  *   may also be |ed in to request a hint about missed events, as
  *   described below.
  *
  * Return Value:
  *    < 0 means an error occurred while requesting notification
  *   == 0 means notification was requested successfully, and if
  *        IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events
  *        were missed and it is safe to wait for another event.  In
  *        this case is it guaranteed that any work completions added
  *        to the CQ since the last CQ poll will trigger a completion
  *        notification event.
  *    > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed
  *        in.  It means that the consumer must poll the CQ again to
  *        make sure it is empty to avoid missing an event because of a
  *        race between requesting notification and an entry being
  *        added to the CQ.  This return value means it is possible
  *        (but not guaranteed) that a work completion has been added
  *        to the CQ since the last poll without triggering a
  *        completion notification event.
  */

Other than that, if one stops polling and requests notify - he should be 
invoked again from the
correct producer index (i.e. no missed events).

Hope this helps,

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2014-04-20 12:43 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-14 22:22 [PATCH 0/8] NFS/RDMA patches for review Chuck Lever
2014-04-14 22:22 ` Chuck Lever
2014-04-14 22:22 ` [PATCH 1/8] xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context Chuck Lever
2014-04-14 22:22   ` Chuck Lever
2014-04-14 22:22 ` [PATCH 2/8] xprtrdma: Remove BOUNCEBUFFERS memory registration mode Chuck Lever
2014-04-14 22:22   ` Chuck Lever
2014-04-14 22:22 ` [PATCH 3/8] xprtrdma: Disable ALLPHYSICAL mode by default Chuck Lever
2014-04-14 22:22   ` Chuck Lever
2014-04-14 22:22 ` [PATCH 4/8] xprtrdma: Remove support for MEMWINDOWS registration mode Chuck Lever
2014-04-14 22:22   ` Chuck Lever
2014-04-14 22:23 ` [PATCH 5/8] xprtrdma: Simplify rpcrdma_deregister_external() synopsis Chuck Lever
2014-04-14 22:23   ` Chuck Lever
2014-04-14 22:23 ` [PATCH 6/8] xprtrdma: Make rpcrdma_ep_destroy() return void Chuck Lever
2014-04-14 22:23   ` Chuck Lever
2014-04-14 22:23 ` [PATCH 7/8] xprtrdma: Split the completion queue Chuck Lever
2014-04-14 22:23   ` Chuck Lever
2014-04-16 12:48   ` Sagi Grimberg
2014-04-16 12:48     ` Sagi Grimberg
2014-04-16 13:30     ` Steve Wise
2014-04-16 13:30       ` Steve Wise
2014-04-16 14:12       ` Sagi Grimberg
2014-04-16 14:12         ` Sagi Grimberg
2014-04-16 14:25         ` Steve Wise
2014-04-16 14:25           ` Steve Wise
2014-04-16 14:35           ` Sagi Grimberg
2014-04-16 14:35             ` Sagi Grimberg
2014-04-16 14:43             ` Steve Wise
2014-04-16 14:43               ` Steve Wise
2014-04-16 15:18               ` Sagi Grimberg
2014-04-16 15:18                 ` Sagi Grimberg
2014-04-16 15:46                 ` Steve Wise
2014-04-16 15:46                   ` Steve Wise
2014-04-16 15:08       ` Chuck Lever
2014-04-16 15:08         ` Chuck Lever
2014-04-16 15:23         ` Sagi Grimberg
2014-04-16 15:23           ` Sagi Grimberg
2014-04-16 18:21           ` Chuck Lever
2014-04-16 18:21             ` Chuck Lever
2014-04-17  7:06             ` Sagi Grimberg
2014-04-17  7:06               ` Sagi Grimberg
2014-04-17 13:55               ` Chuck Lever
2014-04-17 13:55                 ` Chuck Lever
2014-04-17 14:34                 ` Steve Wise
2014-04-17 14:34                   ` Steve Wise
2014-04-17 19:11                   ` Sagi Grimberg
2014-04-17 19:11                     ` Sagi Grimberg
2014-04-19 16:31                     ` Chuck Lever
2014-04-19 16:31                       ` Chuck Lever
2014-04-20 12:42                       ` Sagi Grimberg [this message]
2014-04-20 12:42                         ` Sagi Grimberg
2014-04-17 19:08                 ` Sagi Grimberg
2014-04-17 19:08                   ` Sagi Grimberg
2014-04-14 22:23 ` [PATCH 8/8] xprtrdma: Reduce the number of hardway buffer allocations Chuck Lever
2014-04-14 22:23   ` Chuck Lever
2014-04-15 20:15 ` [PATCH 0/8] NFS/RDMA patches for review Steve Wise
2014-04-15 20:15   ` Steve Wise
2014-04-15 20:20   ` Steve Wise
2014-04-15 20:20     ` Steve Wise

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5353C0CF.9080806@dev.mellanox.co.il \
    --to=sagig@dev.mellanox.co.il \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=swise@opengridcomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.