Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

From: Doug Ledford <doug.ledford@hpe.com>
To: Michael Margolin <mrgolin@amazon.com>
Cc: <jgg@nvidia.com>, <leon@kernel.org>, <linux-rdma@vger.kernel.org>,
	<sleybo@amazon.com>, <matua@amazon.com>, <gal.pressman@linux.dev>,
	"Yonatan Nachum" <ynachum@amazon.com>
Subject: Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
Date: Thu, 30 Apr 2026 13:09:36 -0600	[thread overview]
Message-ID: <d3994658-d679-4205-8b59-a6888bcbc144@hpe.com> (raw)
In-Reply-To: <20260430121833.GA30363@dev-dsk-mrgolin-1c-b2091117.eu-west-1.amazon.com>

[-- Attachment #1.1: Type: text/plain, Size: 8076 bytes --]

On 4/30/26 7:18 AM, Michael Margolin wrote:
> On Wed, Apr 29, 2026 at 06:50:54PM -0600, Doug Ledford wrote:
>> On 4/16/26 4:23 PM, Michael Margolin wrote:
>>> Add core infrastructure for Completion Counters, a light-weight
>>> alternative to polling CQ for tracking operation completions.
>>>
>>> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
>>> set, inc and read methods for both success and error counters. Add a
>>> QP attach method on the QP object to associate a completion counter
>>> with a queue pair.
>>>
>>> The create handler constructs umem from user-provided VA or dmabuf for
>>> each counter, following the CQ buffer pattern.
>>
>> Description here doesn't match implementation.  The umem or dmabuf
>> is optional, while this reads that they are the only two options.
>> If neither is passed in, then the counter is on the hardware and the
>> read operation is used to get the value (as per the code anyway).
> 
> Thanks, I'll make that path more clear in the commit message.
>>
>> Which raises a different scenario our hardware enables.  We can pass
>> in a umem on create, but that doesn't mean the counter exists in
>> umem, it exists on the device and it is copied to umem.  If you copy
>> it on every counter update, that kills PCI-e usage, so we have an
> 
> Why would it load PCIe more than writing CQEs into a CQ?

You don't necessarily signal every single completion, but updating umem 
with every single counter update has that effect.  It's just unnecessary 
PCI-e bandwidth consumed.  And it happens to be in addition to any other 
CQE updates, etc.  From our experience, it adds up when you have lots of 
counters.  Writing every update out for 1,000s of counters doesn't scale 
well.  So while the API works well for you as it is, no doubt if we show 
up in a few months with our hardware wanting to do something similar, we 
will be told "you should use the official interface for this", so we 
need this interface to be acceptable to our use patterns also.

>> option to use a trigger to only update on a periodic basis (but then
>> user space authors start polling on the umem location and killing
>> CPU cycles, so this option is not preferred), or there is a wait
>> option where you can set the target and then in your app use a wait
>> call to wait for the count to be reached (we've found this is about
>> the only performant way to implement these counters).
>>
>> Also, we don't really attach counters to QPs.  That isn't usually
>> what we care about counting.  Given that our EPs are not connected,
>> counters on it are usually only useful for recv operations where you
>> can get aggregate data for a given EP.  For send, it is often that
>> we really want counters on a per-flow basis knowing that we have
>> many flows that go through that one EP (soon to be QP).  So, for us,
>> we create a counter, then during our send operations, if we want a
>> specific transfer to be included in a specific counter, it's flagged
>> in the command we send to the hardware for that send operation.
>> That implies that a proper place to hang a list of counters is
>> probably off of an AH instead of a QP for us.
>>
>> I think we can extend this API to suit our needs, relax some of the
>> current restrictions/assumptions, and be good.  But, as this is a
>> user visible API, if it's taken as-is, I would suggest that the
>> rdma-core portion be marked as experimental until we've made the
>> changes needed for our hardware in order to avoid user API churn.
>>
>> These changes could be summed up as:
>>
>> 1) Make qp attachment optional
> 
> The attachment is already a separate call that can be avoided.

Yes. but the code that tells the user whether or not it is supported 
will refuse to recognize the feature without the attach_qp function 
pointer being registered.  So, while it may be optional to attach the 
qp, the routine to attach it to a qp is not optional.  I was referring 
to that.  I mean, we could make a stub, but it would likely just return 
-EOPNOTSUPP or somesuch.  And, the API as it stands, doesn't have any 
way to consume a counter without attaching it to a QP.  In our case, we 
will be needing to add an extension to the already extended AH that we 
will need for UET and in that we will attach the counter handle to 
specific data transfer commands and that's how our counters will get 
updated.  For this patchset to truly make attaching the counter to a qp 
optional, you would need to add another way to consume it. So you say 
it's optional in this patchset, but it's really not as far as I can tell.

>> 2) Extend create verb to differentiate between on-card counter with
>> umem target and in-umem counter
> 
> Can you elaborate on the extension you have on your mind? This seems to
> me as a totally driver-device level implementation detail. EFA for
> instance has device level counters that are being synced into the
> provided memory on each update. Others may choose a different sync
> strategy.

Allow me to backtrack and rephrase somewhat.  My original comment was 
brought out by the man pages that say, in a nutshell, the counter 
resides in umem but you can't modify it directly, you must use the 
accessor functions to modify it.  This implies to readers that the 
actual counter is in umem.  I think that's inaccurate.  As you say, you 
sync the counter to umem.  The counter resides on the device.  This is 
really why they have to use the accessors, otherwise any direct updates 
would just get overwritten later.

So allow me to rephrase as: Remove the wording in the man pages that the 
counter is in umem (as it really isn't), then add some optional create 
flags for sync method and frequency. With the counters really being on 
the device and synced to umem, optional sync triggers makes sense. 
Doing it on every update could be the default, but I can also see doing 
a sync on: timer interval, count interval, specific trigger event of 
another sort, etc.  Or, conversely, just be prepared for us to bring the 
optional sync triggers as an extension to the base API later.

>> 3) Extend create verb to pass in optional trigger or wait capability
>> to perform limited umem updates based upon passed in option
> 
> I think this can be vendor specific extension rather than a common
> interface. Providers that want to support this mode can easily add
> their own "update frequency" attribute in create ioctl or introduce
> a "sync" verb that will do what's needed for the sequential read to
> return an up-to-date value.

It's actually a useful API (from experience), so I would prefer it 
didn't have to be vendor specific, aka we don't want people having to 
custom code to our hardware for something generally useful.  I wouldn't 
say it should be required from all vendors, but I'm reluctant to make it 
vendor specific instead of just an optional extension to the base API. 
I am, however, perfectly happy to provide the extension to the base API 
as opposed to requiring it be part of the initial patchset.

>> 4) Modify read operation so that it can either return the value
>> directly or just trigger an async update of a buffer backed counter
>> (especially useful if the umem counter is on a GPU, is set for a
>> triggered update, and you just want to force an immediate async
>> update)
> 
> See my suggestion above. I think what you describe here should be a
> separate command.

We can add a flush command.  Same deal as read, only instead of 
returning the value to the caller, it flushes it to whatever the 
destination is supposed to be.  Or, the semantics of read could be 
modified such that a read also triggers a flush if there is a known umem 
or gpu mem target for the counter.  Either way, although I think I might 
prefer the flush variant.

-- 
Doug Ledford <doug.ledford@hpe.com>
     GPG KeyID: B826A3330E572FDD
     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

next prev parent reply	other threads:[~2026-04-30 19:10 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin
2026-04-30  0:50   ` Doug Ledford
2026-04-30  1:49     ` Jason Gunthorpe
2026-04-30 15:38       ` Doug Ledford
2026-04-30 12:18     ` Michael Margolin
2026-04-30 19:09       ` Doug Ledford [this message]
2026-04-30 22:33         ` Sean Hefty
2026-05-04 12:51           ` Michael Margolin
2026-05-04 15:37             ` Sean Hefty
2026-04-16 21:23 ` [PATCH for-next v2 2/5] RDMA/core: Prevent destroying in-use completion counters Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 3/5] RDMA/core: Add Completion Counters to resource tracking Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 4/5] RDMA/efa: Update device interface Michael Margolin
2026-04-28 22:36 ` [PATCH for-next v2 0/5] Introduce Completion Counters Doug Ledford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d3994658-d679-4205-8b59-a6888bcbc144@hpe.com \
    --to=doug.ledford@hpe.com \
    --cc=gal.pressman@linux.dev \
    --cc=jgg@nvidia.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=matua@amazon.com \
    --cc=mrgolin@amazon.com \
    --cc=sleybo@amazon.com \
    --cc=ynachum@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox