Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Kicinski <kuba@kernel.org>
To: Mina Almasry <almasrymina@google.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>,
	netdev@vger.kernel.org, Andrew Lunn <andrew@lunn.ch>,
	davem@davemloft.net, Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>, Simon Horman <horms@kernel.org>,
	Donald Hunter <donald.hunter@gmail.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Pavan Chebbi <pavan.chebbi@broadcom.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Joshua Washington <joshwash@google.com>,
	Harshitha Ramamurthy <hramamurthy@google.com>,
	Jian Shen <shenjian15@huawei.com>,
	Salil Mehta <salil.mehta@huawei.com>,
	Jijie Shao <shaojijie@huawei.com>,
	Sunil Goutham <sgoutham@marvell.com>,
	Geetha sowjanya <gakula@marvell.com>,
	Subbaraya Sundeep <sbhatta@marvell.com>,
	hariprasad <hkelam@marvell.com>,
	Bharat Bhushan <bbhushan2@marvell.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Alexander Duyck <alexanderduyck@fb.com>,
	kernel-team@meta.com,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Joe Damato <joe@dama.to>, David Wei <dw@davidwei.uk>,
	Willem de Bruijn <willemb@google.com>,
	Breno Leitao <leitao@debian.org>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Jonathan Corbet <corbet@lwn.net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Date: Thu, 16 Oct 2025 18:40:31 -0700	[thread overview]
Message-ID: <20251016184031.66c92962@kernel.org> (raw)
In-Reply-To: <CAHS8izOnzxbSuW5=aiTAUja7D2ARgtR13qYWr-bXNYSCvm5Bbg@mail.gmail.com>

On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> I think what you're saying is what I was trying to say, but you said
> it more eloquently and genetically correct. I'm not familiar with the
> GRO packing you're referring to so. I just assumed the 'buffer sizes
> actually posted to the NIC' are the 'buffer sizes we end up seeing in
> the skb frags'.

I don't think that code path exists today, buffers posted are frags
in the skb. But that's easily fixable.

> I guess what I'm trying to say in a different way, is: there are lots
> of buffer sizes in the rx path, AFAICT, at least:
> 
> 1. The size of the allocated netmems from the pp.
> 2. The size of the buffers posted to the NIC (which will be different
> from #1 if the page_pool_fragment_netmem or some other trick like
> hns3).
> 3. The size of the frags that end up in the skb (which will be
> different from #2 for GRO/other things I don't fully understand).
> 
> ...and I'm not sure what rx-buf-len should actually configure. My
> thinking is that it probably should configure #3, since that is what
> the user cares about, I agree with that.
> 
> IIRC when I last looked at this a few weeks ago, I think as written
> this patch series makes rx-buf-len actually configure #1.

#1 or #2. #1 for otx2. For the RFC bnxt implementation they were
equivalent. But hns3's reading would be that it's #2.

From user PoV neither #1 nor #2 is particularly meaningful.
Assuming driver can fragment - #1 only configures memory accounting
blocks. #2 configures buffers passed to the HW, but some HW can pack
payloads into a single buf to save memory. Which means that if previous
frame was small and ate some of a page, subsequent large frame of
size M may not fit into a single buf of size X, even if M < X.

So I think the full set of parameters we should define would be
what you defined as #1 and #2. And on top of that we need some kind of
min alignment enforcement. David Wei mentioned that one of his main use
cases is ZC of a buffer which is then sent to storage, which has strict
alignment requirements. And some NICs will internally fragment the
page.. Maybe let's define the expected device behavior..

Device models
=============
Assume we receive 2 5kB packets, "x" means bytes from first packet,
"y" means bytes from the second packet.

A. Basic-scatter
----------------
Packet uses one or more buffers, so 1:n mapping between packets and
buffers.
                       unused space
                      v
 1kB      [xx] [xx] [x ] [yy] [yy] [y ]
16kB      [xxxxx            ] [yyyyy             ]

B. Multi-packet
---------------
The configurations above are still possible, but we can configure 
the device to place multiple packets in a large page:

                 unused space
                v
16kB, 2kB [xxxxx |yyyyy |...] [..................]
      ^
      alignment / stride

We can probably assume that this model always comes with alignment
cause DMA'ing frames at odd offsets is a bad idea. And also note
that packets smaller that alignment can get scattered to multiple
bufs.

C. Multi-packet HW-GRO
----------------------
For completeness, I guess. We need a third packet here. Assume x-packet
and z-packet are from the same flow and GRO session, y-packet is not.
(Good?) HW-GRO gives us out of order placement and hopefully in this
case we do want to pack:

16kB, 2kB [xxxxxzzzzz |.......] [xxxxx.............]
                     ^
      alignment / stride

End of sidebar. I think / hope these are all practical buffer layouts
we need to care about.

What does user care about? Presumably three things:
 a) efficiency of memory use (larger pages == more chance of low fill)
 b) max size of a buffer (larger buffer = fewer iovecs to pass around)
 c) alignment
I don't think we can make these map 1:1 to any of the knobs we discussed
at the start. (b) is really neither #1 (if driver fragments) nor #2 (if
SW GRO can glue back together).

We could simply let the user control #1 - basically user control
overrides the places where driver would previously use PAGE_SIZE.
I think this is what Stan suggested long ago as well.

But I wonder if user still needs to know #2 (rx-buf-len) because
practically speaking, setting page size >4x the size of rx-buf-len
is likely a lot more fragmentation for little extra aggregation.. ?
Tho, admittedly I think user only needs to know max-rx-buf-len
not necessarily set it.

The last knob is alignment / reuse. For allowing multiple packets in
one buffer we probably need to distinguish these cases to cater to
sufficiently clever adapters:
 - previous and next packets are from the same flow and
   - within one GRO session
   - previous had PSH set (or closed the GRO for another reason,
     this is to allow realigning the buffer on GRO session close)
  or
   - the device doesn't know further distinctions / HW-GRO
 - previous and next are from different flows
And the actions (for each case separately) are one of:
 - no reuse allowed (release buffer = -1?)
 - reuse but must align (align to = N)
 - reuse don't align (pack = 0)

So to restate do we need:
 - "page order" control
 - max-rx-buf-len
 - 4 alignment knobs?

Corner cases
============
I. Non-power of 2 buffer sizes
------------------------------
Looks like multiple devices are limited by width of length fields,
making max buffer size something like 32kB - 1 or 64kB - 1.
Should we allow applications to configure the buffer to 

    power of 2 - alignment 

? It will probably annoy the page pool code a bit. I guess for now
we should just make sure that uAPI doesn't bake in the idea that
buffers are always power of 2.

II. Fractional page sizes
-------------------------
If the HW has max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k
should we support hunking devmem/iouring into less than a PAGE_SIZE?

next prev parent reply	other threads:[~2025-10-17  1:40 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-13 14:54 [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 01/24] net: page_pool: sanitise allocation order Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 02/24] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 03/24] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 04/24] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 05/24] net: hns3: net: use zero " Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 06/24] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-10-13 17:12   ` Randy Dunlap
2025-10-14 12:53     ` Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 07/24] net: add rx_buf_len to netdev config Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 08/24] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 09/24] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 10/24] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 11/24] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 12/24] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 13/24] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 14/24] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 15/24] net: add queue config validation callback Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 16/24] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 17/24] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 18/24] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 19/24] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 20/24] net: wipe the setting of deactived queues Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 21/24] eth: bnxt: use queue op config validate Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 22/24] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 23/24] net: let pp memory provider to specify rx buf len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 24/24] net: validate driver supports passed qcfg params Pavel Begunkov
2025-10-13 15:03 ` [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-10-13 17:54 ` Jakub Kicinski
2025-10-14  4:41   ` Mina Almasry
2025-10-14 12:50     ` Pavel Begunkov
2025-10-15  1:41     ` Jakub Kicinski
2025-10-15 17:44       ` Mina Almasry
2025-10-17  1:40         ` Jakub Kicinski [this message]
2025-10-22 13:17           ` Dragos Tatulea
2025-10-23  0:09             ` Jakub Kicinski
2025-10-14 12:46   ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251016184031.66c92962@kernel.org \
    --to=kuba@kernel.org \
    --cc=alexanderduyck@fb.com \
    --cc=almasrymina@google.com \
    --cc=andrew@lunn.ch \
    --cc=asml.silence@gmail.com \
    --cc=bbhushan2@marvell.com \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=donald.hunter@gmail.com \
    --cc=dtatulea@nvidia.com \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=gakula@marvell.com \
    --cc=hawk@kernel.org \
    --cc=hkelam@marvell.com \
    --cc=horms@kernel.org \
    --cc=hramamurthy@google.com \
    --cc=ilias.apalodimas@linaro.org \
    --cc=joe@dama.to \
    --cc=john.fastabend@gmail.com \
    --cc=joshwash@google.com \
    --cc=kernel-team@meta.com \
    --cc=leitao@debian.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pavan.chebbi@broadcom.com \
    --cc=saeedm@nvidia.com \
    --cc=salil.mehta@huawei.com \
    --cc=sbhatta@marvell.com \
    --cc=sdf@fomichev.me \
    --cc=sgoutham@marvell.com \
    --cc=shaojijie@huawei.com \
    --cc=shenjian15@huawei.com \
    --cc=tariqt@nvidia.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.