From: Jakub Kicinski <kuba@kernel.org>
To: Mina Almasry <almasrymina@google.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>,
netdev@vger.kernel.org, Andrew Lunn <andrew@lunn.ch>,
davem@davemloft.net, Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>, Simon Horman <horms@kernel.org>,
Donald Hunter <donald.hunter@gmail.com>,
Michael Chan <michael.chan@broadcom.com>,
Pavan Chebbi <pavan.chebbi@broadcom.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>,
Joshua Washington <joshwash@google.com>,
Harshitha Ramamurthy <hramamurthy@google.com>,
Jian Shen <shenjian15@huawei.com>,
Salil Mehta <salil.mehta@huawei.com>,
Jijie Shao <shaojijie@huawei.com>,
Sunil Goutham <sgoutham@marvell.com>,
Geetha sowjanya <gakula@marvell.com>,
Subbaraya Sundeep <sbhatta@marvell.com>,
hariprasad <hkelam@marvell.com>,
Bharat Bhushan <bbhushan2@marvell.com>,
Saeed Mahameed <saeedm@nvidia.com>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
Alexander Duyck <alexanderduyck@fb.com>,
kernel-team@meta.com,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
Joe Damato <joe@dama.to>, David Wei <dw@davidwei.uk>,
Willem de Bruijn <willemb@google.com>,
Breno Leitao <leitao@debian.org>,
Dragos Tatulea <dtatulea@nvidia.com>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Jonathan Corbet <corbet@lwn.net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Date: Thu, 16 Oct 2025 18:40:31 -0700
Message-ID: <20251016184031.66c92962@kernel.org>
In-Reply-To: <CAHS8izOnzxbSuW5=aiTAUja7D2ARgtR13qYWr-bXNYSCvm5Bbg@mail.gmail.com>
On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> I think what you're saying is what I was trying to say, but you said
> it more eloquently and grammatically correctly. I'm not familiar with
> the GRO packing you're referring to, so I just assumed the 'buffer
> sizes actually posted to the NIC' are the 'buffer sizes we end up
> seeing in the skb frags'.
I don't think that code path exists today; the buffers posted end up
as the frags in the skb. But that's easily fixable.
> I guess what I'm trying to say in a different way, is: there are lots
> of buffer sizes in the rx path, AFAICT, at least:
>
> 1. The size of the allocated netmems from the pp.
> 2. The size of the buffers posted to the NIC (which will be different
> from #1 if the driver uses page_pool_fragment_netmem or some other
> trick, like hns3 does).
> 3. The size of the frags that end up in the skb (which will be
> different from #2 for GRO/other things I don't fully understand).
>
> ...and I'm not sure what rx-buf-len should actually configure. My
> thinking is that it probably should configure #3, since that is what
> the user cares about. I agree with that.
>
> IIRC when I last looked at this a few weeks ago, I think as written
> this patch series makes rx-buf-len actually configure #1.
#1 or #2. #1 for otx2. For the RFC bnxt implementation they were
equivalent. But hns3's reading would be that it's #2.
From the user's PoV neither #1 nor #2 is particularly meaningful.
Assuming the driver can fragment, #1 only configures the memory
accounting blocks. #2 configures the buffers passed to the HW, but some
HW can pack payloads into a single buf to save memory. Which means that
if the previous frame was small and ate some of a page, a subsequent
large frame of size M may not fit into a single buf of size X, even if
M < X.
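To make the packing effect concrete, here is a small illustrative
sketch (plain Python, invented numbers, not tied to any driver's real
logic) of frames packed back to back into fixed-size buffers:

```python
# Hypothetical model: the device packs payloads back to back into
# buffers of size buf_size, carrying the write offset across frames.

def buffers_used(frame_sizes, buf_size):
    """Count how many buffers a sequence of frames consumes."""
    used, offset = 1, 0
    for size in frame_sizes:
        remaining = size
        while remaining > 0:
            if offset == buf_size:      # current buffer is full,
                used += 1               # open a new one
                offset = 0
            take = min(buf_size - offset, remaining)
            remaining -= take
            offset += take
    return used

X = 4096
# A 1kB frame eats part of the buffer first, so the following frame of
# size M = 4kB straddles two buffers even though M < X.
print(buffers_used([1024, 4096], X))  # 2
```

With an empty buffer the same 4kB frame would of course fit in one.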
So I think the full set of parameters we should define would be what
you defined as #1 and #2. On top of that we need some kind of min
alignment enforcement. David Wei mentioned that one of his main use
cases is ZC of a buffer which is then sent to storage, which has strict
alignment requirements. And some NICs will internally fragment the
page... Maybe let's define the expected device behavior...
Device models
=============
Assume we receive two 5kB packets; "x" marks bytes from the first
packet, "y" bytes from the second.
A. Basic-scatter
----------------
Each packet uses one or more buffers, so there is a 1:n mapping
between packets and buffers.
       unused space
                  v
 1kB  [xx] [xx] [x ] [yy] [yy] [y ]
16kB  [xxxxx           ] [yyyyy           ]
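The buffer count in this model is just a ceiling division; a quick
sketch (Python, sizes picked for illustration to roughly match the
diagram, not taken from any driver):

```python
import math

def scatter_buffers(pkt_len, buf_len):
    """Basic-scatter: each packet starts in fresh buffers, so it needs
    ceil(pkt_len / buf_len) of them, independent of earlier traffic."""
    return math.ceil(pkt_len / buf_len)

# One 5kB packet:
print(scatter_buffers(5 * 1024, 2 * 1024))   # 3 buffers, last half full
print(scatter_buffers(5 * 1024, 16 * 1024))  # 1 buffer, mostly unused
```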
B. Multi-packet
---------------
The configurations above are still possible, but we can configure
the device to place multiple packets in a large page:
      unused space
                 v
16kB, 2kB  [xxxxx |yyyyy |...] [..................]
                  ^
                  alignment / stride
We can probably assume that this model always comes with alignment,
because DMA'ing frames at odd offsets is a bad idea. Also note that
packets smaller than the alignment can get scattered across multiple
bufs.
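A sketch of the stride placement (Python; the sizes and the placement
rule here are illustrative assumptions, not any device's documented
behavior):

```python
# Hypothetical multi-packet placement: every packet starts on a stride
# boundary inside a large page; if the aligned offset leaves no room
# for another stride, placement moves to the next page.

def place_aligned(pkt_lens, page_size, stride):
    """Return the (page_index, byte_offset) where each packet starts."""
    placements, page, offset = [], 0, 0
    for length in pkt_lens:
        # round the write pointer up to the next stride boundary
        offset = (offset + stride - 1) // stride * stride
        if offset + stride > page_size:     # page exhausted
            page, offset = page + 1, 0
        placements.append((page, offset))
        offset += length
    return placements

# Two 5kB packets, 16kB pages, 2kB stride: the second packet starts at
# the next 2kB boundary after the first, i.e. offset 6kB of page 0.
print(place_aligned([5120, 5120], 16384, 2048))  # [(0, 0), (0, 6144)]
```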
C. Multi-packet HW-GRO
----------------------
For completeness, I guess. We need a third packet here. Assume the
x-packet and z-packet are from the same flow and GRO session, and the
y-packet is not. (Good?) HW-GRO gives us out-of-order placement, and
hopefully in this case we do want to pack:
16kB, 2kB  [xxxxxzzzzz |.......] [yyyyy.............]
                       ^
                       alignment / stride
End of sidebar. I think / hope these are all the practical buffer
layouts we need to care about.
What does the user care about? Presumably three things:
a) efficiency of memory use (larger pages == more chance of low fill)
b) max size of a buffer (larger buffer == fewer iovecs to pass around)
c) alignment
I don't think we can make these map 1:1 to any of the knobs we
discussed at the start. (b) is really neither #1 (if the driver
fragments) nor #2 (if SW GRO can glue buffers back together).
We could simply let the user control #1 - basically the user's setting
overrides the places where the driver would previously use PAGE_SIZE.
I think this is what Stan suggested long ago as well.
But I wonder if the user still needs to know #2 (rx-buf-len), because
practically speaking, setting the page size to more than 4x rx-buf-len
likely means a lot more fragmentation for little extra aggregation...?
Though, admittedly, I think the user only needs to know max-rx-buf-len,
not necessarily set it.
The last knob is alignment / reuse. To allow multiple packets in one
buffer we probably need to distinguish these cases, to cater to
sufficiently clever adapters:
- previous and next packets are from the same flow, and
  - they are within one GRO session
  - the previous packet had PSH set (or closed the GRO session for
    another reason; this is to allow realigning the buffer on GRO
    session close)
  or
  - the device doesn't know further distinctions / HW-GRO
- previous and next packets are from different flows
And the actions (for each case separately) are one of:
- no reuse allowed (release buffer = -1?)
- reuse but must align (align to = N)
- reuse don't align (pack = 0)
So to restate, do we need:
- a "page order" control
- max-rx-buf-len
- 4 alignment knobs?
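A sketch of how those knobs could be expressed, with the -1 / N / 0
encoding from the action list above; all the names and values below are
invented for illustration, nothing here is an existing uAPI:

```python
RELEASE = -1   # no reuse allowed: release the buffer after this packet
PACK = 0       # reuse without alignment
               # any N > 0 means: reuse, but align the next packet to N

# One knob per case distinguished above (hypothetical values).
align_policy = {
    "same_flow_same_gro_session": PACK,
    "same_flow_gro_closed":       4096,     # realign on GRO close
    "same_flow_no_distinction":   4096,     # device can't tell more
    "different_flow":             RELEASE,  # never share across flows
}

def next_offset(case, offset):
    """Apply a case's action to the current write offset.
    None means the buffer must be released."""
    action = align_policy[case]
    if action == RELEASE:
        return None
    if action == PACK:
        return offset
    return (offset + action - 1) // action * action  # round up to N

print(next_offset("same_flow_gro_closed", 5120))  # 8192
print(next_offset("different_flow", 5120))        # None
```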
Corner cases
============
I. Non-power of 2 buffer sizes
------------------------------
Looks like multiple devices are limited by the width of their length
fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
Should we allow applications to configure the buffer to

  power of 2 - alignment

? It will probably annoy the page pool code a bit. I guess for now we
should just make sure that the uAPI doesn't bake in the idea that
buffers are always a power of 2.
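The limits fall straight out of the length-field width; a quick check
(Python, field widths assumed for illustration):

```python
# A 15- or 16-bit length field caps the buffer size just below a power
# of two, hence limits like 32kB - 1 and 64kB - 1.

def max_buf_len(field_bits):
    return (1 << field_bits) - 1

print(max_buf_len(15))  # 32767 == 32kB - 1
print(max_buf_len(16))  # 65535 == 64kB - 1

# A "power of 2 - alignment" buffer, e.g. 32kB minus a 4kB alignment,
# is itself not a power of two:
size = (1 << 15) - 4096
print(size, size & (size - 1) != 0)  # 28672 True
```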
II. Fractional page sizes
-------------------------
If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
should we support chunking devmem/iouring memory into less than a
PAGE_SIZE?
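The arithmetic behind the question, as a trivial sketch (sizes assumed
for illustration):

```python
# 64kB kernel pages but a 16kB device limit: each page would have to be
# handed out as several sub-page chunks.
PAGE_SIZE = 64 * 1024
max_rx_buf_len = 16 * 1024
print(PAGE_SIZE // max_rx_buf_len)  # 4 chunks per page
```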
Thread overview: 37+ messages
2025-10-13 14:54 [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 01/24] net: page_pool: sanitise allocation order Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 02/24] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 03/24] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 04/24] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 05/24] net: hns3: net: use zero " Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 06/24] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-10-13 17:12 ` Randy Dunlap
2025-10-14 12:53 ` Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 07/24] net: add rx_buf_len to netdev config Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 08/24] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 09/24] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 10/24] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 11/24] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 12/24] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 13/24] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 14/24] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 15/24] net: add queue config validation callback Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 16/24] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 17/24] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 18/24] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 19/24] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 20/24] net: wipe the setting of deactived queues Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 21/24] eth: bnxt: use queue op config validate Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 22/24] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 23/24] net: let pp memory provider to specify rx buf len Pavel Begunkov
2025-10-13 14:54 ` [PATCH net-next v4 24/24] net: validate driver supports passed qcfg params Pavel Begunkov
2025-10-13 15:03 ` [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-10-13 17:54 ` Jakub Kicinski
2025-10-14 4:41 ` Mina Almasry
2025-10-14 12:50 ` Pavel Begunkov
2025-10-15 1:41 ` Jakub Kicinski
2025-10-15 17:44 ` Mina Almasry
2025-10-17 1:40 ` Jakub Kicinski [this message]
2025-10-22 13:17 ` Dragos Tatulea
2025-10-23 0:09 ` Jakub Kicinski
2025-10-14 12:46 ` Pavel Begunkov