From: Jakub Kicinski <kuba@kernel.org>
To: Mina Almasry <almasrymina@google.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>,
	netdev@vger.kernel.org, Andrew Lunn <andrew@lunn.ch>,
	davem@davemloft.net, Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>, Simon Horman <horms@kernel.org>,
	Donald Hunter <donald.hunter@gmail.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Pavan Chebbi <pavan.chebbi@broadcom.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Joshua Washington <joshwash@google.com>,
	Harshitha Ramamurthy <hramamurthy@google.com>,
	Jian Shen <shenjian15@huawei.com>,
	Salil Mehta <salil.mehta@huawei.com>,
	Jijie Shao <shaojijie@huawei.com>,
	Sunil Goutham <sgoutham@marvell.com>,
	Geetha sowjanya <gakula@marvell.com>,
	Subbaraya Sundeep <sbhatta@marvell.com>,
	hariprasad <hkelam@marvell.com>,
	Bharat Bhushan <bbhushan2@marvell.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Alexander Duyck <alexanderduyck@fb.com>,
	kernel-team@meta.com,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Joe Damato <joe@dama.to>, David Wei <dw@davidwei.uk>,
	Willem de Bruijn <willemb@google.com>,
	Breno Leitao <leitao@debian.org>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Jonathan Corbet <corbet@lwn.net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Date: Thu, 16 Oct 2025 18:40:31 -0700	[thread overview]
Message-ID: <20251016184031.66c92962@kernel.org> (raw)
In-Reply-To: <CAHS8izOnzxbSuW5=aiTAUja7D2ARgtR13qYWr-bXNYSCvm5Bbg@mail.gmail.com>

On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> I think what you're saying is what I was trying to say, but you said
> it more eloquently and generically correct. I'm not familiar with the
> GRO packing you're referring to, so I just assumed the 'buffer sizes
> actually posted to the NIC' are the 'buffer sizes we end up seeing in
> the skb frags'.

I don't think that code path exists today; the buffers posted are the
frags in the skb. But that's easily fixable.

> I guess what I'm trying to say in a different way, is: there are lots
> of buffer sizes in the rx path, AFAICT, at least:
> 
> 1. The size of the allocated netmems from the pp.
> 2. The size of the buffers posted to the NIC (which will be different
> from #1 if the driver uses page_pool_fragment_netmem or some other
> trick, like hns3 does).
> 3. The size of the frags that end up in the skb (which will be
> different from #2 for GRO/other things I don't fully understand).
> 
> ...and I'm not sure what rx-buf-len should actually configure. My
> thinking is that it probably should configure #3, since that is what
> the user cares about; I agree with that.
> 
> IIRC when I last looked at this a few weeks ago, I think as written
> this patch series makes rx-buf-len actually configure #1.

#1 or #2. #1 for otx2. For the RFC bnxt implementation they were
equivalent. But hns3's reading would be that it's #2.

From the user's PoV neither #1 nor #2 is particularly meaningful.
Assuming the driver can fragment, #1 only configures the memory
accounting blocks. #2 configures the buffers passed to the HW, but
some HW can pack payloads into a single buf to save memory. Which
means that if the previous frame was small and ate some of a page,
a subsequent large frame of size M may not fit into a single buf of
size X, even if M < X.
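
To make that accounting concrete, a tiny sketch (made-up helper, not
a kernel API):

#include <linux/types.h>

/* Illustration only: does a frame of size m fit in the space left in
 * a buffer of size x after "used" bytes were consumed by a previous
 * frame?
 */
static bool frame_fits(unsigned int x, unsigned int used, unsigned int m)
{
        return m <= x - used;
}

/* x = 16384, used = 1200, m = 15500: m < x, but m > x - used (15184),
 * so the frame spills into a second buffer despite m < x.
 */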

So I think the full set of parameters we should define would be
what you defined as #1 and #2. And on top of that we need some kind of
min alignment enforcement. David Wei mentioned that one of his main use
cases is ZC of a buffer which is then sent to storage, which has strict
alignment requirements. And some NICs will internally fragment the
page.. Maybe let's define the expected device behavior..

Device models
=============
Assume we receive two 5kB packets; "x" means bytes from the first
packet, "y" means bytes from the second packet.

A. Basic-scatter
----------------
A packet uses one or more buffers, i.e. a 1:n mapping between packets
and buffers.
                       unused space
                      v
 1kB      [xx] [xx] [x ] [yy] [yy] [y ]
16kB      [xxxxx            ] [yyyyy             ]

B. Multi-packet
---------------
The configurations above are still possible, but we can configure 
the device to place multiple packets in a large page:
 
                 unused space
                v
16kB, 2kB [xxxxx |yyyyy |...] [..................]
      ^
      alignment / stride

We can probably assume that this model always comes with alignment
because DMA'ing frames at odd offsets is a bad idea. Also note that
packets smaller than the alignment can get scattered to multiple
bufs.
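
To make the stride arithmetic concrete, a tiny sketch for model B
(hypothetical helper, purely illustrative):

#include <linux/align.h>        /* ALIGN() */

/* Model B sketch: the next packet is placed at the next stride
 * boundary in the current large buffer; a fresh buffer is used when
 * not enough room is left.
 */
static unsigned int next_stride_offset(unsigned int cur_off,
                                       unsigned int stride)
{
        return ALIGN(cur_off, stride);
}

/* 16kB buffer, 2kB stride: a 5kB packet written at offset 0 ends at
 * 5120, so the next packet starts at ALIGN(5120, 2048) = 6144.
 */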

C. Multi-packet HW-GRO
----------------------
For completeness, I guess. We need a third packet here. Assume the
x-packet and z-packet are from the same flow and GRO session, and the
y-packet is not. (Good?) HW-GRO gives us out-of-order placement, and
hopefully in this case we do want to pack:

16kB, 2kB [xxxxxzzzzz |.......] [yyyyy.............]
                     ^
      alignment / stride


End of sidebar. I think / hope these are all practical buffer layouts
we need to care about.


What does the user care about? Presumably three things:
 a) efficiency of memory use (larger pages == more chance of low fill)
 b) max size of a buffer (larger buffer = fewer iovecs to pass around)
 c) alignment
I don't think we can make these map 1:1 to any of the knobs we discussed
at the start. (b) is really neither #1 (if the driver fragments) nor #2
(if SW GRO can glue frags back together).

We could simply let the user control #1 - basically the user's setting
overrides the places where the driver would previously use PAGE_SIZE.
I think this is what Stan suggested long ago as well.
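
A rough sketch of what that could look like in a driver, assuming a
hypothetical per-queue struct and cfg_rx_page_size field (not an
existing API):

#include <linux/mm.h>                   /* PAGE_SIZE, get_order() */
#include <net/page_pool/types.h>        /* struct page_pool_params */

/* Hypothetical per-queue state, only for this sketch. */
struct my_rx_queue {
        unsigned int cfg_rx_page_size;  /* 0 == unset, use PAGE_SIZE */
        unsigned int ring_size;
};

/* Wherever the driver used to hard-code order 0 (PAGE_SIZE), derive
 * the page pool order from the user-configured size instead.
 */
static void rxq_fill_pp_params(const struct my_rx_queue *rxq,
                               struct page_pool_params *pp)
{
        unsigned int size = rxq->cfg_rx_page_size ?: PAGE_SIZE;

        pp->order = get_order(size);    /* 0 for 4kB, 2 for 16kB, ... */
        pp->pool_size = rxq->ring_size;
}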

But I wonder if the user still needs to know #2 (rx-buf-len) because,
practically speaking, setting the page size to more than 4x rx-buf-len
likely means a lot more fragmentation for little extra aggregation..?
Though, admittedly, I think the user only needs to know max-rx-buf-len,
not necessarily set it.

The last knob is alignment / reuse. To allow multiple packets in
one buffer we probably need to distinguish the following cases, to
cater to sufficiently clever adapters:
 - previous and next packets are from the same flow and
   - within one GRO session
   - previous had PSH set (or closed the GRO for another reason,
     this is to allow realigning the buffer on GRO session close)
  or
   - the device doesn't know further distinctions / HW-GRO
 - previous and next are from different flows
And the actions (for each case separately) are one of:
 - no reuse allowed (release buffer = -1?)
 - reuse but must align (align to = N)
 - reuse don't align (pack = 0)

So, to restate, do we need:
 - "page order" control
 - max-rx-buf-len
 - 4 alignment knobs?
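
Purely illustrative, a sketch of how those knobs could be grouped per
queue (none of these names exist anywhere, they just enumerate the
proposal):

enum rx_reuse_action {
        RX_REUSE_NONE,          /* no reuse allowed, release the buffer */
        RX_REUSE_ALIGNED,       /* reuse but realign to "align" */
        RX_REUSE_PACKED,        /* reuse with no alignment */
};

struct rx_queue_buf_cfg {
        unsigned int page_order;        /* #1: size of pp allocations */
        unsigned int max_rx_buf_len;    /* #2: read-only device limit */
        unsigned int align;             /* stride for RX_REUSE_ALIGNED */

        /* one action per case distinguished above */
        enum rx_reuse_action same_flow_in_gro;
        enum rx_reuse_action same_flow_gro_closed;
        enum rx_reuse_action same_flow_no_gro_info;
        enum rx_reuse_action other_flow;
};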

Corner cases
============
I. Non-power of 2 buffer sizes
------------------------------
Looks like multiple devices are limited by the width of their length
fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
Should we allow applications to configure the buffer to
"power of 2 - alignment"? It will probably annoy the page pool code
a bit. I guess for now we should just make sure that the uAPI doesn't
bake in the idea that buffers are always a power of 2.
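
If we did allow it, the clamping arithmetic could look something like
this (hypothetical helper, just a sketch):

#include <linux/math.h>         /* rounddown() */
#include <linux/minmax.h>       /* min() */
#include <linux/sizes.h>        /* SZ_32K */

/* Clamp a requested length for a device whose length field tops out
 * at 32kB - 1, keeping the result a multiple of the alignment rather
 * than forcing a power of 2.
 */
static unsigned int clamp_rx_buf_len(unsigned int req, unsigned int align)
{
        unsigned int hw_max = SZ_32K - align;

        return rounddown(min(req, hw_max), align);
}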

II. Fractional page sizes
-------------------------
If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
should we support chunking devmem/io_uring memory into less than a
PAGE_SIZE?
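
For regular pages the existing page pool frag API can already express
this; a sketch (devmem / io_uring providers would need the equivalent
through the netmem variants):

#include <linux/sizes.h>                /* SZ_16K */
#include <net/page_pool/helpers.h>      /* page_pool_dev_alloc_frag() */

/* On a 64kB PAGE_SIZE system, hand out 16kB rx buffers by fragmenting
 * order-0 page pool pages instead of giving each buffer a whole page.
 */
static struct page *rx_alloc_16k_chunk(struct page_pool *pool,
                                       unsigned int *offset)
{
        return page_pool_dev_alloc_frag(pool, offset, SZ_16K);
}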

