* [RFC v1 00/22] Large rx buffer support for zcrx
@ 2025-07-28 11:04 Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
` (23 more replies)
0 siblings, 24 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
This series implements large rx buffer support for io_uring/zcrx on
top of Jakub's queue configuration changes, but it can also be used
by other memory providers. Large rx buffers can be drastically
beneficial with high-end HW-GRO-enabled cards that can coalesce
traffic into larger pages, reducing the number of frags traversing
the network stack and resulting in larger contiguous chunks of data
for userspace. Benchmarks showed up to ~30% improvement in CPU util.
For example, for a 200Gbit Broadcom NIC, comparing 4K vs 32K buffers
with napi and userspace pinned to the same CPU:
4K buffers:
packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 1.53 0.00 27.78 2.72 1.31 66.45 0.22
32K buffers:
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.69 0.00 8.26 31.65 1.83 57.00 0.57
And with napi and userspace on different CPUs:
4K buffers:
packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.10 0.00 0.50 0.00 0.50 74.50 24.40
1 4.51 0.00 44.33 47.22 2.08 1.85 0.00
32K buffers:
packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.10 0.00 0.70 0.00 1.00 43.78 54.42
1 1.09 0.00 31.95 62.91 1.42 2.63 0.00
Patch 21 allows a memory provider to pass in a queue config. The
zcrx changes are contained in a single patch, as I have already
queued most of the work making zcrx size-agnostic into my zcrx
branch. The uAPI is simple and imperative: it will use the exact
value specified by the user, if one is given. In the future we might
extend it to "choose the best size in a given range".
The remaining patches (the first 20) are from Jakub's series
implementing per-queue configuration. Quoting Jakub:
"... The direct motivation for the series is that zero-copy Rx queues would
like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
and can coalesce payloads into pages much larger than the MTU.
Enabling larger buffers globally is a bit precarious as it exposes us
to potentially very inefficient memory use. Also allocating large
buffers may not be easy or cheap under load. Zero-copy queues service
only select traffic and have pre-allocated memory so the concerns don't
apply as much.
The per-queue config has to address 3 problems:
- user API
- driver API
- memory provider API
For user API the main question is whether we expose the config via
ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
than extending the ethtool RINGS_GET API. I worry slightly that queue
GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
setting we have in ethtool which does not go via RINGS_SET is
IRQ coalescing.
My goal for the driver API was to avoid complexity in the drivers.
The queue management API has gained two ops, responsible for preparing
configuration for a given queue, and validating whether the config
is supported. The validation is used both for NIC-wide and per-queue
changes. Queue alloc/start ops have a new "config" argument which
contains the current config for a given queue (we use queue restart
to apply per-queue settings). Outside of queue reset paths drivers
can call netdev_queue_config() which returns the config for an arbitrary
queue. Long story short I anticipate it to be used during ndo_open.
In the core I extended struct netdev_config with per queue settings.
All in all this isn't too far from what was there in my "queue API
prototype" a few years ago ..."
Kernel branch with all dependencies:
git: https://github.com/isilence/linux.git zcrx/large-buffers
url: https://github.com/isilence/linux/tree/zcrx/large-buffers
Jakub Kicinski (20):
docs: ethtool: document that rx_buf_len must control payload lengths
net: ethtool: report max value for rx-buf-len
net: use zero value to restore rx_buf_len to default
net: clarify the meaning of netdev_config members
net: add rx_buf_len to netdev config
eth: bnxt: read the page size from the adapter struct
eth: bnxt: set page pool page order based on rx_page_size
eth: bnxt: support setting size of agg buffers via ethtool
net: move netdev_config manipulation to dedicated helpers
net: reduce indent of struct netdev_queue_mgmt_ops members
net: allocate per-queue config structs and pass them thru the queue
API
net: pass extack to netdev_rx_queue_restart()
net: add queue config validation callback
eth: bnxt: always set the queue mgmt ops
eth: bnxt: store the rx buf size per queue
eth: bnxt: adjust the fill level of agg queues with larger buffers
netdev: add support for setting rx-buf-len per queue
net: wipe the setting of deactivated queues
eth: bnxt: use queue op config validate
eth: bnxt: support per queue configuration of rx-buf-len
Pavel Begunkov (2):
net: parametrise mp open with a queue config
io_uring/zcrx: implement large rx buffer support
Documentation/netlink/specs/ethtool.yaml | 4 +
Documentation/netlink/specs/netdev.yaml | 15 ++
Documentation/networking/ethtool-netlink.rst | 7 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 135 ++++++++++++----
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 5 +-
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 9 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
drivers/net/ethernet/google/gve/gve_main.c | 9 +-
.../marvell/octeontx2/nic/otx2_ethtool.c | 6 +-
drivers/net/netdevsim/netdev.c | 8 +-
include/linux/ethtool.h | 3 +
include/net/netdev_queues.h | 83 ++++++++--
include/net/netdev_rx_queue.h | 3 +-
include/net/netlink.h | 19 +++
include/net/page_pool/memory_provider.h | 4 +-
.../uapi/linux/ethtool_netlink_generated.h | 1 +
include/uapi/linux/io_uring.h | 2 +-
include/uapi/linux/netdev.h | 2 +
io_uring/zcrx.c | 39 ++++-
net/core/Makefile | 2 +-
net/core/dev.c | 12 +-
net/core/dev.h | 12 ++
net/core/netdev-genl-gen.c | 15 ++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 92 +++++++++++
net/core/netdev_config.c | 150 ++++++++++++++++++
net/core/netdev_rx_queue.c | 54 +++++--
net/ethtool/common.c | 4 +-
net/ethtool/netlink.c | 14 +-
net/ethtool/rings.c | 14 +-
tools/include/uapi/linux/netdev.h | 2 +
32 files changed, 642 insertions(+), 92 deletions(-)
create mode 100644 net/core/netdev_config.c
--
2.49.0
^ permalink raw reply [flat|nested] 66+ messages in thread
* [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 18:11 ` Mina Almasry
2025-07-28 21:36 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
` (22 subsequent siblings)
23 siblings, 2 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Document the semantics of the rx_buf_len ethtool ring param.
Clarify its meaning in the case of HDS, where the driver may have
two separate buffer pools.
The various zero-copy TCP Rx schemes we have suffer from memory
management overhead. Specifically, applications aren't too impressed
with the number of 4kB buffers they have to juggle. Zero-copy
TCP makes the most sense with larger memory transfers, so using
16kB or 32kB buffers (with the help of HW-GRO) feels more
natural.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
Documentation/networking/ethtool-netlink.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b6e9af4d0f1b..eaa9c17a3cb1 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
driver. Driver may impose additional constraints and may not support all
attributes.
-
``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
Completion queue events (CQE) are the events posted by NIC to indicate the
completion status of a packet when the packet is sent (like send success or
@@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
header / data split feature. If a received packet size is larger than this
threshold value, header and data will be split.
+``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks the
+driver uses to receive packets. If the device uses different memory pools for
+headers and payload, this setting may control the size of the header buffers
+but must control the size of the payload buffers.
+
CHANNELS_GET
============
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 02/22] net: ethtool: report max value for rx-buf-len
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-29 5:00 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
` (21 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Unlike most of our APIs the rx-buf-len param does not have an associated
max value. In theory the user could set this value pretty high, but in
practice most NICs have limits due to the width of the length fields
in their descriptors.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
Documentation/netlink/specs/ethtool.yaml | 4 ++++
Documentation/networking/ethtool-netlink.rst | 1 +
drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++-
include/linux/ethtool.h | 2 ++
include/uapi/linux/ethtool_netlink_generated.h | 1 +
net/ethtool/rings.c | 5 +++++
6 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 72a076b0e1b5..cb96b4e7093f 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -361,6 +361,9 @@ attribute-sets:
-
name: hds-thresh-max
type: u32
+ -
+ name: rx-buf-len-max
+ type: u32
-
name: mm-stat
@@ -1811,6 +1814,7 @@ operations:
- rx-jumbo
- tx
- rx-buf-len
+ - rx-buf-len-max
- tcp-data-split
- cqe-size
- tx-push
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index eaa9c17a3cb1..b7a99dfdffa9 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -893,6 +893,7 @@ Kernel response contents:
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN_MAX`` u32 max size of rx buffers
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 45b8c9230184..7bdef64926c8 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -376,6 +376,7 @@ static void otx2_get_ringparam(struct net_device *netdev,
ring->tx_max_pending = Q_COUNT(Q_SIZE_MAX);
ring->tx_pending = qs->sqe_cnt ? qs->sqe_cnt : Q_COUNT(Q_SIZE_4K);
kernel_ring->rx_buf_len = pfvf->hw.rbuf_len;
+ kernel_ring->rx_buf_len_max = 32768;
kernel_ring->cqe_size = pfvf->hw.xqe_size;
}
@@ -398,7 +399,7 @@ static int otx2_set_ringparam(struct net_device *netdev,
/* Hardware supports max size of 32k for a receive buffer
* and 1536 is typical ethernet frame size.
*/
- if (rx_buf_len && (rx_buf_len < 1536 || rx_buf_len > 32768)) {
+ if (rx_buf_len && (rx_buf_len < 1536)) {
netdev_err(netdev,
"Receive buffer range is 1536 - 32768");
return -EINVAL;
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 5e0dd333ad1f..dd9f253a56ae 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -77,6 +77,7 @@ enum {
/**
* struct kernel_ethtool_ringparam - RX/TX ring configuration
* @rx_buf_len: Current length of buffers on the rx ring.
+ * @rx_buf_len_max: Max length of buffers on the rx ring.
* @tcp_data_split: Scatter packet headers and data to separate buffers
* @tx_push: The flag of tx push mode
* @rx_push: The flag of rx push mode
@@ -89,6 +90,7 @@ enum {
*/
struct kernel_ethtool_ringparam {
u32 rx_buf_len;
+ u32 rx_buf_len_max;
u8 tcp_data_split;
u8 tx_push;
u8 rx_push;
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index aa8ab5227c1e..1a76e6789e33 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -164,6 +164,7 @@ enum {
ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
ETHTOOL_A_RINGS_HDS_THRESH,
ETHTOOL_A_RINGS_HDS_THRESH_MAX,
+ ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
__ETHTOOL_A_RINGS_CNT,
ETHTOOL_A_RINGS_MAX = (__ETHTOOL_A_RINGS_CNT - 1)
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index aeedd5ec6b8c..5e872ceab5dd 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -105,6 +105,9 @@ static int rings_fill_reply(struct sk_buff *skb,
ringparam->tx_pending))) ||
(kr->rx_buf_len &&
(nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN, kr->rx_buf_len))) ||
+ (kr->rx_buf_len_max &&
+ (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
+ kr->rx_buf_len_max))) ||
(kr->tcp_data_split &&
(nla_put_u8(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT,
kr->tcp_data_split))) ||
@@ -281,6 +284,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
err_attr = tb[ETHTOOL_A_RINGS_TX];
else if (kernel_ringparam.hds_thresh > kernel_ringparam.hds_thresh_max)
err_attr = tb[ETHTOOL_A_RINGS_HDS_THRESH];
+ else if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max)
+ err_attr = tb[ETHTOOL_A_RINGS_RX_BUF_LEN];
else
err_attr = NULL;
if (err_attr) {
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 03/22] net: use zero value to restore rx_buf_len to default
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-29 5:03 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
` (20 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Distinguish between rx_buf_len being driver default vs user config.
Use 0 as a special value meaning "unset" or "restore driver default".
This will be necessary later on to configure it per-queue, but
the ability to restore defaults may be useful in itself.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
Documentation/networking/ethtool-netlink.rst | 2 +-
drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 +++
include/linux/ethtool.h | 1 +
net/ethtool/rings.c | 2 +-
4 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b7a99dfdffa9..723f8e1a33a7 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -974,7 +974,7 @@ threshold value, header and data will be split.
``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks the
driver uses to receive packets. If the device uses different memory pools for
headers and payload, this setting may control the size of the header buffers
-but must control the size of the payload buffers.
+but must control the size of the payload buffers. Setting to 0 restores the driver default.
CHANNELS_GET
============
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 7bdef64926c8..1a74a7b81ac1 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -396,6 +396,9 @@ static int otx2_set_ringparam(struct net_device *netdev,
if (ring->rx_mini_pending || ring->rx_jumbo_pending)
return -EINVAL;
+ if (!rx_buf_len)
+ rx_buf_len = OTX2_DEFAULT_RBUF_LEN;
+
/* Hardware supports max size of 32k for a receive buffer
* and 1536 is typical ethernet frame size.
*/
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index dd9f253a56ae..bbc5c485bfbf 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -77,6 +77,7 @@ enum {
/**
* struct kernel_ethtool_ringparam - RX/TX ring configuration
* @rx_buf_len: Current length of buffers on the rx ring.
+ * Setting to 0 means reset to driver default.
* @rx_buf_len_max: Max length of buffers on the rx ring.
* @tcp_data_split: Scatter packet headers and data to separate buffers
* @tx_push: The flag of tx push mode
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 5e872ceab5dd..628546a1827b 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -139,7 +139,7 @@ const struct nla_policy ethnl_rings_set_policy[] = {
[ETHTOOL_A_RINGS_RX_MINI] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_RX_JUMBO] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_TX] = { .type = NLA_U32 },
- [ETHTOOL_A_RINGS_RX_BUF_LEN] = NLA_POLICY_MIN(NLA_U32, 1),
+ [ETHTOOL_A_RINGS_RX_BUF_LEN] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_TCP_DATA_SPLIT] =
NLA_POLICY_MAX(NLA_U8, ETHTOOL_TCP_DATA_SPLIT_ENABLED),
[ETHTOOL_A_RINGS_CQE_SIZE] = NLA_POLICY_MIN(NLA_U32, 1),
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 04/22] net: clarify the meaning of netdev_config members
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (2 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 21:44 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
` (19 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
hds_thresh and hds_config are both inside struct netdev_config
but have quite different semantics. hds_config is the user config
with ternary semantics (on/off/unset). hds_thresh is a straight
up value, populated by the driver at init and only modified by
user space. We don't expect the drivers to have to pick a special
hds_thresh value based on other configuration.
The two approaches have different advantages and downsides.
hds_thresh ("direct value") gives the core easy access to current
device settings, but there's no way to express whether the value
comes from the user. It also requires initialization by the driver.
hds_config ("user config values") tells us what the user wanted, but
doesn't give us the current value in the core.
Try to explain this a bit in the comments, so that we make a
conscious choice about which semantics we expect for new values.
Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
returns a hds_thresh value always as 0.") added the setting for the
benefit of netdevsim, which doesn't touch the value at all on get.
Again, this is just to clarify the intention; it shouldn't cause any
functional change.
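To illustrate the two semantics with a sketch (not code from this
patch; enable_hds, ring and example_hds_wanted() are hypothetical
driver-local names, while the ETHTOOL_TCP_DATA_SPLIT_* values are the
existing uapi enum):

	/* Direct value: initialized by the driver, then owned by
	 * user space; consume as-is.
	 */
	ring->hds_thresh = dev->cfg->hds_thresh;

	/* User config value: ternary, the driver must handle "unset" */
	switch (dev->cfg->hds_config) {
	case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
		enable_hds = true;
		break;
	case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
		enable_hds = false;
		break;
	case ETHTOOL_TCP_DATA_SPLIT_UNKNOWN:
		/* unset: driver's choice, may vary with other config */
		enable_hds = example_hds_wanted(dev);
		break;
	}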
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/netdev_queues.h | 19 +++++++++++++++++--
net/ethtool/common.c | 3 ++-
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index ba2eaf39089b..81df0794d84c 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -6,11 +6,26 @@
/**
* struct netdev_config - queue-related configuration for a netdev
- * @hds_thresh: HDS Threshold value.
- * @hds_config: HDS value from userspace.
*/
struct netdev_config {
+ /* Direct value
+ *
+ * Driver default is expected to be fixed, and set in this struct
+ * at init. From that point on user may change the value. There is
+ * no explicit way to "unset" / restore driver default.
+ */
+ /** @hds_thresh: HDS Threshold value (ETHTOOL_A_RINGS_HDS_THRESH).
+ */
u32 hds_thresh;
+
+ /* User config values
+ *
+ * Contain user configuration. If "set" driver must obey.
+ * If "unset" driver is free to decide, and may change its choice
+ * as other parameters change.
+ */
+ /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
+ */
u8 hds_config;
};
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index eb253e0fd61b..a87298f659f5 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -825,12 +825,13 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
memset(param, 0, sizeof(*param));
memset(kparam, 0, sizeof(*kparam));
+ kparam->hds_thresh = dev->cfg->hds_thresh;
+
param->cmd = ETHTOOL_GRINGPARAM;
dev->ethtool_ops->get_ringparam(dev, param, kparam, extack);
/* Driver gives us current state, we want to return current config */
kparam->tcp_data_split = dev->cfg->hds_config;
- kparam->hds_thresh = dev->cfg->hds_thresh;
}
static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 05/22] net: add rx_buf_len to netdev config
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (3 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 21:50 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
` (18 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Add rx_buf_len to configuration maintained by the core.
Use "three-state" semantics where 0 means "driver default".
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/netdev_queues.h | 4 ++++
net/ethtool/common.c | 1 +
net/ethtool/rings.c | 2 ++
3 files changed, 7 insertions(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 81df0794d84c..eb3a5ac823e6 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -24,6 +24,10 @@ struct netdev_config {
* If "unset" driver is free to decide, and may change its choice
* as other parameters change.
*/
+ /** @rx_buf_len: Size of buffers on the Rx ring
+ * (ETHTOOL_A_RINGS_RX_BUF_LEN).
+ */
+ u32 rx_buf_len;
/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
*/
u8 hds_config;
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index a87298f659f5..8fdffc77e981 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -832,6 +832,7 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
/* Driver gives us current state, we want to return current config */
kparam->tcp_data_split = dev->cfg->hds_config;
+ kparam->rx_buf_len = dev->cfg->rx_buf_len;
}
static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 628546a1827b..6a74e7e4064e 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -41,6 +41,7 @@ static int rings_prepare_data(const struct ethnl_req_info *req_base,
return ret;
data->kernel_ringparam.tcp_data_split = dev->cfg->hds_config;
+ data->kernel_ringparam.rx_buf_len = dev->cfg->rx_buf_len;
data->kernel_ringparam.hds_thresh = dev->cfg->hds_thresh;
dev->ethtool_ops->get_ringparam(dev, &data->ringparam,
@@ -302,6 +303,7 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
return -EINVAL;
}
+ dev->cfg_pending->rx_buf_len = kernel_ringparam.rx_buf_len;
dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (4 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
` (17 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
inside struct bnxt. This will allow configuring the page size
at runtime in subsequent patches.
The MSS size calculation for older chips continues to use the constant.
I intend to support the configuration only on more recent HW; it looks
like setting this per queue won't work on older chips, and per-queue
configuration is the ultimate goal.
This patch should not change the current behavior, as the value
read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 27 ++++++++++---------
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 4 +--
3 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 2cb3185c442c..274ebd63bdd9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -895,7 +895,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
{
- return rxr->need_head_pool || PAGE_SIZE > BNXT_RX_PAGE_SIZE;
+ return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
}
static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -905,9 +905,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
{
struct page *page;
- if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
+ if (PAGE_SIZE > bp->rx_page_size) {
page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
- BNXT_RX_PAGE_SIZE);
+ bp->rx_page_size);
} else {
page = page_pool_dev_alloc_pages(rxr->page_pool);
*offset = 0;
@@ -1139,9 +1139,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
bp->rx_dir);
- skb = napi_build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
+ skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
if (!skb) {
page_pool_recycle_direct(rxr->page_pool, page);
return NULL;
@@ -1173,7 +1173,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
bp->rx_dir);
if (unlikely(!payload))
@@ -1187,7 +1187,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
skb_mark_for_recycle(skb);
off = (void *)data_ptr - page_address(page);
- skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE);
+ skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
payload + NET_IP_ALIGN);
@@ -1272,7 +1272,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
cons_rx_buf->offset,
- frag_len, BNXT_RX_PAGE_SIZE);
+ frag_len, bp->rx_page_size);
} else {
skb_frag_t *frag = &shinfo->frags[i];
@@ -1297,7 +1297,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb->len -= frag_len;
skb->data_len -= frag_len;
- skb->truesize -= BNXT_RX_PAGE_SIZE;
+ skb->truesize -= bp->rx_page_size;
}
--shinfo->nr_frags;
@@ -1312,7 +1312,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
}
page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
- BNXT_RX_PAGE_SIZE);
+ bp->rx_page_size);
total_frag_len += frag_len;
prod = NEXT_RX_AGG(prod);
@@ -4448,7 +4448,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
ring = &rxr->rx_agg_ring_struct;
ring->fw_ring_id = INVALID_HW_RING_ID;
if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
- type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
+ type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
bnxt_init_rxbd_pages(ring, type);
@@ -4710,7 +4710,7 @@ void bnxt_set_ring_params(struct bnxt *bp)
bp->rx_agg_nr_pages = 0;
if (bp->flags & BNXT_FLAG_TPA || bp->flags & BNXT_FLAG_HDS)
- agg_factor = min_t(u32, 4, 65536 / BNXT_RX_PAGE_SIZE);
+ agg_factor = min_t(u32, 4, 65536 / bp->rx_page_size);
bp->flags &= ~BNXT_FLAG_JUMBO;
if (rx_space > PAGE_SIZE && !(bp->flags & BNXT_FLAG_NO_AGG_RINGS)) {
@@ -7022,7 +7022,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
if (ring_type == HWRM_RING_ALLOC_AGG) {
req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
- req->rx_buf_size = cpu_to_le16(BNXT_RX_PAGE_SIZE);
+ req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
} else {
req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -16573,6 +16573,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bp = netdev_priv(dev);
bp->board_idx = ent->driver_data;
bp->msg_enable = BNXT_DEF_MSG_ENABLE;
+ bp->rx_page_size = BNXT_RX_PAGE_SIZE;
bnxt_set_max_func_irqs(bp, max_irqs);
if (bnxt_vf_pciid(bp->board_idx))
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index fda0d3cc6227..ac841d02d7ad 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2358,6 +2358,7 @@ struct bnxt {
u16 max_tpa;
u32 rx_buf_size;
u32 rx_buf_use_size; /* useable size */
+ u16 rx_page_size;
u16 rx_offset;
u16 rx_dma_offset;
enum dma_data_direction rx_dir;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 4a6d8cb9f970..32bcc3aedee6 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
u16 cons, u8 *data_ptr, unsigned int len,
struct xdp_buff *xdp)
{
- u32 buflen = BNXT_RX_PAGE_SIZE;
+ u32 buflen = bp->rx_page_size;
struct bnxt_sw_rx_bd *rx_buf;
struct pci_dev *pdev;
dma_addr_t mapping;
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
- BNXT_RX_PAGE_SIZE * num_frags,
+ bp->rx_page_size * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp));
return skb;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (5 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
` (16 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
If the user decides to increase the buffer size for the agg ring
we need to ask the page pool for higher-order pages (e.g. 32k
buffers on a system with 4k pages require order-3 pages).
There is no need to use larger pages for header frags, so if the
user increases the size of the agg ring buffers, switch to a
separate header page automatically.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 274ebd63bdd9..55685ed60519 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3806,6 +3806,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
pp.pool_size = bp->rx_agg_ring_size;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
+ pp.order = get_order(bp->rx_page_size);
pp.nid = numa_node;
pp.napi = &rxr->bnapi->napi;
pp.netdev = bp->dev;
@@ -3822,7 +3823,9 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
rxr->page_pool = pool;
rxr->need_head_pool = page_pool_is_unreadable(pool);
+ rxr->need_head_pool |= !!pp.order;
if (bnxt_separate_head_pool(rxr)) {
+ pp.order = 0;
pp.pool_size = max(bp->rx_ring_size, 1024);
pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
pool = page_pool_create(&pp);
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (6 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
` (15 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
bnxt seems to be able to aggregate data up to 32kB without any issue.
The driver is already capable of doing this for systems with higher
order pages. While for systems with 4k pages we historically preferred
to stick to small buffers because they are easier to allocate, the
zero-copy APIs remove the allocation problem: the ZC memory is
pre-allocated and of a fixed size.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 3 ++-
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 21 ++++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index ac841d02d7ad..56aafae568f8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -758,7 +758,8 @@ struct nqe_cn {
#define BNXT_RX_PAGE_SHIFT PAGE_SHIFT
#endif
-#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
+#define BNXT_MAX_RX_PAGE_SIZE (1 << 15)
+#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
#define BNXT_MAX_MTU 9500
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index f5d490bf997e..0e225414d463 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -835,6 +835,8 @@ static void bnxt_get_ringparam(struct net_device *dev,
ering->rx_jumbo_pending = bp->rx_agg_ring_size;
ering->tx_pending = bp->tx_ring_size;
+ kernel_ering->rx_buf_len_max = BNXT_MAX_RX_PAGE_SIZE;
+ kernel_ering->rx_buf_len = bp->rx_page_size;
kernel_ering->hds_thresh_max = BNXT_HDS_THRESHOLD_MAX;
}
@@ -862,6 +864,21 @@ static int bnxt_set_ringparam(struct net_device *dev,
return -EINVAL;
}
+ if (!kernel_ering->rx_buf_len) /* Zero means restore default */
+ kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
+
+ if (kernel_ering->rx_buf_len != bp->rx_page_size &&
+ !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+ NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+ return -EINVAL;
+ }
+ if (!is_power_of_2(kernel_ering->rx_buf_len) ||
+ kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+ kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
+ return -ERANGE;
+ }
+
if (netif_running(dev))
bnxt_close_nic(bp, false, false);
@@ -874,6 +891,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
bp->rx_ring_size = ering->rx_pending;
bp->tx_ring_size = ering->tx_pending;
+ bp->rx_page_size = kernel_ering->rx_buf_len;
bnxt_set_ring_params(bp);
if (netif_running(dev))
@@ -5489,7 +5507,8 @@ const struct ethtool_ops bnxt_ethtool_ops = {
ETHTOOL_COALESCE_STATS_BLOCK_USECS |
ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
ETHTOOL_COALESCE_USE_CQE,
- .supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT |
+ .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN |
+ ETHTOOL_RING_USE_TCP_DATA_SPLIT |
ETHTOOL_RING_USE_HDS_THRS,
.get_link_ksettings = bnxt_get_link_ksettings,
.set_link_ksettings = bnxt_set_link_ksettings,
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (7 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
` (14 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
netdev_config manipulation will become slightly more complicated
soon and we will need to call it from ethtool as well as the queue
API. Encapsulate the logic into helper functions.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
net/core/Makefile | 2 +-
net/core/dev.c | 7 ++-----
net/core/dev.h | 5 +++++
net/core/netdev_config.c | 43 ++++++++++++++++++++++++++++++++++++++++
net/ethtool/netlink.c | 14 ++++++-------
5 files changed, 57 insertions(+), 14 deletions(-)
create mode 100644 net/core/netdev_config.c
diff --git a/net/core/Makefile b/net/core/Makefile
index b2a76ce33932..4db487396094 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,7 +19,7 @@ obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
obj-y += net-sysfs.o
obj-y += hotdata.o
-obj-y += netdev_rx_queue.o
+obj-y += netdev_config.o netdev_rx_queue.o
obj-$(CONFIG_PAGE_POOL) += page_pool.o page_pool_user.o
obj-$(CONFIG_PROC_FS) += net-procfs.o
obj-$(CONFIG_NET_PKTGEN) += pktgen.o
diff --git a/net/core/dev.c b/net/core/dev.c
index be97c440ecd5..757fa06d7392 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11784,10 +11784,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
if (!dev->ethtool)
goto free_all;
- dev->cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
- if (!dev->cfg)
+ if (netdev_alloc_config(dev))
goto free_all;
- dev->cfg_pending = dev->cfg;
napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
@@ -11857,8 +11855,7 @@ void free_netdev(struct net_device *dev)
return;
}
- WARN_ON(dev->cfg != dev->cfg_pending);
- kfree(dev->cfg);
+ netdev_free_config(dev);
kfree(dev->ethtool);
netif_free_tx_queues(dev);
netif_free_rx_queues(dev);
diff --git a/net/core/dev.h b/net/core/dev.h
index e93f36b7ddf3..c8971c6f1fcd 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -92,6 +92,11 @@ extern struct rw_semaphore dev_addr_sem;
extern struct list_head net_todo_list;
void netdev_run_todo(void);
+int netdev_alloc_config(struct net_device *dev);
+void __netdev_free_config(struct netdev_config *cfg);
+void netdev_free_config(struct net_device *dev);
+int netdev_reconfig_start(struct net_device *dev);
+
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
struct hlist_node hlist;
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
new file mode 100644
index 000000000000..270b7f10a192
--- /dev/null
+++ b/net/core/netdev_config.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/netdevice.h>
+#include <net/netdev_queues.h>
+
+#include "dev.h"
+
+int netdev_alloc_config(struct net_device *dev)
+{
+ struct netdev_config *cfg;
+
+ cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg)
+ return -ENOMEM;
+
+ dev->cfg = cfg;
+ dev->cfg_pending = cfg;
+ return 0;
+}
+
+void __netdev_free_config(struct netdev_config *cfg)
+{
+ kfree(cfg);
+}
+
+void netdev_free_config(struct net_device *dev)
+{
+ WARN_ON(dev->cfg != dev->cfg_pending);
+ __netdev_free_config(dev->cfg);
+}
+
+int netdev_reconfig_start(struct net_device *dev)
+{
+ struct netdev_config *cfg;
+
+ WARN_ON(dev->cfg != dev->cfg_pending);
+ cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg)
+ return -ENOMEM;
+
+ dev->cfg_pending = cfg;
+ return 0;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 9de828df46cd..2f1eb5748cb6 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -6,6 +6,7 @@
#include <linux/ethtool_netlink.h>
#include <linux/phy_link_topology.h>
#include <linux/pm_runtime.h>
+#include "../core/dev.h"
#include "netlink.h"
#include "module_fw.h"
@@ -891,12 +892,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
rtnl_lock();
netdev_lock_ops(dev);
- dev->cfg_pending = kmemdup(dev->cfg, sizeof(*dev->cfg),
- GFP_KERNEL_ACCOUNT);
- if (!dev->cfg_pending) {
- ret = -ENOMEM;
- goto out_tie_cfg;
- }
+ ret = netdev_reconfig_start(dev);
+ if (ret)
+ goto out_unlock;
ret = ethnl_ops_begin(dev);
if (ret < 0)
@@ -915,9 +913,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
out_ops:
ethnl_ops_complete(dev);
out_free_cfg:
- kfree(dev->cfg_pending);
-out_tie_cfg:
+ __netdev_free_config(dev->cfg_pending);
dev->cfg_pending = dev->cfg;
+out_unlock:
netdev_unlock_ops(dev);
rtnl_unlock();
out_dev:
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (8 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
` (13 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Trivial change, reduce the indent. I think the original was copied
from real NDOs. It's unnecessarily deep and makes passing struct args
problematic.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/netdev_queues.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index eb3a5ac823e6..070a1150241d 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -151,18 +151,18 @@ void netdev_stat_queue_sum(struct net_device *netdev,
* be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
- size_t ndo_queue_mem_size;
- int (*ndo_queue_mem_alloc)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- void (*ndo_queue_mem_free)(struct net_device *dev,
- void *per_queue_mem);
- int (*ndo_queue_start)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- int (*ndo_queue_stop)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
+ size_t ndo_queue_mem_size;
+ int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ void (*ndo_queue_mem_free)(struct net_device *dev,
+ void *per_queue_mem);
+ int (*ndo_queue_start)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ int (*ndo_queue_stop)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
};
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (9 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
` (12 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Create an array of config structs to store per-queue config.
Pass these structs in the queue API. Drivers can also retrieve
the config for a single queue by calling netdev_queue_config()
directly.
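For drivers that prefer the defaults-callback route, a minimal sketch
might look like the following (example_* names are hypothetical, and
qcfg fields such as rx_buf_len are only added to struct
netdev_queue_config later in the series):

static void example_queue_cfg_defaults(struct net_device *dev, int idx,
				       struct netdev_queue_config *qcfg)
{
	/* Invoked by the core before user-requested settings are
	 * applied on top, so only driver defaults belong here.
	 */
	qcfg->rx_buf_len = EXAMPLE_DEF_RX_BUF_LEN;
}

static const struct netdev_queue_mgmt_ops example_queue_mgmt_ops = {
	.ndo_queue_cfg_defaults	= example_queue_cfg_defaults,
	/* .ndo_queue_mem_alloc / _free / _start / _stop as before */
};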
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++-
drivers/net/ethernet/google/gve/gve_main.c | 9 ++--
drivers/net/netdevsim/netdev.c | 6 ++-
include/net/netdev_queues.h | 19 +++++++
net/core/dev.h | 3 ++
net/core/netdev_config.c | 58 ++++++++++++++++++++++
net/core/netdev_rx_queue.c | 11 ++--
7 files changed, 104 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 55685ed60519..c3195be0ac26 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15820,7 +15820,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
.get_base_stats = bnxt_get_base_stats,
};
-static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt_rx_ring_info *rxr, *clone;
struct bnxt *bp = netdev_priv(dev);
@@ -15988,7 +15990,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
dst->rx_agg_bmap = src->rx_agg_bmap;
}
-static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt *bp = netdev_priv(dev);
struct bnxt_rx_ring_info *rxr, *clone;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index dc35a23ec47f..dffe3ebc456b 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2429,8 +2429,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
}
-static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
- int idx)
+static int gve_rx_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_alloc_rings_cfg cfg = {0};
@@ -2451,7 +2452,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
return err;
}
-static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
+static int gve_rx_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_ring *gve_per_q_mem;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index fa5fbd97ad69..03003adc41fb 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -664,7 +664,8 @@ struct nsim_queue_mem {
};
static int
-nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
@@ -713,7 +714,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
}
static int
-nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 070a1150241d..e3e7ecf91bac 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -31,6 +31,13 @@ struct netdev_config {
/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
*/
u8 hds_config;
+
+ /** @qcfg: per-queue configuration */
+ struct netdev_queue_config *qcfg;
+};
+
+/* Same semantics as fields in struct netdev_config */
+struct netdev_queue_config {
};
/* See the netdev.yaml spec for definition of each statistic */
@@ -135,6 +142,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
*
* @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
*
+ * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
+ * defaults. Queue config structs are passed to this
+ * helper before the user-requested settings are applied.
+ *
* @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
* The new memory is written at the specified address.
*
@@ -152,12 +163,17 @@ void netdev_stat_queue_sum(struct net_device *netdev,
*/
struct netdev_queue_mgmt_ops {
size_t ndo_queue_mem_size;
+ void (*ndo_queue_cfg_defaults)(struct net_device *dev,
+ int idx,
+ struct netdev_queue_config *qcfg);
int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
void (*ndo_queue_mem_free)(struct net_device *dev,
void *per_queue_mem);
int (*ndo_queue_start)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
int (*ndo_queue_stop)(struct net_device *dev,
@@ -165,6 +181,9 @@ struct netdev_queue_mgmt_ops {
int idx);
};
+void netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg);
+
/**
* DOC: Lockless queue stopping / waking helpers.
*
diff --git a/net/core/dev.h b/net/core/dev.h
index c8971c6f1fcd..6d7f5e920018 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -9,6 +9,7 @@
#include <net/netdev_lock.h>
struct net;
+struct netdev_queue_config;
struct netlink_ext_ack;
struct cpumask;
@@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
void __netdev_free_config(struct netdev_config *cfg);
void netdev_free_config(struct net_device *dev);
int netdev_reconfig_start(struct net_device *dev);
+void __netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg, bool pending);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index 270b7f10a192..bad2d53522f0 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -8,18 +8,29 @@
int netdev_alloc_config(struct net_device *dev)
{
struct netdev_config *cfg;
+ unsigned int maxqs;
cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
if (!cfg)
return -ENOMEM;
+ maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+ cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg->qcfg)
+ goto err_free_cfg;
+
dev->cfg = cfg;
dev->cfg_pending = cfg;
return 0;
+
+err_free_cfg:
+ kfree(cfg);
+ return -ENOMEM;
}
void __netdev_free_config(struct netdev_config *cfg)
{
+ kfree(cfg->qcfg);
kfree(cfg);
}
@@ -32,12 +43,59 @@ void netdev_free_config(struct net_device *dev)
int netdev_reconfig_start(struct net_device *dev)
{
struct netdev_config *cfg;
+ unsigned int maxqs;
WARN_ON(dev->cfg != dev->cfg_pending);
cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
if (!cfg)
return -ENOMEM;
+ maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+ cfg->qcfg = kmemdup_array(dev->cfg->qcfg, maxqs, sizeof(*cfg->qcfg),
+ GFP_KERNEL_ACCOUNT);
+ if (!cfg->qcfg)
+ goto err_free_cfg;
+
dev->cfg_pending = cfg;
return 0;
+
+err_free_cfg:
+ kfree(cfg);
+ return -ENOMEM;
+}
+
+void __netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg, bool pending)
+{
+ memset(qcfg, 0, sizeof(*qcfg));
+
+ /* Get defaults from the driver, in case user config not set */
+ if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
+ dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+}
+
+/**
+ * netdev_queue_config() - get configuration for a given queue
+ * @dev: net_device instance
+ * @rxq: index of the queue of interest
+ * @qcfg: queue configuration struct (output)
+ *
+ * Render the configuration for a given queue. This helper should be used
+ * by drivers which support queue configuration to retrieve config for
+ * a particular queue.
+ *
+ * @qcfg is an output parameter and is always fully initialized by this
+ * function. Some values may not be set by the user; drivers may either
+ * deal with the "unset" values in @qcfg, or provide the callback
+ * to populate defaults in struct netdev_queue_mgmt_ops.
+ *
+ * Note that this helper returns the pending config, as it is expected
+ * that "old" queues are retained until the config change is successful,
+ * so they can be restored directly without asking for the config.
+ */
+void netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg)
+{
+ __netdev_queue_config(dev, rxq, qcfg, true);
}
+EXPORT_SYMBOL(netdev_queue_config);
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index d126f10197bf..d8a710db21cd 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -7,12 +7,14 @@
#include <net/netdev_rx_queue.h>
#include <net/page_pool/memory_provider.h>
+#include "dev.h"
#include "page_pool_priv.h"
int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_queue_config qcfg;
void *new_mem, *old_mem;
int err;
@@ -32,7 +34,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
goto err_free_new_mem;
}
- err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
+ netdev_queue_config(dev, rxq_idx, &qcfg);
+
+ err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
@@ -45,7 +49,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
if (err)
goto err_free_new_queue_mem;
- err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
+ err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_start_queue;
} else {
@@ -60,6 +64,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
return 0;
err_start_queue:
+ __netdev_queue_config(dev, rxq_idx, &qcfg, false);
/* Restarting the queue with old_mem should be successful as we haven't
* changed any of the queue configuration, and there is not much we can
* do to recover from a failure here.
@@ -67,7 +72,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
* WARN if we fail to recover the old rx queue, and at least free
* old_mem so we don't also leak that.
*/
- if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
+ if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
WARN(1,
"Failed to restart old queue in error path. RX queue %d may be unhealthy.",
rxq_idx);
--
2.49.0
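To illustrate the intended driver-side use of netdev_queue_config()
documented above, here is a minimal hypothetical sketch (the driver
names are invented, and the rx_buf_len field is only added to struct
netdev_queue_config later in this series):

/* hypothetical driver code, not part of the series */
static void foo_init_rx_ring(struct net_device *dev, int idx)
{
	struct netdev_queue_config qcfg;

	/* qcfg comes back fully initialized: driver defaults first,
	 * then any user configuration layered on top
	 */
	netdev_queue_config(dev, idx, &qcfg);
	foo_set_rx_buf_size(dev, idx, qcfg.rx_buf_len);
}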
* [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart()
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (10 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
` (11 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Pass extack to netdev_rx_queue_restart(). Subsequent change will need it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/netdevsim/netdev.c | 2 +-
include/net/netdev_rx_queue.h | 3 ++-
net/core/netdev_rx_queue.c | 7 ++++---
4 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c3195be0ac26..b5f7a65bf678 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11518,7 +11518,7 @@ static void bnxt_irq_affinity_notify(struct irq_affinity_notify *notify,
netdev_lock(irq->bp->dev);
if (netif_running(irq->bp->dev)) {
- err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr);
+ err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr, NULL);
if (err)
netdev_err(irq->bp->dev,
"RX queue restart failed: err=%d\n", err);
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 03003adc41fb..a759424cfde5 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -800,7 +800,7 @@ nsim_qreset_write(struct file *file, const char __user *data,
}
ns->rq_reset_mode = mode;
- ret = netdev_rx_queue_restart(ns->netdev, queue);
+ ret = netdev_rx_queue_restart(ns->netdev, queue, NULL);
ns->rq_reset_mode = 0;
if (ret)
goto exit_unlock;
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..a7def1f94823 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -56,6 +56,7 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
return index;
}
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq,
+ struct netlink_ext_ack *extack);
#endif
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index d8a710db21cd..b0523eb44e10 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -10,7 +10,8 @@
#include "dev.h"
#include "page_pool_priv.h"
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
+ struct netlink_ext_ack *extack)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
@@ -136,7 +137,7 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
#endif
rxq->mp_params = *p;
- ret = netdev_rx_queue_restart(dev, rxq_idx);
+ ret = netdev_rx_queue_restart(dev, rxq_idx, extack);
if (ret) {
rxq->mp_params.mp_ops = NULL;
rxq->mp_params.mp_priv = NULL;
@@ -179,7 +180,7 @@ void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
rxq->mp_params.mp_ops = NULL;
rxq->mp_params.mp_priv = NULL;
- err = netdev_rx_queue_restart(dev, ifq_idx);
+ err = netdev_rx_queue_restart(dev, ifq_idx, NULL);
WARN_ON(err && err != -ENETDOWN);
}
--
2.49.0
* [RFC v1 13/22] net: add queue config validation callback
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (11 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 22:26 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
` (10 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
I imagine (tm) that as the number of per-queue configuration
options grows, some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally,
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).
Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before each queue restart
and when ring params are modified.
For NIC-wide changes the callback gets invoked for each active
(or soon-to-be active) queue, and additionally with a negative
queue index for the NIC-wide defaults. The NIC-wide check is
needed in case all queues have an override active when the
NIC-wide setting is changed to an unsupported one. Alternatively
we could check the settings when new queues are enabled (in the
channel API), but accepting invalid config is a bad idea. Users
may expect that resetting a queue override will always work.
The "trick" of passing a negative index is a bit ugly; we may
want to revisit it if it causes confusion and bugs. Existing
drivers don't care about the index, so it "just works".
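A driver callback following this convention might look roughly like
the sketch below (hypothetical driver; qcfg->rx_buf_len only appears
later in the series, and the real bnxt implementation comes in a
later patch):

static int foo_queue_cfg_validate(struct net_device *dev, int idx,
				  struct netdev_queue_config *qcfg,
				  struct netlink_ext_ack *extack)
{
	/* idx < 0 means the NIC-wide defaults are being validated;
	 * this driver has no per-queue restrictions so it can ignore
	 * the index entirely.
	 */
	if (!is_power_of_2(qcfg->rx_buf_len)) {
		NL_SET_ERR_MSG_MOD(extack, "rx-buf-len must be a power of 2");
		return -ERANGE;
	}
	return 0;
}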
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/netdev_queues.h | 12 ++++++++++++
net/core/dev.h | 2 ++
net/core/netdev_config.c | 20 ++++++++++++++++++++
net/core/netdev_rx_queue.c | 6 ++++++
net/ethtool/rings.c | 5 +++++
5 files changed, 45 insertions(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index e3e7ecf91bac..f75313fc78ba 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -146,6 +146,14 @@ void netdev_stat_queue_sum(struct net_device *netdev,
* defaults. Queue config structs are passed to this
* helper before the user-requested settings are applied.
*
+ * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
+ * Called when configuration affecting a queue may be
+ * changing, either due to NIC-wide config, or config
+ * scoped to the queue at a specified index.
+ * When NIC-wide config is changed the callback will
+ * be invoked for all queues, and in addition to that
+ * with a negative queue index for the base settings.
+ *
* @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
* The new memory is written at the specified address.
*
@@ -166,6 +174,10 @@ struct netdev_queue_mgmt_ops {
void (*ndo_queue_cfg_defaults)(struct net_device *dev,
int idx,
struct netdev_queue_config *qcfg);
+ int (*ndo_queue_cfg_validate)(struct net_device *dev,
+ int idx,
+ struct netdev_queue_config *qcfg,
+ struct netlink_ext_ack *extack);
int (*ndo_queue_mem_alloc)(struct net_device *dev,
struct netdev_queue_config *qcfg,
void *per_queue_mem,
diff --git a/net/core/dev.h b/net/core/dev.h
index 6d7f5e920018..e0d433fb6325 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
int netdev_reconfig_start(struct net_device *dev);
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending);
+int netdev_queue_config_revalidate(struct net_device *dev,
+ struct netlink_ext_ack *extack);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index bad2d53522f0..fc700b77e4eb 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
__netdev_queue_config(dev, rxq, qcfg, true);
}
EXPORT_SYMBOL(netdev_queue_config);
+
+int netdev_queue_config_revalidate(struct net_device *dev,
+ struct netlink_ext_ack *extack)
+{
+ const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_queue_config qcfg;
+ int i, err;
+
+ if (!qops || !qops->ndo_queue_cfg_validate)
+ return 0;
+
+ for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
+ netdev_queue_config(dev, i, &qcfg);
+ err = qops->ndo_queue_cfg_validate(dev, i, &qcfg, extack);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index b0523eb44e10..7c691eb1a48b 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -37,6 +37,12 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
netdev_queue_config(dev, rxq_idx, &qcfg);
+ if (qops->ndo_queue_cfg_validate) {
+ err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
+ if (err)
+ goto err_free_old_mem;
+ }
+
err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 6a74e7e4064e..7884d10c090f 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -4,6 +4,7 @@
#include "netlink.h"
#include "common.h"
+#include "../core/dev.h"
struct rings_req_info {
struct ethnl_req_info base;
@@ -307,6 +308,10 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
+ ret = netdev_queue_config_revalidate(dev, info->extack);
+ if (ret)
+ return ret;
+
ret = dev->ethtool_ops->set_ringparam(dev, &ringparam,
&kernel_ringparam, info->extack);
return ret < 0 ? ret : 1;
--
2.49.0
* [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (12 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
` (9 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Core provides a centralized callback for validating per-queue settings
but the callback is part of the queue management ops. Having the ops
conditionally set complicates the parts of the driver which could
otherwise lean on the core to feed it the correct settings.
Always set the queue ops, but provide no restart-related callbacks if
queue ops are not supported by the device. This should maintain current
behavior; the check in netdev_rx_queue_restart() looks at both the
op struct and the individual ops.
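For reference, that check is roughly the following (condensed from
net/core/netdev_rx_queue.c):

if (!qops || !qops->ndo_queue_stop || !qops->ndo_queue_mem_free ||
    !qops->ndo_queue_mem_alloc || !qops->ndo_queue_start)
	return -EOPNOTSUPP;

An ops struct with no restart-related callbacks therefore keeps queue
restarts disabled while still letting the core reach the config
callbacks.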
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b5f7a65bf678..884fb3e99e65 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16126,6 +16126,9 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
.ndo_queue_stop = bnxt_queue_stop,
};
+static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+};
+
static void bnxt_remove_one(struct pci_dev *pdev)
{
struct net_device *dev = pci_get_drvdata(pdev);
@@ -16781,7 +16784,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bp->rss_cap |= BNXT_RSS_CAP_MULTI_RSS_CTX;
if (BNXT_SUPPORTS_QUEUE_API(bp))
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
- dev->request_ops_lock = true;
+ else
+ dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
rc = register_netdev(dev);
if (rc)
--
2.49.0
* [RFC v1 15/22] eth: bnxt: store the rx buf size per queue
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (13 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 22:33 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
` (8 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
In normal operation only a subset of queues is configured for
zero-copy. Since zero-copy is the main use case for larger buffer
sizes, we need to configure the sizes per queue.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 46 ++++++++++---------
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +--
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
4 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 884fb3e99e65..26fc275fb44b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -895,7 +895,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
{
- return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
+ return rxr->need_head_pool || PAGE_SIZE > rxr->rx_page_size;
}
static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -905,9 +905,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
{
struct page *page;
- if (PAGE_SIZE > bp->rx_page_size) {
+ if (PAGE_SIZE > rxr->rx_page_size) {
page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
- bp->rx_page_size);
+ rxr->rx_page_size);
} else {
page = page_pool_dev_alloc_pages(rxr->page_pool);
*offset = 0;
@@ -1139,9 +1139,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
- skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
+ skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
if (!skb) {
page_pool_recycle_direct(rxr->page_pool, page);
return NULL;
@@ -1173,7 +1173,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
if (unlikely(!payload))
@@ -1187,7 +1187,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
skb_mark_for_recycle(skb);
off = (void *)data_ptr - page_address(page);
- skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
+ skb_add_rx_frag(skb, 0, page, off, len, rxr->rx_page_size);
memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
payload + NET_IP_ALIGN);
@@ -1272,7 +1272,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
cons_rx_buf->offset,
- frag_len, bp->rx_page_size);
+ frag_len, rxr->rx_page_size);
} else {
skb_frag_t *frag = &shinfo->frags[i];
@@ -1297,7 +1297,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb->len -= frag_len;
skb->data_len -= frag_len;
- skb->truesize -= bp->rx_page_size;
+ skb->truesize -= rxr->rx_page_size;
}
--shinfo->nr_frags;
@@ -1312,7 +1312,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
}
page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
- bp->rx_page_size);
+ rxr->rx_page_size);
total_frag_len += frag_len;
prod = NEXT_RX_AGG(prod);
@@ -2265,8 +2265,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
if (!skb)
goto oom_next_rx;
} else {
- skb = bnxt_xdp_build_skb(bp, skb, agg_bufs,
- rxr->page_pool, &xdp);
+ skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr, &xdp);
if (!skb) {
/* we should be able to free the old skb here */
bnxt_xdp_buff_frags_free(rxr, &xdp);
@@ -3806,7 +3805,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
pp.pool_size = bp->rx_agg_ring_size;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
- pp.order = get_order(bp->rx_page_size);
+ pp.order = get_order(rxr->rx_page_size);
pp.nid = numa_node;
pp.napi = &rxr->bnapi->napi;
pp.netdev = bp->dev;
@@ -4292,6 +4291,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
+ rxr->rx_page_size = bp->rx_page_size;
+
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
rmem->nr_pages = bp->rx_nr_pages;
@@ -4451,7 +4452,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
ring = &rxr->rx_agg_ring_struct;
ring->fw_ring_id = INVALID_HW_RING_ID;
if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
- type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
+ type = ((u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
bnxt_init_rxbd_pages(ring, type);
@@ -7016,6 +7017,7 @@ static void bnxt_hwrm_ring_grp_free(struct bnxt *bp)
static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
struct hwrm_ring_alloc_input *req,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring)
{
struct bnxt_ring_grp_info *grp_info = &bp->grp_info[ring->grp_idx];
@@ -7025,7 +7027,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
if (ring_type == HWRM_RING_ALLOC_AGG) {
req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
- req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
+ req->rx_buf_size = cpu_to_le16(rxr->rx_page_size);
enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
} else {
req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -7039,6 +7041,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
}
static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring,
u32 ring_type, u32 map_index)
{
@@ -7095,7 +7098,8 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
cpu_to_le32(bp->rx_ring_mask + 1) :
cpu_to_le32(bp->rx_agg_ring_mask + 1);
if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
- bnxt_set_rx_ring_params_p5(bp, ring_type, req, ring);
+ bnxt_set_rx_ring_params_p5(bp, ring_type, req,
+ rxr, ring);
break;
case HWRM_RING_ALLOC_CMPL:
req->ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
@@ -7243,7 +7247,7 @@ static int bnxt_hwrm_rx_ring_alloc(struct bnxt *bp,
u32 map_idx = bnapi->index;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7263,7 +7267,7 @@ static int bnxt_hwrm_rx_agg_ring_alloc(struct bnxt *bp,
int rc;
map_idx = grp_idx + bp->rx_nr_rings;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7287,7 +7291,7 @@ static int bnxt_hwrm_cp_ring_alloc_p5(struct bnxt *bp,
ring = &cpr->cp_ring_struct;
ring->handle = BNXT_SET_NQ_HDL(cpr);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc)
return rc;
bnxt_set_db(bp, &cpr->cp_db, type, map_idx, ring->fw_ring_id);
@@ -7302,7 +7306,7 @@ static int bnxt_hwrm_tx_ring_alloc(struct bnxt *bp,
const u32 type = HWRM_RING_ALLOC_TX;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, tx_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, tx_idx);
if (rc)
return rc;
bnxt_set_db(bp, &txr->tx_db, type, tx_idx, ring->fw_ring_id);
@@ -7328,7 +7332,7 @@ static int bnxt_hwrm_ring_alloc(struct bnxt *bp)
vector = bp->irq_tbl[map_idx].vector;
disable_irq_nosync(vector);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc) {
enable_irq(vector);
goto err_out;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 56aafae568f8..4f9d4c71c0e2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
unsigned long *rx_agg_bmap;
u16 rx_agg_bmap_size;
+ u16 rx_page_size;
bool need_head_pool;
dma_addr_t rx_desc_mapping[MAX_RX_PAGES];
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 32bcc3aedee6..d18cc698c1c7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
u16 cons, u8 *data_ptr, unsigned int len,
struct xdp_buff *xdp)
{
- u32 buflen = bp->rx_page_size;
+ u32 buflen = rxr->rx_page_size;
struct bnxt_sw_rx_bd *rx_buf;
struct pci_dev *pdev;
dma_addr_t mapping;
@@ -461,7 +461,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
struct sk_buff *
bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
- struct page_pool *pool, struct xdp_buff *xdp)
+ struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp)
{
struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
- bp->rx_page_size * num_frags,
+ rxr->rx_page_size * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp));
return skb;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
index 220285e190fc..8933a0dec09a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
@@ -32,6 +32,6 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
void bnxt_xdp_buff_frags_free(struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
struct sk_buff *bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb,
- u8 num_frags, struct page_pool *pool,
+ u8 num_frags, struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
#endif
--
2.49.0
* [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (14 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
` (7 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
The driver tries to provision more agg buffers than header buffers
since multiple agg segments can reuse the same header. The calculation
/ heuristic tries to provide enough pages for 65k of data for each header
(or 4 frags per header if the result is too big). This calculation is
currently global to the adapter. If we increase the buffer sizes 8x
we don't want 8x the amount of memory sitting on the rings.
Luckily we don't have to fill the rings completely; adjust
the fill level dynamically in case a particular queue has buffers
larger than the global size.
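Concretely: assuming the default 4K rx_page_size and an agg ring of
2048 entries, a queue overridden to 32K buffers gets filled to only
2048 / (32768 / 4096) = 256 entries, keeping the number of bytes
sitting on the ring roughly constant.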
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 26fc275fb44b..017f08ca8d1d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3795,6 +3795,21 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
}
}
+static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr)
+{
+ /* User may have chosen a larger-than-default rx_page_size;
+ * we keep the ring sizes uniform and also want a uniform amount
+ * of bytes consumed per ring, so cap how much of the rings we fill.
+ */
+ int fill_level = bp->rx_agg_ring_size;
+
+ if (rxr->rx_page_size > bp->rx_page_size)
+ fill_level /= rxr->rx_page_size / bp->rx_page_size;
+
+ return fill_level;
+}
+
static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int numa_node)
@@ -3802,7 +3817,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
struct page_pool_params pp = { 0 };
struct page_pool *pool;
- pp.pool_size = bp->rx_agg_ring_size;
+ pp.pool_size = bnxt_rx_agg_ring_fill_level(bp, rxr);
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
pp.order = get_order(rxr->rx_page_size);
@@ -4370,11 +4385,13 @@ static void bnxt_alloc_one_rx_ring_netmem(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int ring_nr)
{
+ int fill_level, i;
u32 prod;
- int i;
+
+ fill_level = bnxt_rx_agg_ring_fill_level(bp, rxr);
prod = rxr->rx_agg_prod;
- for (i = 0; i < bp->rx_agg_ring_size; i++) {
+ for (i = 0; i < fill_level; i++) {
if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
ring_nr, i, bp->rx_ring_size);
--
2.49.0
* [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (15 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 23:10 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 18/22] net: wipe the setting of deactivated queues Pavel Begunkov
` (6 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Zero-copy APIs increase the cost of buffer management. They also extend
this cost to user space applications, which may be used to dealing with
much larger buffers. Allow setting rx-buf-len per queue; devices with
HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1,
but that's not a power of 2..)
The implementation adds a new option to the netdev netlink, rather
than ethtool. The NIC-wide setting lives in ethtool ringparams so
one could argue that we should be extending the ethtool API.
OTOH netdev API is where we already have queue-get, and it's how
zero-copy applications bind memory providers.
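With this in place, a per-queue override could be issued from
userspace roughly as follows (illustrative ynl invocation; the
attribute names match the spec change below):

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --do queue-set \
      --json '{"ifindex": 2, "type": "rx", "id": 1, "rx-buf-len": 32768}'

As implemented in __netdev_queue_config() below, a per-queue value
takes precedence over the NIC-wide ethtool setting, which in turn
overrides the driver defaults.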
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
Documentation/netlink/specs/netdev.yaml | 15 ++++
include/net/netdev_queues.h | 5 ++
include/net/netlink.h | 19 +++++
include/uapi/linux/netdev.h | 2 +
net/core/netdev-genl-gen.c | 15 ++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 92 +++++++++++++++++++++++++
net/core/netdev_config.c | 16 +++++
tools/include/uapi/linux/netdev.h | 2 +
9 files changed, 167 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c0ef6d0d7786..5dd1eb5909cd 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -324,6 +324,10 @@ attribute-sets:
doc: XSK information for this queue, if any.
type: nest
nested-attributes: xsk-info
+ -
+ name: rx-buf-len
+ doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
+ type: u32
-
name: qstats
doc: |
@@ -755,6 +759,17 @@ operations:
reply:
attributes:
- id
+ -
+ name: queue-set
+ doc: Set per-queue configurable options.
+ attribute-set: queue
+ do:
+ request:
+ attributes:
+ - ifindex
+ - type
+ - id
+ - rx-buf-len
kernel-family:
headers: [ "net/netdev_netlink.h"]
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index f75313fc78ba..cfd2d59861e1 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -38,6 +38,7 @@ struct netdev_config {
/* Same semantics as fields in struct netdev_config */
struct netdev_queue_config {
+ u32 rx_buf_len;
};
/* See the netdev.yaml spec for definition of each statistic */
@@ -140,6 +141,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
/**
* struct netdev_queue_mgmt_ops - netdev ops for queue management
*
+ * @supported_ring_params: ring params supported per queue (ETHTOOL_RING_USE_*).
+ *
* @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
*
* @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
@@ -170,6 +173,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
* be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
+ u32 supported_ring_params;
+
size_t ndo_queue_mem_size;
void (*ndo_queue_cfg_defaults)(struct net_device *dev,
int idx,
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 90a560dc167a..c892cae8f592 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -2186,6 +2186,25 @@ static inline struct nla_bitfield32 nla_get_bitfield32(const struct nlattr *nla)
return tmp;
}
+/**
+ * nla_update_u32() - update u32 value from NLA_U32 attribute
+ * @dst: value to update
+ * @attr: netlink attribute with new value or null
+ *
+ * Copy the u32 value from NLA_U32 netlink attribute @attr into variable
+ * pointed to by @dst; do nothing if @attr is null.
+ *
+ * Return: true if this function changed the value of @dst, otherwise false.
+ */
+static inline bool nla_update_u32(u32 *dst, const struct nlattr *attr)
+{
+ u32 old_val = *dst;
+
+ if (attr)
+ *dst = nla_get_u32(attr);
+ return *dst != old_val;
+}
+
/**
* nla_memdup - duplicate attribute memory (kmemdup)
* @src: netlink attribute to duplicate from
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7eb9571786b8..98fa988c8db2 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -152,6 +152,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_RX_BUF_LEN,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -220,6 +221,7 @@ enum {
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
NETDEV_CMD_BIND_TX,
+ NETDEV_CMD_QUEUE_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 4fc44587f493..ac25584a829d 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -105,6 +105,14 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
};
+/* NETDEV_CMD_QUEUE_SET - do */
+static const struct nla_policy netdev_queue_set_nl_policy[NETDEV_A_QUEUE_RX_BUF_LEN + 1] = {
+ [NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+ [NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
+ [NETDEV_A_QUEUE_ID] = { .type = NLA_U32, },
+ [NETDEV_A_QUEUE_RX_BUF_LEN] = { .type = NLA_U32, },
+};
+
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@@ -203,6 +211,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_DMABUF_FD,
.flags = GENL_CMD_CAP_DO,
},
+ {
+ .cmd = NETDEV_CMD_QUEUE_SET,
+ .doit = netdev_nl_queue_set_doit,
+ .policy = netdev_queue_set_nl_policy,
+ .maxattr = NETDEV_A_QUEUE_RX_BUF_LEN,
+ .flags = GENL_CMD_CAP_DO,
+ },
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index cf3fad74511f..b7f5e5d9fca9 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info);
enum {
NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2afa7b2141aa..52ec8287e835 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -372,6 +372,30 @@ static int nla_put_napi_id(struct sk_buff *skb, const struct napi_struct *napi)
return 0;
}
+static int
+netdev_nl_queue_fill_cfg(struct sk_buff *rsp, struct net_device *netdev,
+ u32 q_idx, u32 q_type)
+{
+ struct netdev_queue_config *qcfg;
+
+ if (!netdev_need_ops_lock(netdev))
+ return 0;
+
+ qcfg = &netdev->cfg->qcfg[q_idx];
+ switch (q_type) {
+ case NETDEV_QUEUE_TYPE_RX:
+ if (qcfg->rx_buf_len &&
+ nla_put_u32(rsp, NETDEV_A_QUEUE_RX_BUF_LEN,
+ qcfg->rx_buf_len))
+ return -EMSGSIZE;
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)
@@ -419,6 +443,9 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
break;
}
+ if (netdev_nl_queue_fill_cfg(rsp, netdev, q_idx, q_type))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
@@ -558,6 +585,71 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
return err;
}
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr * const *tb = info->attrs;
+ struct netdev_queue_config *qcfg;
+ u32 q_id, q_type, ifindex;
+ struct net_device *netdev;
+ bool mod;
+ int ret;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_TYPE) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_IFINDEX))
+ return -EINVAL;
+
+ q_id = nla_get_u32(tb[NETDEV_A_QUEUE_ID]);
+ q_type = nla_get_u32(tb[NETDEV_A_QUEUE_TYPE]);
+ ifindex = nla_get_u32(tb[NETDEV_A_QUEUE_IFINDEX]);
+
+ if (q_type != NETDEV_QUEUE_TYPE_RX) {
+ /* Only Rx params exist right now */
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_TYPE]);
+ return -EINVAL;
+ }
+
+ ret = 0;
+ netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex);
+ if (!netdev || !netif_device_present(netdev))
+ ret = -ENODEV;
+ else if (!netdev->queue_mgmt_ops)
+ ret = -EOPNOTSUPP;
+ if (ret) {
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_IFINDEX]);
+ goto exit_unlock;
+ }
+
+ ret = netdev_nl_queue_validate(netdev, q_id, q_type);
+ if (ret) {
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_ID]);
+ goto exit_unlock;
+ }
+
+ ret = netdev_reconfig_start(netdev);
+ if (ret)
+ goto exit_unlock;
+
+ qcfg = &netdev->cfg_pending->qcfg[q_id];
+ mod = nla_update_u32(&qcfg->rx_buf_len, tb[NETDEV_A_QUEUE_RX_BUF_LEN]);
+ if (!mod)
+ goto exit_free_cfg;
+
+ ret = netdev_rx_queue_restart(netdev, q_id, info->extack);
+ if (ret)
+ goto exit_free_cfg;
+
+ swap(netdev->cfg, netdev->cfg_pending);
+
+exit_free_cfg:
+ __netdev_free_config(netdev->cfg_pending);
+ netdev->cfg_pending = netdev->cfg;
+exit_unlock:
+ if (netdev)
+ netdev_unlock(netdev);
+ return ret;
+}
+
#define NETDEV_STAT_NOT_SET (~0ULL)
static void netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index fc700b77e4eb..ede02b77470e 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -67,11 +67,27 @@ int netdev_reconfig_start(struct net_device *dev)
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending)
{
+ const struct netdev_config *cfg;
+
+ cfg = pending ? dev->cfg_pending : dev->cfg;
+
memset(qcfg, 0, sizeof(*qcfg));
/* Get defaults from the driver, in case user config not set */
if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+
+ /* Set config based on device-level settings */
+ if (cfg->rx_buf_len)
+ qcfg->rx_buf_len = cfg->rx_buf_len;
+
+ /* Set config dedicated to this queue */
+ if (rxq >= 0) {
+ const struct netdev_queue_config *user_cfg = &cfg->qcfg[rxq];
+
+ if (user_cfg->rx_buf_len)
+ qcfg->rx_buf_len = user_cfg->rx_buf_len;
+ }
}
/**
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7eb9571786b8..98fa988c8db2 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -152,6 +152,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_RX_BUF_LEN,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -220,6 +221,7 @@ enum {
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
NETDEV_CMD_BIND_TX,
+ NETDEV_CMD_QUEUE_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
--
2.49.0
* [RFC v1 18/22] net: wipe the setting of deactivated queues
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (16 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 19/22] eth: bnxt: use queue op config validate Pavel Begunkov
` (5 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Clear out all settings of deactivated queues when the user changes
the number of channels. We already perform similar cleanup
for shapers.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
net/core/dev.c | 5 +++++
net/core/dev.h | 2 ++
net/core/netdev_config.c | 13 +++++++++++++
3 files changed, 20 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 757fa06d7392..2446e7136bd8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3190,6 +3190,8 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
if (dev->num_tc)
netif_setup_tc(dev, txq);
+ netdev_queue_config_update_cnt(dev, txq,
+ dev->real_num_rx_queues);
net_shaper_set_real_num_tx_queues(dev, txq);
dev_qdisc_change_real_num_tx(dev, txq);
@@ -3236,6 +3238,9 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
rxq);
if (rc)
return rc;
+
+ netdev_queue_config_update_cnt(dev, dev->real_num_tx_queues,
+ rxq);
}
dev->real_num_rx_queues = rxq;
diff --git a/net/core/dev.h b/net/core/dev.h
index e0d433fb6325..4cdd8ac7df4f 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -101,6 +101,8 @@ void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending);
int netdev_queue_config_revalidate(struct net_device *dev,
struct netlink_ext_ack *extack);
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+ unsigned int rxq);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index ede02b77470e..c5ae39e76f40 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -64,6 +64,19 @@ int netdev_reconfig_start(struct net_device *dev)
return -ENOMEM;
}
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+ unsigned int rxq)
+{
+ size_t len;
+
+ if (rxq < dev->real_num_rx_queues) {
+ len = (dev->real_num_rx_queues - rxq) * sizeof(*dev->cfg->qcfg);
+
+ memset(&dev->cfg->qcfg[rxq], 0, len);
+ memset(&dev->cfg_pending->qcfg[rxq], 0, len);
+ }
+}
+
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending)
{
--
2.49.0
* [RFC v1 19/22] eth: bnxt: use queue op config validate
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (17 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 18/22] net: wipe the setting of deactivated queues Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
` (4 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Move the rx-buf-len config validation to the queue ops.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 40 +++++++++++++++++++
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ------
2 files changed, 40 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 017f08ca8d1d..5788518fe407 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16139,8 +16139,46 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
return 0;
}
+static int
+bnxt_queue_cfg_validate(struct net_device *dev, int idx,
+ struct netdev_queue_config *qcfg,
+ struct netlink_ext_ack *extack)
+{
+ struct bnxt *bp = netdev_priv(dev);
+
+ /* Older chips need MSS calc so rx_buf_len is not supported,
+ * but we don't set queue ops for them so we should never get here.
+ */
+ if (qcfg->rx_buf_len != bp->rx_page_size &&
+ !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+ NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+ return -EINVAL;
+ }
+
+ if (!is_power_of_2(qcfg->rx_buf_len)) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
+ return -ERANGE;
+ }
+ if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+ qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
+ return -ERANGE;
+ }
+ return 0;
+}
+
+static void
+bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
+ struct netdev_queue_config *qcfg)
+{
+ qcfg->rx_buf_len = BNXT_RX_PAGE_SIZE;
+}
+
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
.ndo_queue_mem_size = sizeof(struct bnxt_rx_ring_info),
+
+ .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
+ .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
.ndo_queue_mem_alloc = bnxt_queue_mem_alloc,
.ndo_queue_mem_free = bnxt_queue_mem_free,
.ndo_queue_start = bnxt_queue_start,
@@ -16148,6 +16186,8 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
};
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+ .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
+ .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
};
static void bnxt_remove_one(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 0e225414d463..38178051e0d3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -867,18 +867,6 @@ static int bnxt_set_ringparam(struct net_device *dev,
if (!kernel_ering->rx_buf_len) /* Zero means restore default */
kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
- if (kernel_ering->rx_buf_len != bp->rx_page_size &&
- !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
- NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
- return -EINVAL;
- }
- if (!is_power_of_2(kernel_ering->rx_buf_len) ||
- kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
- kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
- NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
- return -ERANGE;
- }
-
if (netif_running(dev))
bnxt_close_nic(bp, false, false);
--
2.49.0
* [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (18 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 19/22] eth: bnxt: use queue op config validate Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
` (3 subsequent siblings)
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
From: Jakub Kicinski <kuba@kernel.org>
Now that the rx_buf_len is stored and validated per queue, allow
it to be set differently for different queues. Instead of copying
the device setting for each queue, ask the core for the config
via netdev_queue_config().
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5788518fe407..8d2cae59c4d5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4284,6 +4284,7 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
for (i = 0; i < bp->cp_nr_rings; i++) {
struct bnxt_napi *bnapi = bp->bnapi[i];
+ struct netdev_queue_config qcfg;
struct bnxt_ring_mem_info *rmem;
struct bnxt_cp_ring_info *cpr;
struct bnxt_rx_ring_info *rxr;
@@ -4306,7 +4307,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
- rxr->rx_page_size = bp->rx_page_size;
+ netdev_queue_config(bp->dev, i, &qcfg);
+ rxr->rx_page_size = qcfg.rx_buf_len;
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
@@ -15863,6 +15865,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev,
clone->rx_agg_prod = 0;
clone->rx_sw_agg_prod = 0;
clone->rx_next_cons = 0;
+ clone->rx_page_size = qcfg->rx_buf_len;
clone->need_head_pool = false;
rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
@@ -15969,6 +15972,8 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
src_ring = &src->rx_ring_struct;
src_rmem = &src_ring->ring_mem;
+ dst->rx_page_size = src->rx_page_size;
+
WARN_ON(dst_rmem->nr_pages != src_rmem->nr_pages);
WARN_ON(dst_rmem->page_size != src_rmem->page_size);
WARN_ON(dst_rmem->flags != src_rmem->flags);
@@ -16175,6 +16180,7 @@ bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
}
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
+ .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN,
.ndo_queue_mem_size = sizeof(struct bnxt_rx_ring_info),
.ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
--
2.49.0
* [RFC v1 21/22] net: parametrise mp open with a queue config
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (19 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-08-02 0:10 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support Pavel Begunkov
` (2 subsequent siblings)
23 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
This patch allows memory providers to pass a queue config when opening a
queue. It'll be used in the next patch to pass a custom rx buffer length
from zcrx. As there are many users of netdev_rx_queue_restart(), it's
allowed to pass a NULL qcfg, in which case the function will use the
default configuration.
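For illustration, a provider-side caller could then look roughly like
this (hypothetical sketch; SZ_32K stands in for whatever length the
provider chose):

struct netdev_queue_config qcfg = {
	.rx_buf_len = SZ_32K,
};

/* a non-NULL qcfg overrides the rendered per-queue config */
ret = net_mp_open_rxq(dev, rxq_idx, &mp_param, &qcfg);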
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/page_pool/memory_provider.h | 4 +-
io_uring/zcrx.c | 2 +-
net/core/netdev_rx_queue.c | 50 +++++++++++++++++--------
3 files changed, 39 insertions(+), 17 deletions(-)
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index ada4f968960a..c08ba208f67d 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -5,6 +5,7 @@
#include <net/netmem.h>
#include <net/page_pool/types.h>
+struct netdev_queue_config;
struct netdev_rx_queue;
struct netlink_ext_ack;
struct sk_buff;
@@ -24,7 +25,8 @@ void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
void net_mp_niov_clear_page_pool(struct net_iov *niov);
int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
- struct pp_memory_provider_params *p);
+ struct pp_memory_provider_params *p,
+ struct netdev_queue_config *qcfg);
int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
const struct pp_memory_provider_params *p,
struct netlink_ext_ack *extack);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 985c7386e24b..a00243e10164 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -595,7 +595,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
mp_param.mp_ops = &io_uring_pp_zc_ops;
mp_param.mp_priv = ifq;
- ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param);
+ ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param, NULL);
if (ret)
goto err;
ifq->if_rxq = reg.if_rxq;
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 7c691eb1a48b..0dbfdb5f5b91 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -10,12 +10,14 @@
#include "dev.h"
#include "page_pool_priv.h"
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
- struct netlink_ext_ack *extack)
+static int netdev_rx_queue_restart_cfg(struct net_device *dev,
+ unsigned int rxq_idx,
+ struct netlink_ext_ack *extack,
+ struct netdev_queue_config *qcfg)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
- struct netdev_queue_config qcfg;
+ struct netdev_queue_config tmp_qcfg;
void *new_mem, *old_mem;
int err;
@@ -35,15 +37,18 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
goto err_free_new_mem;
}
- netdev_queue_config(dev, rxq_idx, &qcfg);
+ if (!qcfg) {
+ qcfg = &tmp_qcfg;
+ netdev_queue_config(dev, rxq_idx, qcfg);
+ }
if (qops->ndo_queue_cfg_validate) {
- err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
+ err = qops->ndo_queue_cfg_validate(dev, rxq_idx, qcfg, extack);
if (err)
goto err_free_old_mem;
}
- err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
+ err = qops->ndo_queue_mem_alloc(dev, qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
@@ -56,7 +61,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
if (err)
goto err_free_new_queue_mem;
- err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
+ err = qops->ndo_queue_start(dev, qcfg, new_mem, rxq_idx);
if (err)
goto err_start_queue;
} else {
@@ -71,7 +76,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
return 0;
err_start_queue:
- __netdev_queue_config(dev, rxq_idx, &qcfg, false);
+ __netdev_queue_config(dev, rxq_idx, qcfg, false);
/* Restarting the queue with old_mem should be successful as we haven't
* changed any of the queue configuration, and there is not much we can
* do to recover from a failure here.
@@ -79,7 +84,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
* WARN if we fail to recover the old rx queue, and at least free
* old_mem so we don't also leak that.
*/
- if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
+ if (qops->ndo_queue_start(dev, qcfg, old_mem, rxq_idx)) {
WARN(1,
"Failed to restart old queue in error path. RX queue %d may be unhealthy.",
rxq_idx);
@@ -97,11 +102,18 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
return err;
}
+
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
+ struct netlink_ext_ack *extack)
+{
+ return netdev_rx_queue_restart_cfg(dev, rxq_idx, extack, NULL);
+}
EXPORT_SYMBOL_NS_GPL(netdev_rx_queue_restart, "NETDEV_INTERNAL");
-int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
- const struct pp_memory_provider_params *p,
- struct netlink_ext_ack *extack)
+static int __net_mp_open_rxq_cfg(struct net_device *dev, unsigned int rxq_idx,
+ const struct pp_memory_provider_params *p,
+ struct netlink_ext_ack *extack,
+ struct netdev_queue_config *qcfg)
{
struct netdev_rx_queue *rxq;
int ret;
@@ -143,7 +155,7 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
#endif
rxq->mp_params = *p;
- ret = netdev_rx_queue_restart(dev, rxq_idx, extack);
+ ret = netdev_rx_queue_restart_cfg(dev, rxq_idx, extack, qcfg);
if (ret) {
rxq->mp_params.mp_ops = NULL;
rxq->mp_params.mp_priv = NULL;
@@ -151,13 +163,21 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
return ret;
}
+int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
+ const struct pp_memory_provider_params *p,
+ struct netlink_ext_ack *extack)
+{
+ return __net_mp_open_rxq_cfg(dev, rxq_idx, p, extack, NULL);
+}
+
int net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
- struct pp_memory_provider_params *p)
+ struct pp_memory_provider_params *p,
+ struct netdev_queue_config *qcfg)
{
int ret;
netdev_lock(dev);
- ret = __net_mp_open_rxq(dev, rxq_idx, p, NULL);
+ ret = __net_mp_open_rxq_cfg(dev, rxq_idx, p, NULL, qcfg);
netdev_unlock(dev);
return ret;
}
--
2.49.0
* [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (20 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
@ 2025-07-28 11:04 ` Pavel Begunkov
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
2025-07-28 18:54 ` Mina Almasry
23 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 11:04 UTC (permalink / raw)
To: Jakub Kicinski, netdev
Cc: asml.silence, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
There are network cards that support receive buffers larger than 4K,
which can be vastly beneficial for performance; benchmarks for this
patch showed up to 30% CPU util improvement for 32K vs 4K buffers.
Allow zcrx users to specify the size in struct
io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a
default value. zcrx will check and fail if the memory backing the area
can't be split into physically contiguous chunks of the required size.
This check is stricter than necessary, as only the DMA addresses need
to be contiguous, but relaxing it is beyond this series.
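For illustration, a minimal userspace sketch of requesting 32K chunks at
registration time; rx_buf_len is the field added here, while the register
opcode and the area/region setup around it are assumed from the existing
zcrx uAPI:

	struct io_uring_zcrx_ifq_reg reg = {
		.if_idx		= ifindex,	/* target netdev */
		.if_rxq		= rxq_id,	/* queue reserved for zcrx */
		.rq_entries	= 4096,
		.area_ptr	= (__u64)(uintptr_t)&area_reg,
		.region_ptr	= (__u64)(uintptr_t)&region,
		.rx_buf_len	= 32768,	/* 0 keeps the default */
	};

	/* fails with -EINVAL if the value is not a power of two >= PAGE_SIZE
	 * or the area memory can't be chunked to it */
	ret = io_uring_register(ring_fd, IORING_REGISTER_ZCRX_IFQ, &reg, 1);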
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/uapi/linux/io_uring.h | 2 +-
io_uring/zcrx.c | 39 +++++++++++++++++++++++++++++------
2 files changed, 34 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9d306eb5251c..8e3a342a4ad8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -1041,7 +1041,7 @@ struct io_uring_zcrx_ifq_reg {
struct io_uring_zcrx_offsets offsets;
__u32 zcrx_id;
- __u32 __resv2;
+ __u32 rx_buf_len;
__u64 __resv[3];
};
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index a00243e10164..3caa3f472af1 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -13,6 +13,7 @@
#include <net/page_pool/memory_provider.h>
#include <net/netlink.h>
#include <net/netdev_rx_queue.h>
+#include <net/netdev_queues.h>
#include <net/tcp.h>
#include <net/rps.h>
@@ -53,6 +54,18 @@ static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
return area->mem.pages[net_iov_idx(niov) << niov_pages_shift];
}
+static int io_area_max_shift(struct io_zcrx_mem *mem)
+{
+ struct sg_table *sgt = mem->sgt;
+ struct scatterlist *sg;
+ unsigned order = -1U;
+ unsigned i;
+
+ for_each_sgtable_dma_sg(sgt, sg, i)
+ order = min(order, __ffs(sg->length));
+ return order;
+}
+
static int io_populate_area_dma(struct io_zcrx_ifq *ifq,
struct io_zcrx_area *area)
{
@@ -384,8 +397,10 @@ static int io_zcrx_append_area(struct io_zcrx_ifq *ifq,
}
static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
- struct io_uring_zcrx_area_reg *area_reg)
+ struct io_uring_zcrx_area_reg *area_reg,
+ struct io_uring_zcrx_ifq_reg *reg)
{
+ int buf_size_shift = PAGE_SHIFT;
struct io_zcrx_area *area;
unsigned nr_iovs;
int i, ret;
@@ -400,7 +415,16 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
if (ret)
goto err;
- ifq->niov_shift = PAGE_SHIFT;
+ if (reg->rx_buf_len) {
+ if (!is_power_of_2(reg->rx_buf_len) ||
+ reg->rx_buf_len < PAGE_SIZE)
+ return -EINVAL;
+ buf_size_shift = ilog2(reg->rx_buf_len);
+ }
+ if (buf_size_shift > io_area_max_shift(&area->mem))
+ return -EINVAL;
+
+ ifq->niov_shift = buf_size_shift;
nr_iovs = area->mem.size >> ifq->niov_shift;
area->nia.num_niovs = nr_iovs;
@@ -522,6 +546,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
{
struct pp_memory_provider_params mp_param = {};
+ struct netdev_queue_config qcfg = {};
struct io_uring_zcrx_area_reg area;
struct io_uring_zcrx_ifq_reg reg;
struct io_uring_region_desc rd;
@@ -544,8 +569,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
return -EFAULT;
if (copy_from_user(&rd, u64_to_user_ptr(reg.region_ptr), sizeof(rd)))
return -EFAULT;
- if (!mem_is_zero(&reg.__resv, sizeof(reg.__resv)) ||
- reg.__resv2 || reg.zcrx_id)
+ if (!mem_is_zero(&reg.__resv, sizeof(reg.__resv)) || reg.zcrx_id)
return -EINVAL;
if (reg.if_rxq == -1 || !reg.rq_entries || reg.flags)
return -EINVAL;
@@ -589,13 +613,14 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
}
get_device(ifq->dev);
- ret = io_zcrx_create_area(ifq, &area);
+ ret = io_zcrx_create_area(ifq, &area, &reg);
if (ret)
goto err;
mp_param.mp_ops = &io_uring_pp_zc_ops;
mp_param.mp_priv = ifq;
- ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param, NULL);
+ qcfg.rx_buf_len = 1U << ifq->niov_shift;
+ ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param, &qcfg);
if (ret)
goto err;
ifq->if_rxq = reg.if_rxq;
@@ -612,6 +637,8 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
goto err;
}
+ reg.rx_buf_len = 1U << ifq->niov_shift;
+
if (copy_to_user(arg, &reg, sizeof(reg)) ||
copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd)) ||
copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
--
2.49.0
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (21 preceding siblings ...)
2025-07-28 11:04 ` [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support Pavel Begunkov
@ 2025-07-28 17:13 ` Stanislav Fomichev
2025-07-28 18:18 ` Pavel Begunkov
2025-07-28 18:54 ` Mina Almasry
23 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-28 17:13 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 07/28, Pavel Begunkov wrote:
> This series implements large rx buffer support for io_uring/zcrx on
> top of Jakub's queue configuration changes, but it can also be used
> by other memory providers. [...]
Supporting big buffers is the right direction, but I have the same
feedback: it would be nice to fit a cohesive story for the devmem as well.
We should also aim for another use-case where we allocate page pool
chunks from the huge page(s), this should push the perf even more.
We need some way to express these things from the UAPI point of view.
Flipping the rx-buf-len value seems too fragile - there needs to be
something to request 32K chunks only for the devmem case, not for the (default)
CPU memory. And the queues should go back to default 4K pages when the dmabuf
is detached from the queue.
* Re: [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
@ 2025-07-28 18:11 ` Mina Almasry
2025-07-28 21:36 ` Mina Almasry
1 sibling, 0 replies; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 18:11 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
>
> The various zero-copy TCP Rx schemes we have suffer from memory
> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> Documentation/networking/ethtool-netlink.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b6e9af4d0f1b..eaa9c17a3cb1 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
> driver. Driver may impose additional constraints and may not support all
> attributes.
>
> -
> ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
> Completion queue events (CQE) are the events posted by NIC to indicate the
> completion status of a packet when the packet is sent (like send success or
> @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> header / data split feature. If a received packet size is larger than this
> threshold value, header and data will be split.
>
> +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> +uses to receive packets. If the device uses different memory polls for headers
pools, not polls.
> +and payload this setting may control the size of the header buffers but must
> +control the size of the payload buffers.
> +
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
@ 2025-07-28 18:18 ` Pavel Begunkov
2025-07-28 20:21 ` Stanislav Fomichev
0 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 18:18 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 7/28/25 18:13, Stanislav Fomichev wrote:
> On 07/28, Pavel Begunkov wrote:
>> This series implements large rx buffer support for io_uring/zcrx on
>> top of Jakub's queue configuration changes, but it can also be used
>> by other memory providers. [...]
>
> Supporting big buffers is the right direction, but I have the same
> feedback:
Let me actually check the feedback for the queue config RFC...
> it would be nice to fit a cohesive story for the devmem as well.
Only the last patch is zcrx specific, the rest is agnostic,
devmem can absolutely reuse that. I don't think there are any
issues wiring up devmem?
> We should also aim for another use-case where we allocate page pool
> chunks from the huge page(s),
Separate huge page pool is a bit beyond the scope of this series.
> this should push the perf even more.
And I'm not sure where the "even more" comes from; you can already
register a huge page with zcrx, and this will allow chunking
it into 32K or so for the hardware. Is it in terms of applicability,
or do you have some perf optimisation ideas?
> We need some way to express these things from the UAPI point of view.
Can you elaborate?
> Flipping the rx-buf-len value seems too fragile - there needs to be
> something to request 32K chunks only for devmem case, not for the (default)
> CPU memory. And the queues should go back to default 4K pages when the dmabuf
> is detached from the queue.
That's what the per-queue config is solving. It's not the default: zcrx
configures it only for the specific queue it allocated, and the value
is cleared on restart in netdev_rx_queue_restart(), if anything too
aggressively. Maybe I should just stash it into mp_params to make
sure it's not cleared if a provider is still attached on a spurious
restart.
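Something like this, as a sketch (the struct is the existing
pp_memory_provider_params, the new field is hypothetical):

	struct pp_memory_provider_params {
		void *mp_priv;
		const struct memory_provider_ops *mp_ops;
		u32 rx_buf_len;	/* hypothetical: provider-requested chunk
				 * size, preserved while the provider is
				 * attached */
	};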
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
` (22 preceding siblings ...)
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
@ 2025-07-28 18:54 ` Mina Almasry
2025-07-28 19:42 ` Pavel Begunkov
23 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 18:54 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> This series implements large rx buffer support for io_uring/zcrx on
> top of Jakub's queue configuration changes, but it can also be used
> by other memory providers. Large rx buffers can be drastically
> beneficial with high-end hw-gro enabled cards that can coalesce traffic
> into larger pages, reducing the number of frags traversing the network
> stack and resulting in larger contiguous chunks of data for the
> userspace. Benchmarks showed up to ~30% improvement in CPU util.
>
Very exciting.
I have not yet had a chance to thoroughly look, but even still I have
a few high level questions/concerns. Maybe you already have answers to
them that can make my life a bit easier as I try to take a thorough
look.
- I'm a bit confused that you're not making changes to the core net
stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
that there are potentially a ton of places in the net stack that
assume PAGE_SIZE:
cd net
ackc "PAGE_SIZE|PAGE_SHIFT" | wc -l
468
Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
concerning?
- You're not adding a field in the net_iov that tells us how big the
net_iov is. It seems to me you're configuring the driver to set the rx
buffer size, then assuming all the pp allocations are of that size,
then assuming in the zcrx code that all the net_iov are of that size.
I think a few problems may happen?
(a) what happens if the rx buffer size is re-configured? Does the
io_uring zcrx instance get recreated as well?
(b) what happens with skb coalescing? skb coalescing is already a bit
of a mess. We don't allow coalescing unreadable and readable skbs, but
we do allow coalescing devmem and iozcrx skbs which could lead to some
bugs I'm guessing already. AFAICT as of this patch series we may allow
coalescing of skbs with netmems inside of them of different sizes, but
AFAICT so far, the iozcrx assume the size is constant across all the
netmems it gets, which I'm not sure is always true?
For all these reasons I had assumed that we'd need space in the
net_iov that tells us its size: net_iov->size.
And then netmem_size(netmem) would replace all the PAGE_SIZE
assumptions in the net stack, and then we'd disallow coalescing of
skbs with different-sized netmems (else we need to handle them
correctly per the netmem_size).
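Roughly what I have in mind, as a sketch (net_iov->size would be a new
field, helper names approximate):

	static inline unsigned int netmem_size(netmem_ref netmem)
	{
		if (netmem_is_net_iov(netmem))
			return netmem_to_net_iov(netmem)->size;	/* new field */
		return PAGE_SIZE << compound_order(netmem_to_page(netmem));
	}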
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 18:54 ` Mina Almasry
@ 2025-07-28 19:42 ` Pavel Begunkov
2025-07-28 20:23 ` Mina Almasry
0 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 19:42 UTC (permalink / raw)
To: Mina Almasry
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On 7/28/25 19:54, Mina Almasry wrote:
> On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> This series implements large rx buffer support for io_uring/zcrx on
>> top of Jakub's queue configuration changes, but it can also be used
>> by other memory providers. Large rx buffers can be drastically
>> beneficial with high-end hw-gro enabled cards that can coalesce traffic
>> into larger pages, reducing the number of frags traversing the network
>> stack and resulting in larger contiguous chunks of data for the
>> userspace. Benchmarks showed up to ~30% improvement in CPU util.
>>
>
> Very exciting.
>
> I have not yet had a chance to thoroughly look, but even still I have
> a few high level questions/concerns. Maybe you already have answers to
> them that can make my life a bit easier as I try to take a thorough
> look.
>
> - I'm a bit confused that you're not making changes to the core net
> stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
> that there are potentially a ton of places in the net stack that
> assume PAGE_SIZE:
The stack already supports large frags and it's not new. Page pool
has higher-order allocations, see __page_pool_alloc_page_order(). The
tx path can allocate large pages / coalesce user pages. Any specific
place that concerns you? There are many places legitimately using
PAGE_SIZE: kmap'ing folios, shifting it by order to get the size,
linear allocations, etc.
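E.g. the knob has been there for a while; a driver wanting 32K chunks
could do something along these lines (sketch, not from this series):

	struct page_pool_params pp_params = {
		.order		= 3,	/* 32K chunks with 4K pages */
		.pool_size	= 256,
		.dev		= dev,
		/* napi, dma direction, etc. */
	};
	struct page_pool *pool = page_pool_create(&pp_params);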
> cd net
> ack "PAGE_SIZE|PAGE_SHIFT" | wc -l
> 468
>
> Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
> concerning?
>
> - You're not adding a field in the net_iov that tells us how big the
> net_iov is. It seems to me you're configuring the driver to set the rx
> buffer size, then assuming all the pp allocations are of that size,
> then assuming in the zcrx code that all the net_iov are of that size.
> I think a few problems may happen?
>
> (a) what happens if the rx buffer size is re-configured? Does the
> io_uring zcrx instance get recreated as well?
Any reason you even want it to work? You can't and frankly
shouldn't be allowed to, at least in case of io_uring. Unless it's
rejected somewhere earlier, in this case it'll fail on the order
check while trying to create a page pool with a zcrx provider.
> (b) what happens with skb coalescing? skb coalescing is already a bit
> of a mess. We don't allow coalescing unreadable and readable skbs, but
> we do allow coalescing devmem and iozcrx skbs which could lead to some
> bugs I'm guessing already. AFAICT as of this patch series we may allow
> coalescing of skbs with netmems inside of them of different sizes, but
> AFAICT so far, the iozcrx assume the size is constant across all the
> netmems it gets, which I'm not sure is always true?
It rejects niovs from other providers incl. from any other io_uring
instances, so it only assumes a uniform size for its own niovs. The
backing memory is verified that it can be chunked.
> For all these reasons I had assumed that we'd need space in the
> net_iov that tells us its size: net_iov->size.
Nope, not in this case.
> And then netmem_size(netmem) would replace all the PAGE_SIZE
> assumptions in the net stack, and then we'd disallow coalescing of
> skbs with different-sized netmems (else we need to handle them
> correctly per the netmem_size).
I'm not even sure what's the concern. What's the difference b/w
tcp_recvmsg_dmabuf() getting one skb with differently sized frags
or same frags in separate skbs? You still need to handle it
somehow, even if by failing.
Also, we should never coalesce different niovs together regardless
of sizes. And for coalescing two chunks of the same niov, it should
work just fine even without knowing the length.
skb_can_coalesce_netmem {
...
return netmem == skb_frag_netmem(frag) &&
off == skb_frag_off(frag) + skb_frag_size(frag);
}
Essentially, for devmem only tcp_recvmsg_dmabuf() and other
devmem specific code would need to know about the niov size.
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 18:18 ` Pavel Begunkov
@ 2025-07-28 20:21 ` Stanislav Fomichev
2025-07-28 21:28 ` Pavel Begunkov
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-28 20:21 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 07/28, Pavel Begunkov wrote:
> On 7/28/25 18:13, Stanislav Fomichev wrote:
> > On 07/28, Pavel Begunkov wrote:
> > > This series implements large rx buffer support for io_uring/zcrx on
> > > top of Jakub's queue configuration changes, but it can also be used
> > > by other memory providers. [...]
> >
> > Supporting big buffers is the right direction, but I have the same
> > feedback:
>
> Let me actually check the feedback for the queue config RFC...
>
> > it would be nice to fit a cohesive story for the devmem as well.
>
> Only the last patch is zcrx specific, the rest is agnostic,
> devmem can absolutely reuse that. I don't think there are any
> issues wiring up devmem?
Right, but patch 2 exposes per-queue rx-buf-len, which
I'm not sure is the right fit for devmem, see below. If all you
care about is exposing it via io_uring, maybe don't expose it from netlink for
now? Although I'm not sure I understand why you're also passing
this per-queue value via io_uring. Can you not inherit it from the
queue config?
> > We should also aim for another use-case where we allocate page pool
> > chunks from the huge page(s),
>
> Separate huge page pool is a bit beyond the scope of this series.
>
> > this should push the perf even more.
>
> And I'm not sure where the "even more" comes from; you can already
> register a huge page with zcrx, and this will allow chunking
> it into 32K or so for the hardware. Is it in terms of applicability,
> or do you have some perf optimisation ideas?
What I'm looking for is a generic system-wide solution where we can
set up the host to use huge pages to back all (even non-zc) networking queues.
Not necessarily needed, but might be an option to try.
> > We need some way to express these things from the UAPI point of view.
>
> Can you elaborate?
>
> > Flipping the rx-buf-len value seems too fragile - there needs to be
> > something to request 32K chunks only for the devmem case, not for the (default)
> > CPU memory. And the queues should go back to default 4K pages when the dmabuf
> > is detached from the queue.
>
> That's what the per-queue config is solving. It's not default, zcrx
> configures it only for the specific queue it allocated, and the value
> is cleared on restart in netdev_rx_queue_restart(), if not even too
> aggressively. Maybe I should just stash it into mp_params to make
> sure it's not cleared if a provider is still attached on a spurious
> restart.
If we assume that at some point niov can be backed by chunks larger
than PAGE_SIZE, the assumed workflow for devmem is:
1. change rx-buf-len to 32K
- this is needed only for devmem, not for CPU RAM, but we'll have
to refill the queues from the main memory anyway
- there is also a question on whether we need to do anything about
MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER - do we just let the driver
allocations fail?
2. attach dmabuf to the queue to refill from dmabuf sgt, essentially wasting
all the effort on (1)
3. on detach, something also needs to remember to reset the rx-buf-len
back to PAGE_SIZE
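(With the existing ethtool ring param, steps (1) and (3) would look roughly
like "ethtool -G eth0 rx-buf-len 32768" before the bind and
"ethtool -G eth0 rx-buf-len 4096" after the detach, assuming the driver
supports the param at all.)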
I was hoping that maybe we can bind rx-buf-len to dmabuf for devmem,
that should avoid all that useless refill from the main memory with
large chunks. But I'm not sure it's the right way to go either.
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 19:42 ` Pavel Begunkov
@ 2025-07-28 20:23 ` Mina Almasry
2025-07-28 20:57 ` Pavel Begunkov
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 20:23 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 12:40 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 7/28/25 19:54, Mina Almasry wrote:
> > On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> This series implements large rx buffer support for io_uring/zcrx on
> >> top of Jakub's queue configuration changes, but it can also be used
> >> by other memory providers. Large rx buffers can be drastically
> >> beneficial with high-end hw-gro enabled cards that can coalesce traffic
> >> into larger pages, reducing the number of frags traversing the network
> >> stack and resulting in larger contiguous chunks of data for the
> >> userspace. Benchmarks showed up to ~30% improvement in CPU util.
> >>
> >
> > Very exciting.
> >
> > I have not yet had a chance to thoroughly look, but even still I have
> > a few high level questions/concerns. Maybe you already have answers to
> > them that can make my life a bit easier as I try to take a thorough
> > look.
> >
> > - I'm a bit confused that you're not making changes to the core net
> > stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
> > that there are potentially a ton of places in the net stack that
> > assume PAGE_SIZE:
>
> The stack already supports large frags and it's not new. Page pools
> has higher order allocations, see __page_pool_alloc_page_order. The
> tx path can allocate large pages / coalesce user pages.
Right, large order allocations are not new, but I'm not sure they
actually work reliably. AFAICT most drivers set pp_params.order = 0;
I'm not sure how well-tested multi-order pages are.
It may be reasonable to assume multi-order pages just work and see
what blows up, though.
> Any specific
> place that concerns you? There are many places legitimately using
> PAGE_SIZE: kmap'ing folios, shifting it by order to get the size,
> linear allocations, etc.
>
From a 5-min look:
- skb_splice_from_iter, this line: size_t part = min_t(size_t,
PAGE_SIZE - off, len);
- skb_pp_cow_data, this line: max_head_size =
SKB_WITH_OVERHEAD(PAGE_SIZE - headroom);
- skb_seq_read, this line: pg_sz = min_t(unsigned int, pg_sz -
st->frag_off, PAGE_SIZE - pg_off
- zerocopy_fill_skb_from_iter, this line: int size = min_t(int,
copied, PAGE_SIZE - start);
I think the `PAGE_SIZE -` logic in general assumes the memory is
PAGE_SIZEd. Although these cases seem page-specific, i.e.
net_iovs wouldn't be exposed to these particular call sites.
I spent a few weeks checking the net stack for all page accesses to prune
all of them to add unreadable netmem... are you somewhat confident
there are no PAGE_SIZE assumptions in the net stack that affect
net_iovs that require a deep look? Or is the approach here to merge
this and see what, if anything, breaks?
> > cd net
> > ack "PAGE_SIZE|PAGE_SHIFT" | wc -l
> > 468
> >
> > Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
> > concerning?
> >
> > - You're not adding a field in the net_iov that tells us how big the
> > net_iov is. It seems to me you're configuring the driver to set the rx
> > buffer size, then assuming all the pp allocations are of that size,
> > then assuming in the zcrx code that all the net_iov are of that size.
> > I think a few problems may happen?
> >
> > (a) what happens if the rx buffer size is re-configured? Does the
> > io_uring zcrx instance get recreated as well?
>
> Any reason you even want it to work? You can't and frankly
> shouldn't be allowed to, at least in case of io_uring. Unless it's
> rejected somewhere earlier, in this case it'll fail on the order
> check while trying to create a page pool with a zcrx provider.
>
I think it's reasonable to disallow rx-buffer-size reconfiguration
when the queue is memory-config bound. I can check to see what this
code is doing.
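Perhaps something as simple as this in the set path, wherever the new
value gets validated (sketch):

	/* reject rx-buf-len changes while a memory provider owns the queue */
	if (rxq->mp_params.mp_ops) {
		NL_SET_ERR_MSG(extack,
			       "can't change rx-buf-len on a bound queue");
		return -EBUSY;
	}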
> > (b) what happens with skb coalescing? skb coalescing is already a bit
> > of a mess. We don't allow coalescing unreadable and readable skbs, but
> > we do allow coalescing devmem and iozcrx skbs which could lead to some
> > bugs I'm guessing already. AFAICT as of this patch series we may allow
> > coalescing of skbs with netmems inside of them of different sizes, but
> > AFAICT so far, the iozcrx assume the size is constant across all the
> > netmems it gets, which I'm not sure is always true?
>
> It rejects niovs from other providers incl. from any other io_uring
> instances, so it only assumes a uniform size for its own niovs.
Thanks. What is 'it' and where is the code that does the rejection?
> The
> backing memory is verified that it can be chunked.
> > For all these reasons I had assumed that we'd need space in the
> > net_iov that tells us its size: net_iov->size.
>
> Nope, not in this case.
>
> > And then netmem_size(netmem) would replace all the PAGE_SIZE
> > assumptions in the net stack, and then we'd disallow coalescing of
> > skbs with different-sized netmems (else we need to handle them
> > correctly per the netmem_size).
> I'm not even sure what's the concern. What's the difference b/w
> tcp_recvmsg_dmabuf() getting one skb with differently sized frags
> or same frags in separate skbs? You still need to handle it
> somehow, even if by failing.
>
Right, I just wanted to understand what the design is. I guess the
design is allowing the netmems in the same skb to have different max
frag lens, yes?
I am guessing that it works, even in tcp_recvmsg_dmabuf. I guess the
frag len is actually in frag->len, so already it may vary from frag to
frag. Even if coalescing happens, some frags would have a frag->len =
PAGE_SIZE and some > PAGE_SIZE. Seems fine to me off the bat.
> Also, we should never coalesce different niovs together regardless
> of sizes. And for coalescing two chunks of the same niov, it should
> work just fine even without knowing the length.
>
Yeah, we should probably not coalesce 2 netmems together, although I
vaguely remember reading code in a net stack helper that does that
somewhere already. Whatever.
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 20:23 ` Mina Almasry
@ 2025-07-28 20:57 ` Pavel Begunkov
0 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 20:57 UTC (permalink / raw)
To: Mina Almasry
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On 7/28/25 21:23, Mina Almasry wrote:
...>>>
>>> - I'm a bit confused that you're not making changes to the core net
>>> stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
>>> that there are potentially a ton of places in the net stack that
>>> assume PAGE_SIZE:
>>
>> The stack already supports large frags and it's not new. Page pools
>> has higher order allocations, see __page_pool_alloc_page_order. The
>> tx path can allocate large pages / coalesce user pages.
>
> Right, large order allocations are not new, but I'm not sure they
> actually work reliably. AFAICT most drivers set pp_params.order = 0;
> I'm not sure how well tested multi-order pages are.
>
> It may be reasonable to assume multi order pages just work and see
> what blows up, though.
>
>> Any specific
>> place that concerns you? There are many places legitimately using
>> PAGE_SIZE: kmap'ing folios, shifting it by order to get the size,
>> linear allocations, etc.
>>
>
> From a 5-min look:
>
> - skb_splice_from_iter, this line: size_t part = min_t(size_t,
> PAGE_SIZE - off, len);
It does it for pages that it got from
iov_iter_extract_pages() a few lines above; those are PAGE_SIZE'd.
> - skb_pp_cow_data, this line: max_head_size =
> SKB_WITH_OVERHEAD(PAGE_SIZE - headroom);
This one should be about the linear part, not frags
> - skb_seq_read, this line: pg_sz = min_t(unsigned int, pg_sz -
> st->frag_off, PAGE_SIZE - pg_off
That's kmap handling; it can iterate a frag multiple times
in PAGE_SIZE chunks for highmem archs.
> - zerocopy_fill_skb_from_iter, this line: int size = min_t(int,
> copied, PAGE_SIZE - start);
Pages from iov_iter_get_pages2(), same as with
skb_splice_from_iter()
> I think the `PAGE_SIZE -` logic in general assumes the memory is
> PAGE_SIZEd. Although in these cases it seems page specifics, i.e.
> net_iovs wouldn't be exposed to these particular call sites.
>
> I spent a few weeks checking the net stack for all page accesses to prune
> all of them to add unreadable netmem... are you somewhat confident
> there are no PAGE_SIZE assumptions in the net stack that affect
> net_iovs that require a deep look? Or is the approach here to merge
The difference is that this one is already supported and the
stack is large-page aware, while unreadable frags were a new
concept.
> this and see what, if anything, breaks?
No reason for it not to work. Even if it breaks somewhere on that,
it should be a pre-existing problem, which needs to be fixed
either way.
>>> cd net
>>> ack "PAGE_SIZE|PAGE_SHIFT" | wc -l
>>> 468
>>>
>>> Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
>>> concerning?
>>>
>>> - You're not adding a field in the net_iov that tells us how big the
>>> net_iov is. It seems to me you're configuring the driver to set the rx
>>> buffer size, then assuming all the pp allocations are of that size,
>>> then assuming in the zcrx code that all the net_iov are of that size.
>>> I think a few problems may happen?
>>>
>>> (a) what happens if the rx buffer size is re-configured? Does the
>>> io_uring zcrx instance get recreated as well?
>>
>> Any reason you even want it to work? You can't and frankly
>> shouldn't be allowed to, at least in case of io_uring. Unless it's
>> rejected somewhere earlier, in this case it'll fail on the order
>> check while trying to create a page pool with a zcrx provider.
>>
>
> I think it's reasonable to disallow rx-buffer-size reconfiguration
> when the queue is memory-config bound. I can check to see what this
> code is doing.
Right, it doesn't make sense to reconfigure zcrx, and we can
only fail the operation one way or another.
>>> (b) what happens with skb coalescing? skb coalescing is already a bit
>>> of a mess. We don't allow coalescing unreadable and readable skbs, but
>>> we do allow coalescing devmem and iozcrx skbs which could lead to some
>>> bugs I'm guessing already. AFAICT as of this patch series we may allow
>>> coalescing of skbs with netmems inside of them of different sizes, but
>>> AFAICT so far, the iozcrx assume the size is constant across all the
>>> netmems it gets, which I'm not sure is always true?
>>
>> It rejects niovs from other providers incl. from any other io_uring
>> instances, so it only assumes a uniform size for its own niovs.
>
> Thanks. What is 'it' and where is the code that does the rejection?
zcrx does, you're familiar with this chunk:
io_uring/zcrx.c:
io_zcrx_recv_frag() {
if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
io_pp_to_ifq(niov->pp) != ifq)
return -EFAULT;
}
>> The
>> backing memory is verified that it can be chunked.
>> > For all these reasons I had assumed that we'd need space in the
>>> net_iov that tells us its size: net_iov->size.
>>
>> Nope, not in this case.
>>
>>> And then netmem_size(netmem) would replace all the PAGE_SIZE
>>> assumptions in the net stack, and then we'd disallow coalescing of
>>> skbs with different-sized netmems (else we need to handle them
>>> correctly per the netmem_size).
>> I'm not even sure what's the concern. What's the difference b/w
>> tcp_recvmsg_dmabuf() getting one skb with differently sized frags
>> or same frags in separate skbs? You still need to handle it
>> somehow, even if by failing.
>>
>
> Right, I just wanted to understand what the design is. I guess the
> design is allowing the netmems in the same skb to have different max
> frag lens, yes?
Yeah, and it's already allowed for higher order pages.
> I am guessing that it works, even in tcp_recvmsg_dmabuf. I guess the
And you won't see it there unless someone adds support for that;
that's why I added this:
if (!net_is_devmem_iov(niov)) {
err = -ENODEV;
goto out;
}
> frag len is actually in frag->len, so already it may vary from frag to
> frag. Even if coalescing happens, some frags would have a frag->len =
> PAGE_SIZE and some > PAGE_SIZE. Seems fine to me off the bat.
>
>> Also, we should never coalesce different niovs together regardless
>> of sizes. And for coalescing two chunks of the same niov, it should
>> work just fine even without knowing the length.
>>
>
> Yeah, we should probably not coalesce 2 netmems together, although I
> vaguely remember reading code in a net stack helper that does that
> somewhere already. Whatever.
Let me know if that turns out to be true, because it should already
be broken. You shouldn't coalesce pages from different folios,
and to check that you need to get the head page / etc., which
niovs obviously don't have.
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 20:21 ` Stanislav Fomichev
@ 2025-07-28 21:28 ` Pavel Begunkov
2025-07-28 22:06 ` Stanislav Fomichev
0 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 21:28 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 7/28/25 21:21, Stanislav Fomichev wrote:
> On 07/28, Pavel Begunkov wrote:
>> On 7/28/25 18:13, Stanislav Fomichev wrote:
...>>> Supporting big buffers is the right direction, but I have the same
>>> feedback:
>>
>> Let me actually check the feedback for the queue config RFC...
>>
>>> it would be nice to fit a cohesive story for the devmem as well.
>>
>> Only the last patch is zcrx specific, the rest is agnostic,
>> devmem can absolutely reuse that. I don't think there are any
>> issues wiring up devmem?
>
> Right, but patch 2 exposes per-queue rx-buf-len, which
> I'm not sure is the right fit for devmem, see below. If all you
I guess you're talking about the uapi setting it, because as an
internal per-queue parameter IMHO it does make sense for devmem.
> care about is exposing it via io_uring, maybe don't expose it from netlink for
Sure, I can remove the set operation.
> now? Although I'm not sure I understand why you're also passing
> this per-queue value via io_uring. Can you not inherit it from the
> queue config?
It's not a great option. It complicates user space with netlink.
And there are convenience configuration features in the future
that require io_uring to parse memory first. E.g. instead of
the user specifying a particular size, it can say "choose the largest
length under 32K that the backing memory allows".
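I.e. something along these lines, building on io_area_max_shift() from
the last patch (sketch, helper name made up):

	/* largest chunk size the area supports, capped by the user limit */
	static unsigned io_zcrx_pick_shift(struct io_zcrx_mem *mem,
					   unsigned limit_shift)
	{
		return min(io_area_max_shift(mem), limit_shift);
	}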
>>> We should also aim for another use-case where we allocate page pool
>>> chunks from the huge page(s),
>>
>> Separate huge page pool is a bit beyond the scope of this series.
>>
>>> this should push the perf even more.
>>
>> And I'm not sure where the "even more" comes from; you can already
>> register a huge page with zcrx, and this will allow chunking
>> it into 32K or so for the hardware. Is it in terms of applicability,
>> or do you have some perf optimisation ideas?
>
> What I'm looking for is a generic system-wide solution where we can
> set up the host to use huge pages to back all (even non-zc) networking queues.
> Not necessarily needed, but might be an option to try.
Probably like what Jakub was once suggesting with the initial memory
provider patch, got it.
>>> We need some way to express these things from the UAPI point of view.
>>
>> Can you elaborate?
>>
>>> Flipping the rx-buf-len value seems too fragile - there needs to be
>>> something to request 32K chunks only for the devmem case, not for the (default)
>>> CPU memory. And the queues should go back to default 4K pages when the dmabuf
>>> is detached from the queue.
>>
>> That's what the per-queue config is solving. It's not default, zcrx
>> configures it only for the specific queue it allocated, and the value
>> is cleared on restart in netdev_rx_queue_restart(), if not even too
>> aggressively. Maybe I should just stash it into mp_params to make
>> sure it's not cleared if a provider is still attached on a spurious
>> restart.
>
> If we assume that at some point niov can be backed by chunks larger
> than PAGE_SIZE, the assumed workflow for devmem is:
> 1. change rx-buf-len to 32K
> - this is needed only for devmem, not for CPU RAM, but we'll have
> to refill the queues from the main memory anyway
Urgh, that's another reason why I prefer to just pass it through
zcrx and not netlink. So maybe you can just pass the len to devmem
on creation, and internally it sets up its queues with it.
> - there is also a question on whether we need to do anything about
> MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER - do we just let the driver
> allocations fail?
> 2. attach dmabuf to the queue to refill from dmabuf sgt, essentially wasting
> all the effort on (1)
> 3. on detach, something also needs to remember to reset the rx-buf-len
> back to PAGE_SIZE
Sure
> I was hoping that maybe we can bind rx-buf-len to dmabuf for devmem,
> that should avoid all that useless refill from the main memory with
> large chunks. But I'm not sure it's the right way to go either.
--
Pavel Begunkov
* Re: [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-07-28 18:11 ` Mina Almasry
@ 2025-07-28 21:36 ` Mina Almasry
2025-08-01 23:13 ` Jakub Kicinski
1 sibling, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 21:36 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
>
> The various zero-copy TCP Rx schemes we have suffer from memory
> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> Documentation/networking/ethtool-netlink.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b6e9af4d0f1b..eaa9c17a3cb1 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
> driver. Driver may impose additional constraints and may not support all
> attributes.
>
> -
> ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
> Completion queue events (CQE) are the events posted by NIC to indicate the
> completion status of a packet when the packet is sent (like send success or
> @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> header / data split feature. If a received packet size is larger than this
> threshold value, header and data will be split.
>
> +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> +uses to receive packets. If the device uses different memory polls for headers
> +and payload this setting may control the size of the header buffers but must
> +control the size of the payload buffers.
> +
To be honest I'm not a big fan of the ambiguity here? Could this
configure just the payload buffer sizes? And a new one to configure
the header buffer sizes eventually?
Also, IIUC in this patchset, the size actually applied will be the
next order larger than the size configured, no? So a setting of 9KB
will actually result in 16KB, no? Should this be documented? Or do we
expect non-power-of-2 sizes to be rejected by the driver and this API
to fail?
--
Thanks,
Mina
* Re: [RFC v1 04/22] net: clarify the meaning of netdev_config members
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
@ 2025-07-28 21:44 ` Mina Almasry
2025-08-01 23:14 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 21:44 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> hds_thresh and hds_config are both inside struct netdev_config
> but have quite different semantics. hds_config is the user config
> with ternary semantics (on/off/unset). hds_thresh is a straight
> up value, populated by the driver at init and only modified by
> user space. We don't expect the drivers to have to pick a special
> hds_thresh value based on other configuration.
>
> The two approaches have different advantages and downsides.
> hds_thresh ("direct value") gives core easy access to current
> device settings, but there's no way to express whether the value
> comes from the user. It also requires the initialization by
> the driver.
>
> hds_config ("user config values") tells us what user wanted, but
> doesn't give us the current value in the core.
>
> Try to explain this a bit in the comments, so that we make a conscious
> choice for new values which semantics we expect.
>
> Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
> Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
> returns a hds_thresh value always as 0.") added the setting for the
> benefit of netdevsim which doesn't touch the value at all on get.
> Again, this is just to clarify the intention, shouldn't cause any
> functional change.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> include/net/netdev_queues.h | 19 +++++++++++++++++--
> net/ethtool/common.c | 3 ++-
> 2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index ba2eaf39089b..81df0794d84c 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -6,11 +6,26 @@
>
> /**
> * struct netdev_config - queue-related configuration for a netdev
> - * @hds_thresh: HDS Threshold value.
> - * @hds_config: HDS value from userspace.
> */
> struct netdev_config {
> + /* Direct value
> + *
> + * Driver default is expected to be fixed, and set in this struct
> + * at init. From that point on user may change the value. There is
> + * no explicit way to "unset" / restore driver default.
> + */
Does the user setting hds_thresh imply turning hds_config to "on"? Or
is hds_thresh only used when hds_config is actually on?
--
Thanks,
Mina
* Re: [RFC v1 05/22] net: add rx_buf_len to netdev config
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
@ 2025-07-28 21:50 ` Mina Almasry
2025-08-01 23:18 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 21:50 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Add rx_buf_len to configuration maintained by the core.
> Use "three-state" semantics where 0 means "driver default".
>
What are three states in the semantics here?
- 0 = driver default.
- non-zero means value set by userspace
What is the 3rd state here?
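(For reference, the existing "user config" ternary for hds_config maps
onto the ETHTOOL_TCP_DATA_SPLIT_* uAPI values, while 0 vs non-zero for
rx_buf_len only encodes two of them:)

	/* from include/uapi/linux/ethtool.h */
	enum {
		ETHTOOL_TCP_DATA_SPLIT_UNKNOWN = 0,	/* unset / driver default */
		ETHTOOL_TCP_DATA_SPLIT_DISABLED,	/* user forced off */
		ETHTOOL_TCP_DATA_SPLIT_ENABLED,		/* user forced on */
	};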
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> include/net/netdev_queues.h | 4 ++++
> net/ethtool/common.c | 1 +
> net/ethtool/rings.c | 2 ++
> 3 files changed, 7 insertions(+)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index 81df0794d84c..eb3a5ac823e6 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -24,6 +24,10 @@ struct netdev_config {
> * If "unset" driver is free to decide, and may change its choice
> * as other parameters change.
> */
> + /** @rx_buf_len: Size of buffers on the Rx ring
> + * (ETHTOOL_A_RINGS_RX_BUF_LEN).
> + */
> + u32 rx_buf_len;
> /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
> */
> u8 hds_config;
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index a87298f659f5..8fdffc77e981 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -832,6 +832,7 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
>
> /* Driver gives us current state, we want to return current config */
> kparam->tcp_data_split = dev->cfg->hds_config;
> + kparam->rx_buf_len = dev->cfg->rx_buf_len;
I'm confused that struct netdev_config is defined in netdev_queues.h,
and is documented to be a queue-related configuration, but doesn't
seem to be actually per queue? This line is grabbing the current
config for this queue from dev->cfg which looks like a shared value.
I don't think rx_buf_len should be a shared value between all the
queues. I strongly think it should be a per-queue value. The
devmem/io_uring queues will probably want large rx_buf_len, but normal
queues will want 0 buf len, me thinks.
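(For what it's worth, patch 13 in this series does add a per-queue
accessor; presumably a per-queue rx_buf_len would be read through
something like:)

	struct netdev_queue_config qcfg;

	/* per-queue view (patch 13), rather than the shared dev->cfg */
	netdev_queue_config(dev, rxq_idx, &qcfg);
	/* qcfg.rx_buf_len would be the effective value for this queue */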
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 21:28 ` Pavel Begunkov
@ 2025-07-28 22:06 ` Stanislav Fomichev
2025-07-28 22:44 ` Pavel Begunkov
2025-07-28 23:22 ` Mina Almasry
0 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-28 22:06 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 07/28, Pavel Begunkov wrote:
> On 7/28/25 21:21, Stanislav Fomichev wrote:
> > On 07/28, Pavel Begunkov wrote:
> > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> ...>>> Supporting big buffers is the right direction, but I have the same
> > > > feedback:
> > >
> > > Let me actually check the feedback for the queue config RFC...
> > >
> > > it would be nice to fit a cohesive story for the devmem as well.
> > >
> > > Only the last patch is zcrx specific, the rest is agnostic,
> > > devmem can absolutely reuse that. I don't think there are any
> > > issues wiring up devmem?
> >
> > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > I'm not sure is the right fit for devmem, see below. If all you
>
> I guess you're talking about uapi setting it, because as an
> internal per queue parameter IMHO it does make sense for devmem.
>
> > care is exposing it via io_uring, maybe don't expose it from netlink for
>
> Sure, I can remove the set operation.
>
> > now? Although I'm not sure I understand why you're also passing
> > this per-queue value via io_uring. Can you not inherit it from the
> > queue config?
>
> It's not a great option. It complicates user space with netlink.
> And there are convenience configuration features in the future
> that requires io_uring to parse memory first. E.g. instead of
> user specifying a particular size, it can say "choose the largest
> length under 32K that the backing memory allows".
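(For concreteness, a "largest length under 32K that the memory allows"
pick could be as simple as the sketch below; the name is hypothetical and
nothing of the sort is in the series:)

	#include <linux/bitops.h>
	#include <linux/minmax.h>
	#include <linux/sizes.h>

	/* Largest power-of-2 buffer length, capped at 32K, that the
	 * registered region's alignment and size allow. Assumes
	 * addr | len != 0. */
	static u32 zcrx_pick_buf_len(u64 addr, u64 len)
	{
		u64 align = 1ULL << __ffs(addr | len);

		return min3((u64)SZ_32K, align, len);
	}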
Don't you already need a bunch of netlink to setup rss and flow
steering? And if we end up adding queue api, you'll have to call that
one over netlink also.
> > > > We should also aim for another use-case where we allocate page pool
> > > > chunks from the huge page(s),
> > >
> > > Separate huge page pool is a bit beyond the scope of this series.
> > >
> > > this should push the perf even more.
> > >
> > > And not sure about "even more" is from, you can already
> > > register a huge page with zcrx, and this will allow to chunk
> > > them to 32K or so for hardware. Is it in terms of applicability
> > > or you have some perf optimisation ideas?
> >
> > What I'm looking for is a generic system-wide solution where we can
> > set up the host to use huge pages to back all (even non-zc) networking queues.
> > Not necessary needed, but might be an option to try.
>
> Probably like what Jakub was once suggesting with the initial memory
> provider patch, got it.
>
> > > > We need some way to express these things from the UAPI point of view.
> > >
> > > Can you elaborate?
> > >
> > > > Flipping the rx-buf-len value seems too fragile - there needs to be
> > > > something to request 32K chunks only for devmem case, not for the (default)
> > > > CPU memory. And the queues should go back to default 4K pages when the dmabuf
> > > > is detached from the queue.
> > >
> > > That's what the per-queue config is solving. It's not default, zcrx
> > > configures it only for the specific queue it allocated, and the value
> > > is cleared on restart in netdev_rx_queue_restart(), if not even too
> > > aggressively. Maybe I should just stash it into mp_params to make
> > > sure it's not cleared if a provider is still attached on a spurious
> > > restart.
> >
> > If we assume that at some point niov can be backed up by chunks larger
> > than PAGE_SIZE, the assumed workflow for devmem is:
> > 1. change rx-buf-len to 32K
> > - this is needed only for devmem, but not for CPU RAM, but we'll have
> > to refill the queues from the main memory anyway
>
> Urgh, that's another reason why I prefer to just pass it through
> zcrx and not netlink. So maybe you can just pass the len to devmem
> on creation, and internally it sets up its queues with it.
But you still need to solve MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER I
think? We don't want the drivers to do PAGE_ALLOC_COSTLY_ORDER costly
allocation presumably?
* Re: [RFC v1 13/22] net: add queue config validation callback
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
@ 2025-07-28 22:26 ` Mina Almasry
0 siblings, 0 replies; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 22:26 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> I imagine (tm) that as the number of per-queue configuration
> options grows some of them may conflict for certain drivers.
> While the drivers can obviously do all the validation locally
> doing so is fairly inconvenient as the config is fed to drivers
> piecemeal via different ops (for different params and NIC-wide
> vs per-queue).
>
> Add a centralized callback for validating the queue config
> in queue ops. The callback gets invoked before each queue restart
> and when ring params are modified.
>
> For NIC-wide changes the callback gets invoked for each active
> (or active to-be) queue, and additionally with a negative queue
> index for NIC-wide defaults. The NIC-wide check is needed in
> case all queues have an override active when NIC-wide setting
> is changed to an unsupported one. Alternatively we could check
> the settings when new queues are enabled (in the channel API),
> but accepting invalid config is a bad idea. Users may expect
> that resetting a queue override will always work.
>
> The "trick" of passing a negative index is a bit ugly, we may
> want to revisit if it causes confusion and bugs. Existing drivers
> don't care about the index so it "just works".
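(To illustrate the convention, a hypothetical driver callback — the name
and the power-of-2 rule are made up here, only the idx < 0 convention
comes from the patch — might look like:)

	static int foo_queue_cfg_validate(struct net_device *dev, int idx,
					  struct netdev_queue_config *qcfg,
					  struct netlink_ext_ack *extack)
	{
		/* idx < 0 means "validate the NIC-wide defaults" */
		if (qcfg->rx_buf_len && !is_power_of_2(qcfg->rx_buf_len)) {
			NL_SET_ERR_MSG(extack, "rx-buf-len must be a power of 2");
			return -EINVAL;
		}
		return 0;
	}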
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> include/net/netdev_queues.h | 12 ++++++++++++
> net/core/dev.h | 2 ++
> net/core/netdev_config.c | 20 ++++++++++++++++++++
> net/core/netdev_rx_queue.c | 6 ++++++
> net/ethtool/rings.c | 5 +++++
> 5 files changed, 45 insertions(+)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index e3e7ecf91bac..f75313fc78ba 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -146,6 +146,14 @@ void netdev_stat_queue_sum(struct net_device *netdev,
> * defaults. Queue config structs are passed to this
> * helper before the user-requested settings are applied.
> *
> + * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
> + * Called when configuration affecting a queue may be
> + * changing, either due to NIC-wide config, or config
> + * scoped to the queue at a specified index.
> + * When NIC-wide config is changed the callback will
> + * be invoked for all queues, and in addition to that
> + * with a negative queue index for the base settings.
> + *
> * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
> * The new memory is written at the specified address.
> *
> @@ -166,6 +174,10 @@ struct netdev_queue_mgmt_ops {
> void (*ndo_queue_cfg_defaults)(struct net_device *dev,
> int idx,
> struct netdev_queue_config *qcfg);
> + int (*ndo_queue_cfg_validate)(struct net_device *dev,
> + int idx,
> + struct netdev_queue_config *qcfg,
> + struct netlink_ext_ack *extack);
> int (*ndo_queue_mem_alloc)(struct net_device *dev,
> struct netdev_queue_config *qcfg,
> void *per_queue_mem,
> diff --git a/net/core/dev.h b/net/core/dev.h
> index 6d7f5e920018..e0d433fb6325 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
> int netdev_reconfig_start(struct net_device *dev);
> void __netdev_queue_config(struct net_device *dev, int rxq,
> struct netdev_queue_config *qcfg, bool pending);
> +int netdev_queue_config_revalidate(struct net_device *dev,
> + struct netlink_ext_ack *extack);
>
> /* netdev management, shared between various uAPI entry points */
> struct netdev_name_node {
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index bad2d53522f0..fc700b77e4eb 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
> __netdev_queue_config(dev, rxq, qcfg, true);
> }
> EXPORT_SYMBOL(netdev_queue_config);
> +
> +int netdev_queue_config_revalidate(struct net_device *dev,
> + struct netlink_ext_ack *extack)
> +{
> + const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> + struct netdev_queue_config qcfg;
> + int i, err;
> +
> + if (!qops || !qops->ndo_queue_cfg_validate)
> + return 0;
> +
Shouldn't this be return -EOPNOTSUPP or something? Otherwise how
do you protect drivers (GVE) that support the queue API but don't
support configuring a particular netdev_queue_config from core assuming
that the configuration took place?
> + for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
> + netdev_queue_config(dev, i, &qcfg);
> + err = qops->ndo_queue_cfg_validate(dev, i, &qcfg, extack);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index b0523eb44e10..7c691eb1a48b 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -37,6 +37,12 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
>
> netdev_queue_config(dev, rxq_idx, &qcfg);
>
> + if (qops->ndo_queue_cfg_validate) {
> + err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
> + if (err)
> + goto err_free_old_mem;
> + }
> +
> err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
> if (err)
> goto err_free_old_mem;
> diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> index 6a74e7e4064e..7884d10c090f 100644
> --- a/net/ethtool/rings.c
> +++ b/net/ethtool/rings.c
> @@ -4,6 +4,7 @@
>
> #include "netlink.h"
> #include "common.h"
> +#include "../core/dev.h"
>
> struct rings_req_info {
> struct ethnl_req_info base;
> @@ -307,6 +308,10 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
> dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
>
> + ret = netdev_queue_config_revalidate(dev, info->extack);
> + if (ret)
> + return ret;
> +
> ret = dev->ethtool_ops->set_ringparam(dev, &ringparam,
> &kernel_ringparam, info->extack);
> return ret < 0 ? ret : 1;
> --
> 2.49.0
>
--
Thanks,
Mina
* Re: [RFC v1 15/22] eth: bnxt: store the rx buf size per queue
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
@ 2025-07-28 22:33 ` Mina Almasry
2025-08-01 23:20 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 22:33 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> In normal operation only a subset of queues is configured for
> zero-copy. Since zero-copy is the main use for larger buffer
> sizes we need to configure the sizes per queue.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
I wonder if this is necessary for some reason, or is it better to
expect the driver to refer to the netdev->qcfgs directly?
By my count the configs can now live in 4 places: the core netdev
config, the core per-queue config, the driver netdev config, and the
driver per-queue config.
Honestly, I'm not sure about duplicating settings between the netdev
configs and the per-queue configs in the first place (seems like
configs should be either driver wide or per-queue to me, and not
both), and I'm less sure about again duplicating the settings between
core structs and in-driver structs. Seems like the same information
duplicated in many places and a nightmare to keep it all in sync.
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 22:06 ` Stanislav Fomichev
@ 2025-07-28 22:44 ` Pavel Begunkov
2025-07-29 16:33 ` Stanislav Fomichev
2025-07-28 23:22 ` Mina Almasry
1 sibling, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-28 22:44 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 7/28/25 23:06, Stanislav Fomichev wrote:
> On 07/28, Pavel Begunkov wrote:
>> On 7/28/25 21:21, Stanislav Fomichev wrote:
>>> On 07/28, Pavel Begunkov wrote:
>>>> On 7/28/25 18:13, Stanislav Fomichev wrote:
>> ...>>> Supporting big buffers is the right direction, but I have the same
>>>>> feedback:
>>>>
>>>> Let me actually check the feedback for the queue config RFC...
>>>>
>>>> it would be nice to fit a cohesive story for the devmem as well.
>>>>
>>>> Only the last patch is zcrx specific, the rest is agnostic,
>>>> devmem can absolutely reuse that. I don't think there are any
>>>> issues wiring up devmem?
>>>
>>> Right, but the patch number 2 exposes per-queue rx-buf-len which
>>> I'm not sure is the right fit for devmem, see below. If all you
>>
>> I guess you're talking about uapi setting it, because as an
>> internal per queue parameter IMHO it does make sense for devmem.
>>
>>> care is exposing it via io_uring, maybe don't expose it from netlink for
>>
>> Sure, I can remove the set operation.
>>
>>> now? Although I'm not sure I understand why you're also passing
>>> this per-queue value via io_uring. Can you not inherit it from the
>>> queue config?
>>
>> It's not a great option. It complicates user space with netlink.
>> And there are convenience configuration features in the future
>> that requires io_uring to parse memory first. E.g. instead of
>> user specifying a particular size, it can say "choose the largest
>> length under 32K that the backing memory allows".
>
> Don't you already need a bunch of netlink to setup rss and flow
Could be needed, but there are cases where configuration and
virtual queue selection is done outside the program. I'll need
to ask which option we currently use.
> steering? And if we end up adding queue api, you'll have to call that
> one over netlink also.
There is already a queue api, even though it's cropped IIUC.
What kind of extra setup you have in mind?
>>>
>>> If we assume that at some point niov can be backed up by chunks larger
>>> than PAGE_SIZE, the assumed workflow for devmem is:
>>> 1. change rx-buf-len to 32K
>>> - this is needed only for devmem, but not for CPU RAM, but we'll have
>>> to refill the queues from the main memory anyway
>>
>> Urgh, that's another reason why I prefer to just pass it through
>> zcrx and not netlink. So maybe you can just pass the len to devmem
>> on creation, and internally it sets up its queues with it.
>
> But you still need to solve MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER I
> think? We don't want the drivers to do PAGE_ALLOC_COSTLY_ORDER costly
> allocation presumably?
#define PAGE_ALLOC_COSTLY_ORDER 3
It's "costly" for the page allocator and not a custom specially
cooked memory providers. Nobody should care as long as the length
applies to the given provider only. MAX_PAGE_ORDER also seems to
be a page allocator thing.
--
Pavel Begunkov
* Re: [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
@ 2025-07-28 23:10 ` Mina Almasry
2025-08-01 23:37 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 23:10 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Zero-copy APIs increase the cost of buffer management. They also extend
> this cost to user space applications which may be used to dealing with
> much larger buffers. Allow setting rx-buf-len per queue, devices with
> HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1
> but that's not a power of 2..)
>
> The implementation adds a new option to the netdev netlink, rather
> than ethtool. The NIC-wide setting lives in ethtool ringparams so
> one could argue that we should be extending the ethtool API.
> OTOH netdev API is where we already have queue-get, and it's how
> zero-copy applications bind memory providers.
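(Assuming the spec below lands as-is, a hypothetical invocation from
userspace via the ynl CLI would be roughly:)

	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
	      --do queue-set \
	      --json '{"ifindex": 2, "type": "rx", "id": 10, "rx-buf-len": 32768}'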
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 15 ++++
> include/net/netdev_queues.h | 5 ++
> include/net/netlink.h | 19 +++++
> include/uapi/linux/netdev.h | 2 +
> net/core/netdev-genl-gen.c | 15 ++++
> net/core/netdev-genl-gen.h | 1 +
> net/core/netdev-genl.c | 92 +++++++++++++++++++++++++
> net/core/netdev_config.c | 16 +++++
> tools/include/uapi/linux/netdev.h | 2 +
> 9 files changed, 167 insertions(+)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c0ef6d0d7786..5dd1eb5909cd 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -324,6 +324,10 @@ attribute-sets:
> doc: XSK information for this queue, if any.
> type: nest
> nested-attributes: xsk-info
> + -
> + name: rx-buf-len
> + doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
> + type: u32
> -
> name: qstats
> doc: |
> @@ -755,6 +759,17 @@ operations:
> reply:
> attributes:
> - id
> + -
> + name: queue-set
> + doc: Set per-queue configurable options.
> + attribute-set: queue
> + do:
> + request:
> + attributes:
> + - ifindex
> + - type
> + - id
> + - rx-buf-len
>
> kernel-family:
> headers: [ "net/netdev_netlink.h"]
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index f75313fc78ba..cfd2d59861e1 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -38,6 +38,7 @@ struct netdev_config {
>
> /* Same semantics as fields in struct netdev_config */
> struct netdev_queue_config {
> + u32 rx_buf_len;
> };
>
> /* See the netdev.yaml spec for definition of each statistic */
> @@ -140,6 +141,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
> /**
> * struct netdev_queue_mgmt_ops - netdev ops for queue management
> *
> + * @supported_ring_params: ring params supported per queue (ETHTOOL_RING_USE_*).
> + *
I don't see this used anywhere.
But more generally, I'm a bit concerned about protecting drivers that
don't support configuring one particular queue config. I think
supported_ring_params likely needs to be moved earlier, to the patch
which adds per-queue netdev_configs to the queue API, and probably as
part of that patch core needs to make sure it's never asking a driver
that doesn't support changing a netdev_queue_config to do so.
Some thought may be given to moving the entire configuration story
outside of queue_mem_alloc/free and queue_start/stop altogether, to new
ndos, where core can easily treat a missing ndo as "per-queue config
not supported". Otherwise core needs to be careful never to attempt a
config that is not supported?
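(Something like the following in the core paths, gating on the mask the
patch documents — the placement is hypothetical:)

	/* reject a per-queue override the driver never claimed to support */
	if (qcfg->rx_buf_len &&
	    !(dev->queue_mgmt_ops->supported_ring_params &
	      ETHTOOL_RING_USE_RX_BUF_LEN))
		return -EOPNOTSUPP;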
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 22:06 ` Stanislav Fomichev
2025-07-28 22:44 ` Pavel Begunkov
@ 2025-07-28 23:22 ` Mina Almasry
2025-07-29 16:41 ` Stanislav Fomichev
1 sibling, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-28 23:22 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Pavel Begunkov, Jakub Kicinski, netdev, io-uring, Eric Dumazet,
Willem de Bruijn, Paolo Abeni, andrew+netdev, horms, davem, sdf,
dw, michael.chan, dtatulea, ap420073
On Mon, Jul 28, 2025 at 3:06 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 07/28, Pavel Begunkov wrote:
> > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > On 07/28, Pavel Begunkov wrote:
> > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > feedback:
> > > >
> > > > Let me actually check the feedback for the queue config RFC...
> > > >
> > > > it would be nice to fit a cohesive story for the devmem as well.
> > > >
> > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > devmem can absolutely reuse that. I don't think there are any
> > > > issues wiring up devmem?
> > >
> > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > I'm not sure is the right fit for devmem, see below. If all you
> >
> > I guess you're talking about uapi setting it, because as an
> > internal per queue parameter IMHO it does make sense for devmem.
> >
> > > care is exposing it via io_uring, maybe don't expose it from netlink for
> >
> > Sure, I can remove the set operation.
> >
> > > now? Although I'm not sure I understand why you're also passing
> > > this per-queue value via io_uring. Can you not inherit it from the
> > > queue config?
> >
> > It's not a great option. It complicates user space with netlink.
> > And there are convenience configuration features in the future
> > that requires io_uring to parse memory first. E.g. instead of
> > user specifying a particular size, it can say "choose the largest
> > length under 32K that the backing memory allows".
>
> Don't you already need a bunch of netlink to setup rss and flow
> steering? And if we end up adding queue api, you'll have to call that
> one over netlink also.
>
I'm thinking one thing that could work is extending bind-rx with an
optional rx-buf-len arg, which in the code translates into devmem
using the new net_mp_open_rxq variant which not only restarts the
queue but also sets the size. From there the implementation should be
fairly straightforward in devmem. devmem currently rejects any pp for
which pp.order != 0. It would need to start accepting that and
forwarding the order to the gen_pool doing the allocations, etc.
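(In code this could be as small as the sketch below; binding->chunk_pool
is today's devmem member, the surrounding logic is hypothetical:)

	/* accept pp.order > 0 and carve matching chunks out of
	 * the binding's gen_pool */
	size_t chunk_size = PAGE_SIZE << pool->p.order;
	unsigned long chunk = gen_pool_alloc(binding->chunk_pool, chunk_size);

	if (!chunk)
		return NULL;	/* pool exhausted for this size */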
--
Thanks,
Mina
* Re: [RFC v1 02/22] net: ethtool: report max value for rx-buf-len
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
@ 2025-07-29 5:00 ` Subbaraya Sundeep
0 siblings, 0 replies; 66+ messages in thread
From: Subbaraya Sundeep @ 2025-07-29 5:00 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
LGTM.
Thanks,
Sundeep
On 2025-07-28 at 11:04:06, Pavel Begunkov (asml.silence@gmail.com) wrote:
> From: Jakub Kicinski <kuba@kernel.org>
>
> Unlike most of our APIs the rx-buf-len param does not have an associated
> max value. In theory user could set this value pretty high, but in
> practice most NICs have limits due to the width of the length fields
> in the descriptors.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> Documentation/netlink/specs/ethtool.yaml | 4 ++++
> Documentation/networking/ethtool-netlink.rst | 1 +
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++-
> include/linux/ethtool.h | 2 ++
> include/uapi/linux/ethtool_netlink_generated.h | 1 +
> net/ethtool/rings.c | 5 +++++
> 6 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
> index 72a076b0e1b5..cb96b4e7093f 100644
> --- a/Documentation/netlink/specs/ethtool.yaml
> +++ b/Documentation/netlink/specs/ethtool.yaml
> @@ -361,6 +361,9 @@ attribute-sets:
> -
> name: hds-thresh-max
> type: u32
> + -
> + name: rx-buf-len-max
> + type: u32
>
> -
> name: mm-stat
> @@ -1811,6 +1814,7 @@ operations:
> - rx-jumbo
> - tx
> - rx-buf-len
> + - rx-buf-len-max
> - tcp-data-split
> - cqe-size
> - tx-push
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index eaa9c17a3cb1..b7a99dfdffa9 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -893,6 +893,7 @@ Kernel response contents:
> ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
> ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
> ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
> + ``ETHTOOL_A_RINGS_RX_BUF_LEN_MAX`` u32 max size of rx buffers
> ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
> ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
> ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> index 45b8c9230184..7bdef64926c8 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> @@ -376,6 +376,7 @@ static void otx2_get_ringparam(struct net_device *netdev,
> ring->tx_max_pending = Q_COUNT(Q_SIZE_MAX);
> ring->tx_pending = qs->sqe_cnt ? qs->sqe_cnt : Q_COUNT(Q_SIZE_4K);
> kernel_ring->rx_buf_len = pfvf->hw.rbuf_len;
> + kernel_ring->rx_buf_len_max = 32768;
> kernel_ring->cqe_size = pfvf->hw.xqe_size;
> }
>
> @@ -398,7 +399,7 @@ static int otx2_set_ringparam(struct net_device *netdev,
> /* Hardware supports max size of 32k for a receive buffer
> * and 1536 is typical ethernet frame size.
> */
> - if (rx_buf_len && (rx_buf_len < 1536 || rx_buf_len > 32768)) {
> + if (rx_buf_len && (rx_buf_len < 1536)) {
> netdev_err(netdev,
> "Receive buffer range is 1536 - 32768");
> return -EINVAL;
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 5e0dd333ad1f..dd9f253a56ae 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -77,6 +77,7 @@ enum {
> /**
> * struct kernel_ethtool_ringparam - RX/TX ring configuration
> * @rx_buf_len: Current length of buffers on the rx ring.
> + * @rx_buf_len_max: Max length of buffers on the rx ring.
> * @tcp_data_split: Scatter packet headers and data to separate buffers
> * @tx_push: The flag of tx push mode
> * @rx_push: The flag of rx push mode
> @@ -89,6 +90,7 @@ enum {
> */
> struct kernel_ethtool_ringparam {
> u32 rx_buf_len;
> + u32 rx_buf_len_max;
> u8 tcp_data_split;
> u8 tx_push;
> u8 rx_push;
> diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
> index aa8ab5227c1e..1a76e6789e33 100644
> --- a/include/uapi/linux/ethtool_netlink_generated.h
> +++ b/include/uapi/linux/ethtool_netlink_generated.h
> @@ -164,6 +164,7 @@ enum {
> ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
> ETHTOOL_A_RINGS_HDS_THRESH,
> ETHTOOL_A_RINGS_HDS_THRESH_MAX,
> + ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
>
> __ETHTOOL_A_RINGS_CNT,
> ETHTOOL_A_RINGS_MAX = (__ETHTOOL_A_RINGS_CNT - 1)
> diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> index aeedd5ec6b8c..5e872ceab5dd 100644
> --- a/net/ethtool/rings.c
> +++ b/net/ethtool/rings.c
> @@ -105,6 +105,9 @@ static int rings_fill_reply(struct sk_buff *skb,
> ringparam->tx_pending))) ||
> (kr->rx_buf_len &&
> (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN, kr->rx_buf_len))) ||
> + (kr->rx_buf_len_max &&
> + (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
> + kr->rx_buf_len_max))) ||
> (kr->tcp_data_split &&
> (nla_put_u8(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT,
> kr->tcp_data_split))) ||
> @@ -281,6 +284,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> err_attr = tb[ETHTOOL_A_RINGS_TX];
> else if (kernel_ringparam.hds_thresh > kernel_ringparam.hds_thresh_max)
> err_attr = tb[ETHTOOL_A_RINGS_HDS_THRESH];
> + else if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max)
> + err_attr = tb[ETHTOOL_A_RINGS_RX_BUF_LEN];
> else
> err_attr = NULL;
> if (err_attr) {
> --
> 2.49.0
>
* Re: [RFC v1 03/22] net: use zero value to restore rx_buf_len to default
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
@ 2025-07-29 5:03 ` Subbaraya Sundeep
0 siblings, 0 replies; 66+ messages in thread
From: Subbaraya Sundeep @ 2025-07-29 5:03 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
LGTM.
Thanks,
Sundeep
On 2025-07-28 at 11:04:07, Pavel Begunkov (asml.silence@gmail.com) wrote:
> From: Jakub Kicinski <kuba@kernel.org>
>
> Distinguish between rx_buf_len being driver default vs user config.
> Use 0 as a special value meaning "unset" or "restore driver default".
> This will be necessary later on to configure it per-queue, but
> the ability to restore defaults may be useful in itself.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> Documentation/networking/ethtool-netlink.rst | 2 +-
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 +++
> include/linux/ethtool.h | 1 +
> net/ethtool/rings.c | 2 +-
> 4 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b7a99dfdffa9..723f8e1a33a7 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -974,7 +974,7 @@ threshold value, header and data will be split.
> ``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> uses to receive packets. If the device uses different memory pools for headers
> and payload this setting may control the size of the header buffers but must
> -control the size of the payload buffers.
> +control the size of the payload buffers. Setting to 0 restores driver default.
>
> CHANNELS_GET
> ============
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> index 7bdef64926c8..1a74a7b81ac1 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
> @@ -396,6 +396,9 @@ static int otx2_set_ringparam(struct net_device *netdev,
> if (ring->rx_mini_pending || ring->rx_jumbo_pending)
> return -EINVAL;
>
> + if (!rx_buf_len)
> + rx_buf_len = OTX2_DEFAULT_RBUF_LEN;
> +
> /* Hardware supports max size of 32k for a receive buffer
> * and 1536 is typical ethernet frame size.
> */
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index dd9f253a56ae..bbc5c485bfbf 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -77,6 +77,7 @@ enum {
> /**
> * struct kernel_ethtool_ringparam - RX/TX ring configuration
> * @rx_buf_len: Current length of buffers on the rx ring.
> + * Setting to 0 means reset to driver default.
> * @rx_buf_len_max: Max length of buffers on the rx ring.
> * @tcp_data_split: Scatter packet headers and data to separate buffers
> * @tx_push: The flag of tx push mode
> diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> index 5e872ceab5dd..628546a1827b 100644
> --- a/net/ethtool/rings.c
> +++ b/net/ethtool/rings.c
> @@ -139,7 +139,7 @@ const struct nla_policy ethnl_rings_set_policy[] = {
> [ETHTOOL_A_RINGS_RX_MINI] = { .type = NLA_U32 },
> [ETHTOOL_A_RINGS_RX_JUMBO] = { .type = NLA_U32 },
> [ETHTOOL_A_RINGS_TX] = { .type = NLA_U32 },
> - [ETHTOOL_A_RINGS_RX_BUF_LEN] = NLA_POLICY_MIN(NLA_U32, 1),
> + [ETHTOOL_A_RINGS_RX_BUF_LEN] = { .type = NLA_U32 },
> [ETHTOOL_A_RINGS_TCP_DATA_SPLIT] =
> NLA_POLICY_MAX(NLA_U8, ETHTOOL_TCP_DATA_SPLIT_ENABLED),
> [ETHTOOL_A_RINGS_CQE_SIZE] = NLA_POLICY_MIN(NLA_U32, 1),
> --
> 2.49.0
>
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 22:44 ` Pavel Begunkov
@ 2025-07-29 16:33 ` Stanislav Fomichev
2025-07-30 14:16 ` Pavel Begunkov
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-29 16:33 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 07/28, Pavel Begunkov wrote:
> On 7/28/25 23:06, Stanislav Fomichev wrote:
> > On 07/28, Pavel Begunkov wrote:
> > > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > > On 07/28, Pavel Begunkov wrote:
> > > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > > feedback:
> > > > >
> > > > > Let me actually check the feedback for the queue config RFC...
> > > > >
> > > > > it would be nice to fit a cohesive story for the devmem as well.
> > > > >
> > > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > > devmem can absolutely reuse that. I don't think there are any
> > > > > issues wiring up devmem?
> > > >
> > > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > > I'm not sure is the right fit for devmem, see below. If all you
> > >
> > > I guess you're talking about uapi setting it, because as an
> > > internal per queue parameter IMHO it does make sense for devmem.
> > >
> > > > care is exposing it via io_uring, maybe don't expose it from netlink for
> > >
> > > Sure, I can remove the set operation.
> > >
> > > > now? Although I'm not sure I understand why you're also passing
> > > > this per-queue value via io_uring. Can you not inherit it from the
> > > > queue config?
> > >
> > > It's not a great option. It complicates user space with netlink.
> > > And there are convenience configuration features in the future
> > > that requires io_uring to parse memory first. E.g. instead of
> > > user specifying a particular size, it can say "choose the largest
> > > length under 32K that the backing memory allows".
> >
> > Don't you already need a bunch of netlink to setup rss and flow
>
> Could be needed, but there are cases where configuration and
> virtual queue selection is done outside the program. I'll need
> to ask which option we currently use.
If the setup is done outside, you can also setup rx-buf-len outside, no?
> > steering? And if we end up adding queue api, you'll have to call that
> > one over netlink also.
>
> There is already a queue api, even though it's cropped IIUC.
> What kind of extra setup you have in mind?
I'm talking about allocating the queues. Currently the zc/devmem setup is
a bit complicated, we need to partition the queues and rss+flow
steer into a subset of zerocopy ones. In the future we might add some apis
to request a new dedicated queue for the specific flow(s). That should
hopefully simplify the design (and make the cleanup of the queues more
robust if the application dies).
> > > > If we assume that at some point niov can be backed up by chunks larger
> > > > than PAGE_SIZE, the assumed workflow for devmem is:
> > > > 1. change rx-buf-len to 32K
> > > > - this is needed only for devmem, but not for CPU RAM, but we'll have
> > > > to refill the queues from the main memory anyway
> > >
> > > Urgh, that's another reason why I prefer to just pass it through
> > > zcrx and not netlink. So maybe you can just pass the len to devmem
> > > on creation, and internally it sets up its queues with it.
> >
> > But you still need to solve MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER I
> > think? We don't want the drivers to do PAGE_ALLOC_COSTLY_ORDER costly
> > allocation presumably?
>
> #define PAGE_ALLOC_COSTLY_ORDER 3
>
> It's "costly" for the page allocator and not a custom specially
> cooked memory providers. Nobody should care as long as the length
> applies to the given provider only. MAX_PAGE_ORDER also seems to
> be a page allocator thing.
By custom memory providers you mean page pool? Thinking about it more,
maybe it's fine as is as long as we have ndo_queue_cfg_validate that
enforces sensible ranges..
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-28 23:22 ` Mina Almasry
@ 2025-07-29 16:41 ` Stanislav Fomichev
2025-07-29 17:01 ` Mina Almasry
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-29 16:41 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, Jakub Kicinski, netdev, io-uring, Eric Dumazet,
Willem de Bruijn, Paolo Abeni, andrew+netdev, horms, davem, sdf,
dw, michael.chan, dtatulea, ap420073
On 07/28, Mina Almasry wrote:
> On Mon, Jul 28, 2025 at 3:06 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 07/28, Pavel Begunkov wrote:
> > > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > > On 07/28, Pavel Begunkov wrote:
> > > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > > feedback:
> > > > >
> > > > > Let me actually check the feedback for the queue config RFC...
> > > > >
> > > > > it would be nice to fit a cohesive story for the devmem as well.
> > > > >
> > > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > > devmem can absolutely reuse that. I don't think there are any
> > > > > issues wiring up devmem?
> > > >
> > > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > > I'm not sure is the right fit for devmem, see below. If all you
> > >
> > > I guess you're talking about uapi setting it, because as an
> > > internal per queue parameter IMHO it does make sense for devmem.
> > >
> > > > care is exposing it via io_uring, maybe don't expose it from netlink for
> > >
> > > Sure, I can remove the set operation.
> > >
> > > > now? Although I'm not sure I understand why you're also passing
> > > > this per-queue value via io_uring. Can you not inherit it from the
> > > > queue config?
> > >
> > > It's not a great option. It complicates user space with netlink.
> > > And there are convenience configuration features in the future
> > > that requires io_uring to parse memory first. E.g. instead of
> > > user specifying a particular size, it can say "choose the largest
> > > length under 32K that the backing memory allows".
> >
> > Don't you already need a bunch of netlink to setup rss and flow
> > steering? And if we end up adding queue api, you'll have to call that
> > one over netlink also.
> >
>
> I'm thinking one thing that could work is extending bind-rx with an
> optional rx-buf-len arg, which in the code translates into devmem
> using the new net_mp_open_rxq variant which not only restarts the
> queue but also sets the size. From there the implementation should be
> fairly straightforward in devmem. devmem currently rejects any pp for
> which pp.order != 0. It would need to start accepting that and
> forwarding the order to the gen_pool doing the allocations, etc.
Right, that's the logical alternative, to put that rx-buf-len on the
binding to control the size of the niovs. But then what do we do with
the queue's rx-buf-len? bnxt patch in the series does
page_pool_dev_alloc_frag(..., bp->rx_page_size). bp->rx_page_size comes
from netlink. Does it need to be inherited from the pp in the devmem
case somehow?
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-29 16:41 ` Stanislav Fomichev
@ 2025-07-29 17:01 ` Mina Almasry
0 siblings, 0 replies; 66+ messages in thread
From: Mina Almasry @ 2025-07-29 17:01 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Pavel Begunkov, Jakub Kicinski, netdev, io-uring, Eric Dumazet,
Willem de Bruijn, Paolo Abeni, andrew+netdev, horms, davem, sdf,
dw, michael.chan, dtatulea, ap420073
On Tue, Jul 29, 2025 at 9:41 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 07/28, Mina Almasry wrote:
> > On Mon, Jul 28, 2025 at 3:06 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 07/28, Pavel Begunkov wrote:
> > > > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > > > On 07/28, Pavel Begunkov wrote:
> > > > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > > > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > > > feedback:
> > > > > >
> > > > > > Let me actually check the feedback for the queue config RFC...
> > > > > >
> > > > > > it would be nice to fit a cohesive story for the devmem as well.
> > > > > >
> > > > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > > > devmem can absolutely reuse that. I don't think there are any
> > > > > > issues wiring up devmem?
> > > > >
> > > > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > > > I'm not sure is the right fit for devmem, see below. If all you
> > > >
> > > > I guess you're talking about uapi setting it, because as an
> > > > internal per queue parameter IMHO it does make sense for devmem.
> > > >
> > > > > care is exposing it via io_uring, maybe don't expose it from netlink for
> > > >
> > > > Sure, I can remove the set operation.
> > > >
> > > > > now? Although I'm not sure I understand why you're also passing
> > > > > this per-queue value via io_uring. Can you not inherit it from the
> > > > > queue config?
> > > >
> > > > It's not a great option. It complicates user space with netlink.
> > > > And there are convenience configuration features in the future
> > > > that requires io_uring to parse memory first. E.g. instead of
> > > > user specifying a particular size, it can say "choose the largest
> > > > length under 32K that the backing memory allows".
> > >
> > > Don't you already need a bunch of netlink to setup rss and flow
> > > steering? And if we end up adding queue api, you'll have to call that
> > > one over netlink also.
> > >
> >
> > I'm thinking one thing that could work is extending bind-rx with an
> > optional rx-buf-len arg, which in the code translates into devmem
> > using the new net_mp_open_rxq variant which not only restarts the
> > queue but also sets the size. From there the implementation should be
> > fairly straightforward in devmem. devmem currently rejects any pp for
> > which pp.order != 0. It would need to start accepting that and
> > forwarding the order to the gen_pool doing the allocations, etc.
>
> Right, that's the logical alternative, to put that rx-buf-len on the
> binding to control the size of the niovs. But then what do we do with
> the queue's rx-buf-len? bnxt patch in the series does
> page_pool_dev_alloc_frag(..., bp->rx_page_size). bp->rx_page_size comes
> from netlink. Does it need to be inherited from the pp in the devmem
> case somehow?
I need to review the series closely, but the only thing that makes
sense to me off the bat is that the rx-buf-len option sets the
rx-buf-len of the queue as if you called the queue-set API in a
separate call (and the unbind would reset the value to default).
--
Thanks,
Mina
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-29 16:33 ` Stanislav Fomichev
@ 2025-07-30 14:16 ` Pavel Begunkov
2025-07-30 15:50 ` Stanislav Fomichev
0 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-30 14:16 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 7/29/25 17:33, Stanislav Fomichev wrote:
> On 07/28, Pavel Begunkov wrote:
>> On 7/28/25 23:06, Stanislav Fomichev wrote:
>>> On 07/28, Pavel Begunkov wrote:
>>>> On 7/28/25 21:21, Stanislav Fomichev wrote:
>>>>> On 07/28, Pavel Begunkov wrote:
>>>>>> On 7/28/25 18:13, Stanislav Fomichev wrote:
>>>> ...>>> Supporting big buffers is the right direction, but I have the same
>>>>>>> feedback:
>>>>>>
>>>>>> Let me actually check the feedback for the queue config RFC...
>>>>>>
>>>>>> it would be nice to fit a cohesive story for the devmem as well.
>>>>>>
>>>>>> Only the last patch is zcrx specific, the rest is agnostic,
>>>>>> devmem can absolutely reuse that. I don't think there are any
>>>>>> issues wiring up devmem?
>>>>>
>>>>> Right, but the patch number 2 exposes per-queue rx-buf-len which
>>>>> I'm not sure is the right fit for devmem, see below. If all you
>>>>
>>>> I guess you're talking about uapi setting it, because as an
>>>> internal per queue parameter IMHO it does make sense for devmem.
>>>>
>>>>> care is exposing it via io_uring, maybe don't expose it from netlink for
>>>>
>>>> Sure, I can remove the set operation.
>>>>
>>>>> now? Although I'm not sure I understand why you're also passing
>>>>> this per-queue value via io_uring. Can you not inherit it from the
>>>>> queue config?
>>>>
>>>> It's not a great option. It complicates user space with netlink.
>>>> And there are convenience configuration features in the future
>>>> that requires io_uring to parse memory first. E.g. instead of
>>>> user specifying a particular size, it can say "choose the largest
>>>> length under 32K that the backing memory allows".
>>>
>>> Don't you already need a bunch of netlink to setup rss and flow
>>
>> Could be needed, but there are cases where configuration and
>> virtual queue selection is done outside the program. I'll need
>> to ask which option we currently use.
>
> If the setup is done outside, you can also setup rx-buf-len outside, no?
You can't do it without assuming the memory layout, and that's
the application's role to allocate buffers. Not to mention that
often the app won't know about all specifics either and it'd be
resolved on zcrx registration.
>>> steering? And if we end up adding queue api, you'll have to call that
>>> one over netlink also.
>>
>> There is already a queue api, even though it's cropped IIUC.
>> What kind of extra setup you have in mind?
>
> I'm talking about allocating the queues. Currently the zc/devmem setup is
> a bit complicated, we need to partition the queues and rss+flow
> steer into a subset of zerocopy ones. In the future we might add some apis
> to request a new dedicated queue for the specific flow(s). That should
> hopefully simplify the design (and make the cleanup of the queues more
> robust if the application dies).
I see, would be useful indeed, but let's not overcomplicate things
until we have to, especially since there are reasons not to. For
the configuration, I was arguing for a while that it'd be great to
have an allocated queue wrapped into an fd, so that all
containerisation / queue passing / security / etc. questions just get
solved in a generic and ubiquitous way.
>>>>> If we assume that at some point niov can be backed up by chunks larger
>>>>> than PAGE_SIZE, the assumed workflow for devemem is:
>>>>> 1. change rx-buf-len to 32K
>>>>> - this is needed only for devmem, but not for CPU RAM, but we'll have
>>>>> to refill the queues from the main memory anyway
>>>>
>>>> Urgh, that's another reason why I prefer to just pass it through
>>>> zcrx and not netlink. So maybe you can just pass the len to devmem
>>>> on creation, and internally it sets up its queues with it.
>>>
>>> But you still need to solve MAX_PAGE_ORDER/PAGE_ALLOC_COSTLY_ORDER I
>>> think? We don't want the drivers to do PAGE_ALLOC_COSTLY_ORDER costly
>>> allocation presumably?
>>
>> #define PAGE_ALLOC_COSTLY_ORDER 3
>>
>> It's "costly" for the page allocator and not a custom specially
>> cooked memory providers. Nobody should care as long as the length
>> applies to the given provider only. MAX_PAGE_ORDER also seems to
>> be a page allocator thing.
>
> By custom memory providers you mean page pool? Thinking about it more,
zcrx / devmem. I'm just saying that in situations where zcrx sets
the size for its queues and that only affects zcrx allocations and
not normal page pools, PAGE_ALLOC_COSTLY_ORDER is irrelevant.
I agreed on dropping the netlink queue size set, which leaves the
global size set, but that's a separate topic.
> maybe it's fine as is as long as we have ndo_queue_cfg_validate that
> enforces sensible ranges..
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-30 14:16 ` Pavel Begunkov
@ 2025-07-30 15:50 ` Stanislav Fomichev
2025-07-31 19:34 ` Mina Almasry
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-07-30 15:50 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
michael.chan, dtatulea, ap420073
On 07/30, Pavel Begunkov wrote:
> On 7/29/25 17:33, Stanislav Fomichev wrote:
> > On 07/28, Pavel Begunkov wrote:
> > > On 7/28/25 23:06, Stanislav Fomichev wrote:
> > > > On 07/28, Pavel Begunkov wrote:
> > > > > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > > > > On 07/28, Pavel Begunkov wrote:
> > > > > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > > > > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > > > > feedback:
> > > > > > >
> > > > > > > Let me actually check the feedback for the queue config RFC...
> > > > > > >
> > > > > > > it would be nice to fit a cohesive story for the devmem as well.
> > > > > > >
> > > > > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > > > > devmem can absolutely reuse that. I don't think there are any
> > > > > > > issues wiring up devmem?
> > > > > >
> > > > > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > > > > I'm not sure is the right fit for devmem, see below. If all you
> > > > >
> > > > > I guess you're talking about uapi setting it, because as an
> > > > > internal per queue parameter IMHO it does make sense for devmem.
> > > > >
> > > > > > care is exposing it via io_uring, maybe don't expose it from netlink for
> > > > >
> > > > > Sure, I can remove the set operation.
> > > > >
> > > > > > now? Although I'm not sure I understand why you're also passing
> > > > > > this per-queue value via io_uring. Can you not inherit it from the
> > > > > > queue config?
> > > > >
> > > > > It's not a great option. It complicates user space with netlink.
> > > > > And there are convenience configuration features in the future
> > > > > that requires io_uring to parse memory first. E.g. instead of
> > > > > user specifying a particular size, it can say "choose the largest
> > > > > length under 32K that the backing memory allows".
> > > >
> > > > Don't you already need a bunch of netlink to setup rss and flow
> > >
> > > Could be needed, but there are cases where configuration and
> > > virtual queue selection is done outside the program. I'll need
> > > to ask which option we currently use.
> >
> > If the setup is done outside, you can also setup rx-buf-len outside, no?
>
> You can't do it without assuming the memory layout, and that's
> the application's role to allocate buffers. Not to mention that
> often the app won't know about all specifics either and it'd be
> resolved on zcrx registration.
I think, fundamentally, we need to distinguish:
1. chunk size of the memory pool (page pool order, niov size)
2. chunk size of the rx queue entries (this is what this series calls
rx-buf-len), mostly influenced by MTU?
For devmem (and same for iou?), we want an option to derive (2) from (1):
page pools with larger chunks need to generate larger rx entries.
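(In sketch form, with (1) expressed as the provider's page order:)

	/* derive the rx queue entry size (2) from the pool chunk size (1) */
	u32 rx_buf_len = PAGE_SIZE << pool->p.order;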
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-30 15:50 ` Stanislav Fomichev
@ 2025-07-31 19:34 ` Mina Almasry
2025-07-31 19:57 ` Pavel Begunkov
2025-08-01 9:58 ` Pavel Begunkov
0 siblings, 2 replies; 66+ messages in thread
From: Mina Almasry @ 2025-07-31 19:34 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Pavel Begunkov, Jakub Kicinski, netdev, io-uring, Eric Dumazet,
Willem de Bruijn, Paolo Abeni, andrew+netdev, horms, davem, sdf,
dw, michael.chan, dtatulea, ap420073
On Wed, Jul 30, 2025 at 8:50 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 07/30, Pavel Begunkov wrote:
> > On 7/29/25 17:33, Stanislav Fomichev wrote:
> > > On 07/28, Pavel Begunkov wrote:
> > > > On 7/28/25 23:06, Stanislav Fomichev wrote:
> > > > > On 07/28, Pavel Begunkov wrote:
> > > > > > On 7/28/25 21:21, Stanislav Fomichev wrote:
> > > > > > > On 07/28, Pavel Begunkov wrote:
> > > > > > > > On 7/28/25 18:13, Stanislav Fomichev wrote:
> > > > > > ...>>> Supporting big buffers is the right direction, but I have the same
> > > > > > > > > feedback:
> > > > > > > >
> > > > > > > > Let me actually check the feedback for the queue config RFC...
> > > > > > > >
> > > > > > > > it would be nice to fit a cohesive story for the devmem as well.
> > > > > > > >
> > > > > > > > Only the last patch is zcrx specific, the rest is agnostic,
> > > > > > > > devmem can absolutely reuse that. I don't think there are any
> > > > > > > > issues wiring up devmem?
> > > > > > >
> > > > > > > Right, but the patch number 2 exposes per-queue rx-buf-len which
> > > > > > > I'm not sure is the right fit for devmem, see below. If all you
> > > > > >
> > > > > > I guess you're talking about uapi setting it, because as an
> > > > > > internal per queue parameter IMHO it does make sense for devmem.
> > > > > >
> > > > > > > care is exposing it via io_uring, maybe don't expose it from netlink for
> > > > > >
> > > > > > Sure, I can remove the set operation.
> > > > > >
> > > > > > > now? Although I'm not sure I understand why you're also passing
> > > > > > > this per-queue value via io_uring. Can you not inherit it from the
> > > > > > > queue config?
> > > > > >
> > > > > > It's not a great option. It complicates user space with netlink.
> > > > > > And there are convenience configuration features in the future
> > > > > > that require io_uring to parse memory first. E.g. instead of
> > > > > > user specifying a particular size, it can say "choose the largest
> > > > > > length under 32K that the backing memory allows".
> > > > >
> > > > > Don't you already need a bunch of netlink to setup rss and flow
> > > >
> > > > Could be needed, but there are cases where configuration and
> > > > virtual queue selection is done outside the program. I'll need
> > > > to ask which option we currently use.
> > >
> > > If the setup is done outside, you can also setup rx-buf-len outside, no?
> >
> > You can't do it without assuming the memory layout, and that's
> > the application's role to allocate buffers. Not to mention that
> > often the app won't know about all specifics either and it'd be
> > resolved on zcrx registration.
>
> I think, fundamentally, we need to distinguish:
>
> 1. chunk size of the memory pool (page pool order, niov size)
> 2. chunk size of the rx queue entries (this is what this series calls
> rx-buf-len), mostly influenced by MTU?
>
> For devmem (and same for iou?), we want an option to derive (2) from (1):
> page pools with larger chunks need to generate larger rx entries.
To be honest I'm not following. #1 and #2 seem the same to me.
rx-buf-len is just the size of each rx buffer posted to the NIC.
With pp_params.order = 0 (most common configuration today), rx-buf-len
== 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
then, again regardless of MTU.
I think if the user has not configured rx-buf-len, the driver is
probably free to pick whatever it wants and that can be a derivative
of the MTU.
When the rx-buf-len is configured by the user, I assume the driver
puts aside all MTU-related heuristics (if it has them) and uses
whatever the userspace specified.
Note that the memory provider may reject the request. For example
iouring and pages providers can only do page-order allocations. Devmem
can in theory do any byte-aligned allocation, since gen_pool doesn't
have a restriction AFAIR.
--
Thanks,
Mina
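For illustration, a hedged sketch of the page-order restriction mentioned
above (helper name hypothetical):
static int mp_validate_rx_buf_len(u32 rx_buf_len)
{
	if (!rx_buf_len)
		return 0;	/* unset: driver default, nothing to check */
	/* page-backed providers can only hand out whole page-order sizes */
	if (rx_buf_len < PAGE_SIZE || !is_power_of_2(rx_buf_len))
		return -EINVAL;
	return 0;
}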
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-31 19:34 ` Mina Almasry
@ 2025-07-31 19:57 ` Pavel Begunkov
2025-07-31 20:05 ` Mina Almasry
2025-08-01 9:58 ` Pavel Begunkov
1 sibling, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2025-07-31 19:57 UTC (permalink / raw)
To: Mina Almasry, Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On 7/31/25 20:34, Mina Almasry wrote:
> On Wed, Jul 30, 2025 at 8:50 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>>
>> On 07/30, Pavel Begunkov wrote:
>>> On 7/29/25 17:33, Stanislav Fomichev wrote:
>>>> On 07/28, Pavel Begunkov wrote:
>>>>> On 7/28/25 23:06, Stanislav Fomichev wrote:
>>>>>> On 07/28, Pavel Begunkov wrote:
>>>>>>> On 7/28/25 21:21, Stanislav Fomichev wrote:
>>>>>>>> On 07/28, Pavel Begunkov wrote:
>>>>>>>>> On 7/28/25 18:13, Stanislav Fomichev wrote:
>>>>>>> ...>>> Supporting big buffers is the right direction, but I have the same
>>>>>>>>>> feedback:
>>>>>>>>>
>>>>>>>>> Let me actually check the feedback for the queue config RFC...
>>>>>>>>>
>>>>>>>>> it would be nice to fit a cohesive story for the devmem as well.
>>>>>>>>>
>>>>>>>>> Only the last patch is zcrx specific, the rest is agnostic,
>>>>>>>>> devmem can absolutely reuse that. I don't think there are any
>>>>>>>>> issues wiring up devmem?
>>>>>>>>
>>>>>>>> Right, but the patch number 2 exposes per-queue rx-buf-len which
>>>>>>>> I'm not sure is the right fit for devmem, see below. If all you
>>>>>>>
>>>>>>> I guess you're talking about uapi setting it, because as an
>>>>>>> internal per queue parameter IMHO it does make sense for devmem.
>>>>>>>
>>>>>>>> care is exposing it via io_uring, maybe don't expose it from netlink for
>>>>>>>
>>>>>>> Sure, I can remove the set operation.
>>>>>>>
>>>>>>>> now? Although I'm not sure I understand why you're also passing
>>>>>>>> this per-queue value via io_uring. Can you not inherit it from the
>>>>>>>> queue config?
>>>>>>>
>>>>>>> It's not a great option. It complicates user space with netlink.
>>>>>>> And there are convenience configuration features in the future
>>>>>>> that require io_uring to parse memory first. E.g. instead of
>>>>>>> user specifying a particular size, it can say "choose the largest
>>>>>>> length under 32K that the backing memory allows".
>>>>>>
>>>>>> Don't you already need a bunch of netlink to setup rss and flow
>>>>>
>>>>> Could be needed, but there are cases where configuration and
>>>>> virtual queue selection is done outside the program. I'll need
>>>>> to ask which option we currently use.
>>>>
>>>> If the setup is done outside, you can also setup rx-buf-len outside, no?
>>>
>>> You can't do it without assuming the memory layout, and that's
>>> the application's role to allocate buffers. Not to mention that
>>> often the app won't know about all specifics either and it'd be
>>> resolved on zcrx registration.
>>
>> I think, fundamentally, we need to distinguish:
>>
>> 1. chunk size of the memory pool (page pool order, niov size)
>> 2. chunk size of the rx queue entries (this is what this series calls
>> rx-buf-len), mostly influenced by MTU?
>>
>> For devmem (and same for iou?), we want an option to derive (2) from (1):
>> page pools with larger chunks need to generate larger rx entries.
>
> To be honest I'm not following. #1 and #2 seem the same to me.
> rx-buf-len is just the size of each rx buffer posted to the NIC.
>
> With pp_params.order = 0 (most common configuration today), rx-buf-len
> == 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
> then, again regardless of MTU.
There are drivers that fragment the buffer they get from a page
pool and give smaller chunks to the hw. It's surely a good idea to
be more explicit on what's what, but from the whole setup and uapi
perspective I'm not too concerned.
The parameter the user passes to zcrx must control 1. As for 2.
I'd expect the driver to use the passed size directly or fail
validation, but even if that's not the case, zcrx / devmem would
just continue to work without any change in uapi, so we have
the freedom to patch up the nuances later on if anything sticks
out.
> I think if the user has not configured rx-buf-len, the driver is
> probably free to pick whatever it wants and that can be a derivative
> of the MTU.
>
> When the rx-buf-len is configured by the user, I assume the driver
> puts aside all MTU-related heuristics (if it has them) and uses
> whatever the userspace specified.
>
> Note that the memory provider may reject the request. For example
> iouring and pages providers can only do page-order allocations. Devmem
> can in theory do any byte-aligned allocation, since gen_pool doesn't
> have a restriction AFAIR.
>
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-31 19:57 ` Pavel Begunkov
@ 2025-07-31 20:05 ` Mina Almasry
2025-08-01 9:48 ` Pavel Begunkov
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-07-31 20:05 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Stanislav Fomichev, Jakub Kicinski, netdev, io-uring,
Eric Dumazet, Willem de Bruijn, Paolo Abeni, andrew+netdev, horms,
davem, sdf, dw, michael.chan, dtatulea, ap420073
On Thu, Jul 31, 2025 at 12:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 7/31/25 20:34, Mina Almasry wrote:
> > On Wed, Jul 30, 2025 at 8:50 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >>
> >> On 07/30, Pavel Begunkov wrote:
> >>> On 7/29/25 17:33, Stanislav Fomichev wrote:
> >>>> On 07/28, Pavel Begunkov wrote:
> >>>>> On 7/28/25 23:06, Stanislav Fomichev wrote:
> >>>>>> On 07/28, Pavel Begunkov wrote:
> >>>>>>> On 7/28/25 21:21, Stanislav Fomichev wrote:
> >>>>>>>> On 07/28, Pavel Begunkov wrote:
> >>>>>>>>> On 7/28/25 18:13, Stanislav Fomichev wrote:
> >>>>>>> ...>>> Supporting big buffers is the right direction, but I have the same
> >>>>>>>>>> feedback:
> >>>>>>>>>
> >>>>>>>>> Let me actually check the feedback for the queue config RFC...
> >>>>>>>>>
> >>>>>>>>> it would be nice to fit a cohesive story for the devmem as well.
> >>>>>>>>>
> >>>>>>>>> Only the last patch is zcrx specific, the rest is agnostic,
> >>>>>>>>> devmem can absolutely reuse that. I don't think there are any
> >>>>>>>>> issues wiring up devmem?
> >>>>>>>>
> >>>>>>>> Right, but the patch number 2 exposes per-queue rx-buf-len which
> >>>>>>>> I'm not sure is the right fit for devmem, see below. If all you
> >>>>>>>
> >>>>>>> I guess you're talking about uapi setting it, because as an
> >>>>>>> internal per queue parameter IMHO it does make sense for devmem.
> >>>>>>>
> >>>>>>>> care is exposing it via io_uring, maybe don't expose it from netlink for
> >>>>>>>
> >>>>>>> Sure, I can remove the set operation.
> >>>>>>>
> >>>>>>>> now? Although I'm not sure I understand why you're also passing
> >>>>>>>> this per-queue value via io_uring. Can you not inherit it from the
> >>>>>>>> queue config?
> >>>>>>>
> >>>>>>> It's not a great option. It complicates user space with netlink.
> >>>>>>> And there are convenience configuration features in the future
> >>>>>>> that require io_uring to parse memory first. E.g. instead of
> >>>>>>> user specifying a particular size, it can say "choose the largest
> >>>>>>> length under 32K that the backing memory allows".
> >>>>>>
> >>>>>> Don't you already need a bunch of netlink to setup rss and flow
> >>>>>
> >>>>> Could be needed, but there are cases where configuration and
> >>>>> virtual queue selection is done outside the program. I'll need
> >>>>> to ask which option we currently use.
> >>>>
> >>>> If the setup is done outside, you can also setup rx-buf-len outside, no?
> >>>
> >>> You can't do it without assuming the memory layout, and that's
> >>> the application's role to allocate buffers. Not to mention that
> >>> often the app won't know about all specifics either and it'd be
> >>> resolved on zcrx registration.
> >>
> >> I think, fundamentally, we need to distinguish:
> >>
> >> 1. chunk size of the memory pool (page pool order, niov size)
> >> 2. chunk size of the rx queue entries (this is what this series calls
> >> rx-buf-len), mostly influenced by MTU?
> >>
> >> For devmem (and same for iou?), we want an option to derive (2) from (1):
> >> page pools with larger chunks need to generate larger rx entries.
> >
> > To be honest I'm not following. #1 and #2 seem the same to me.
> > rx-buf-len is just the size of each rx buffer posted to the NIC.
> >
> > With pp_params.order = 0 (most common configuration today), rx-buf-len
> > == 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
> > then, again regardless of MTU.
>
> There are drivers that fragment the buffer they get from a page
> pool and give smaller chunks to the hw. It's surely a good idea to
> be more explicit on what's what, but from the whole setup and uapi
> perspective I'm not too concerned.
>
> The parameter the user passes to zcrx must control 1. As for 2.
> I'd expect the driver to use the passed size directly or fail
> validation, but even if that's not the case, zcrx / devmem would
> just continue to work without any change in uapi, so we have
> the freedom to patch up the nuances later on if anything sticks
> out.
>
I indeed forgot about driver-fragmenting. That does complicate things
quite a bit.
So AFAIU the intended behavior is that rx-buf-len refers to the memory
size allocated by the driver (and thus the memory provider), but not
necessarily the one posted by the driver if it's fragmenting that
piece of memory? If so, that sounds good to me. Although I wonder if
that could cause some unexpected behavior... Someone may configure
rx-buf-len to 8K on one driver and get actual 8K packets, but then
configure rx-buf-len on another driver and get 4K packets because the
driver fragmented each buffer into 2...
I guess in the future there may be a knob that controls how much
fragmentation the driver does?
--
Thanks,
Mina
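For illustration, a sketch of the driver-side fragmenting described above
(post_to_hw_ring() is a hypothetical helper, not a real API):
void post_to_hw_ring(struct page *buf, u32 off, u32 len); /* hypothetical */

/* The provider hands out rx_buf_len-sized chunks, but the driver may
 * split each one, e.g. rx_buf_len == 8K posted as two 4K rx entries. */
static void drv_post_rx_chunk(struct page *buf, u32 rx_buf_len, u32 hw_chunk)
{
	u32 off;

	for (off = 0; off + hw_chunk <= rx_buf_len; off += hw_chunk)
		post_to_hw_ring(buf, off, hw_chunk);
}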
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-31 20:05 ` Mina Almasry
@ 2025-08-01 9:48 ` Pavel Begunkov
0 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-08-01 9:48 UTC (permalink / raw)
To: Mina Almasry
Cc: Stanislav Fomichev, Jakub Kicinski, netdev, io-uring,
Eric Dumazet, Willem de Bruijn, Paolo Abeni, andrew+netdev, horms,
davem, sdf, dw, michael.chan, dtatulea, ap420073
On 7/31/25 21:05, Mina Almasry wrote:
> On Thu, Jul 31, 2025 at 12:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
...>>>>>> If the setup is done outside, you can also setup rx-buf-len outside, no?
>>>>>
>>>>> You can't do it without assuming the memory layout, and that's
>>>>> the application's role to allocate buffers. Not to mention that
>>>>> often the app won't know about all specifics either and it'd be
>>>>> resolved on zcrx registration.
>>>>
>>>> I think, fundamentally, we need to distinguish:
>>>>
>>>> 1. chunk size of the memory pool (page pool order, niov size)
>>>> 2. chunk size of the rx queue entries (this is what this series calls
>>>> rx-buf-len), mostly influenced by MTU?
>>>>
>>>> For devmem (and same for iou?), we want an option to derive (2) from (1):
>>>> page pools with larger chunks need to generate larger rx entries.
>>>
>>> To be honest I'm not following. #1 and #2 seem the same to me.
>>> rx-buf-len is just the size of each rx buffer posted to the NIC.
>>>
>>> With pp_params.order = 0 (most common configuration today), rx-buf-len
>>> == 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
>>> then, again regardless of MTU.
>>
>> There are drivers that fragment the buffer they get from a page
>> pool and give smaller chunks to the hw. It's surely a good idea to
>> be more explicit on what's what, but from the whole setup and uapi
>> perspective I'm not too concerned.
>>
>> The parameter the user passes to zcrx must control 1. As for 2.
>> I'd expect the driver to use the passed size directly or fail
>> validation, but even if that's not the case, zcrx / devmem would
>> just continue to work without any change in uapi, so we have
>> the freedom to patch up the nuances later on if anything sticks
>> out.
>>
>
> I indeed forgot about driver-fragmenting. That does complicate things
> quite a bit.
>
> So AFAIU the intended behavior is that rx-buf-len refers to the memory
> size allocated by the driver (and thus the memory provider), but not
> necessarily the one posted by the driver if it's fragmenting that
> piece of memory? If so, that sounds good to me. Although I wonder if
Yep
> that could cause some unexpected behavior... Someone may configure
> rx-buf-len to 8K on one driver and get actual 8K packets, but then
> configure rx-buf-len on another driver and get 4K packets because the
> driver fragmented each buffer into 2...
That can already happen: the user can hope to get fully filled buffers
but shouldn't assume that they will. HW GRO can't be 100% reliable in
this sense for all circumstances. And I don't think it's sane for
driver implementations to do that. Fragmenting PAGE_SIZE because the
NIC needs smaller chunks or for some other compatibility reasons?
Sure, but then I don't see a reason for validating even larger buffers.
> I guess in the future there may be a knob that controls how much
> fragmentation the driver does?
Probably, but hopefully it'll not be needed
--
Pavel Begunkov
* Re: [RFC v1 00/22] Large rx buffer support for zcrx
2025-07-31 19:34 ` Mina Almasry
2025-07-31 19:57 ` Pavel Begunkov
@ 2025-08-01 9:58 ` Pavel Begunkov
1 sibling, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-08-01 9:58 UTC (permalink / raw)
To: Mina Almasry, Stanislav Fomichev
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On 7/31/25 20:34, Mina Almasry wrote:
> On Wed, Jul 30, 2025 at 8:50 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
...>> For devmem (and same for iou?), we want an option to derive (2) from (1):
>> page pools with larger chunks need to generate larger rx entries.
>
> To be honest I'm not following. #1 and #2 seem the same to me.
> rx-buf-len is just the size of each rx buffer posted to the NIC.
>
> With pp_params.order = 0 (most common configuration today), rx-buf-len
> == 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
> then, again regardless of MTU.
>
> I think if the user has not configured rx-buf-len, the driver is
> probably free to pick whatever it wants and that can be a derivative
> of the MTU.
>
> When the rx-buf-len is configured by the user, I assume the driver
> puts aside all MTU-related heuristics (if it has them) and uses
> whatever the userspace specified.
>
> Note that the memory provider may reject the request. For example
> iouring and pages providers can only do page-order allocations. Devmem
> can in theory do any byte-aligned allocation, since gen_pool doesn't
> have a restriction AFAIR.
It's trivial to add sub-page and/or non-pow2 allocations to zcrx,
but the size needs to be uniform and decided on at registration.
--
Pavel Begunkov
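For illustration, a sketch of what "uniform and decided at registration"
could look like (struct and field names hypothetical):
struct zcrx_area_reg {
	void	*addr;		/* user memory backing the area */
	size_t	len;
	u32	niov_size;	/* one uniform chunk size for all niovs */
};

static int zcrx_validate_area(const struct zcrx_area_reg *reg)
{
	/* the chosen size must evenly divide the registered area */
	if (!reg->niov_size || reg->len % reg->niov_size)
		return -EINVAL;
	return 0;
}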
* Re: [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-07-28 21:36 ` Mina Almasry
@ 2025-08-01 23:13 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-01 23:13 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 14:36:37 -0700 Mina Almasry wrote:
> > +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> > +uses to receive packets. If the device uses different memory polls for headers
> > +and payload this setting may control the size of the header buffers but must
> > +control the size of the payload buffers.
> > +
>
> To be honest I'm not a big fan of the ambiguity here? Could this
> configure just the payload buffer sizes? And a new one to configure
> the header buffer sizes eventually?
>
> Also, IIUC in this patchset, actually the size applied will be the
> order that is larger than the size configured, no? So a setting of 9KB
> will actually result in 16KB, no? Should this be documented? Or do we
> expect non power of 2 sizes to be rejected by the driver and this API
> fail?
This is an existing parameter.
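For the 9KB -> 16KB case raised above, a sketch of the order-based
rounding a page-backed driver might do (illustrative, not necessarily what
any driver implements):
static unsigned int round_rx_buf_len(unsigned int rx_buf_len)
{
	/* get_order() rounds up: with 4K pages, 9K -> order 2 -> 16K */
	return PAGE_SIZE << get_order(rx_buf_len);
}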
* Re: [RFC v1 04/22] net: clarify the meaning of netdev_config members
2025-07-28 21:44 ` Mina Almasry
@ 2025-08-01 23:14 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-01 23:14 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 14:44:19 -0700 Mina Almasry wrote:
> > struct netdev_config {
> > + /* Direct value
> > + *
> > + * Driver default is expected to be fixed, and set in this struct
> > + * at init. From that point on user may change the value. There is
> > + * no explicit way to "unset" / restore driver default.
> > + */
>
> Does the user setting hds_thres imply turning hds_config to "on"? Or
> is hds_thres only used when hds_config is actually on?
The latter.
* Re: [RFC v1 05/22] net: add rx_buf_len to netdev config
2025-07-28 21:50 ` Mina Almasry
@ 2025-08-01 23:18 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-01 23:18 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 14:50:12 -0700 Mina Almasry wrote:
> On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> > Add rx_buf_len to configuration maintained by the core.
> > Use "three-state" semantics where 0 means "driver default".
>
> What are three states in the semantics here?
>
> - 0 = driver default.
> - non-zero means value set by userspace
>
> What is the 3rd state here?
I just mean a value with an explicit default / unset state.
If you have a better name, I'm all ears...
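For illustration, a sketch of those semantics (assuming the series'
netdev_config; helper name hypothetical):
/* 0 means "unset, fall back to the driver default"; any non-zero
 * value was explicitly set by the user. */
static u32 resolve_rx_buf_len(const struct netdev_config *cfg,
			      u32 driver_default)
{
	return cfg->rx_buf_len ? cfg->rx_buf_len : driver_default;
}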
> > diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> > index a87298f659f5..8fdffc77e981 100644
> > --- a/net/ethtool/common.c
> > +++ b/net/ethtool/common.c
> > @@ -832,6 +832,7 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
> >
> > /* Driver gives us current state, we want to return current config */
> > kparam->tcp_data_split = dev->cfg->hds_config;
> > + kparam->rx_buf_len = dev->cfg->rx_buf_len;
>
> I'm confused that struct netdev_config is defined in netdev_queues.h,
> and is documented to be a queue-related configuration, but doesn't
> seem to be actually per queue? This line is grabbing the current
> config for this queue from dev->cfg which looks like a shared value.
>
> I don't think rx_buf_len should be a shared value between all the
> queues. I strongly think it should a per-queue value. The
> devmem/io_uring queues will probably want large rx_buf_len, but normal
> queues will want 0 buf len, me thinks.
I presume that question answered itself as you were reading the rest
of the patches? :)
* Re: [RFC v1 15/22] eth: bnxt: store the rx buf size per queue
2025-07-28 22:33 ` Mina Almasry
@ 2025-08-01 23:20 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-01 23:20 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 15:33:55 -0700 Mina Almasry wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> >
> > In normal operation only a subset of queues is configured for
> > zero-copy. Since zero-copy is the main use for larger buffer
> > sizes we need to configure the sizes per queue.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>
> I wonder if this is necessary for some reason, or is it better to
> expect the driver to refer to the netdev->qcfgs directly?
>
> By my count the configs can now live in 4 places: the core netdev
> config, the core per-queue config, the driver netdev config, and the
> driver per-queue config.
>
> Honestly, I'm not sure about duplicating settings between the netdev
> configs and the per-queue configs in the first place (seems like
> configs should be either driver wide or per-queue to me, and not
> both), and I'm less sure about again duplicating the settings between
> core structs and in-driver structs. Seems like the same information
> duplicated in many places and a nightmare to keep it all in sync.
Does patch 20 answer this question?
* Re: [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue
2025-07-28 23:10 ` Mina Almasry
@ 2025-08-01 23:37 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-01 23:37 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 16:10:36 -0700 Mina Almasry wrote:
> I don't see this used anywhere.
Good catch, I think I was planning to reuse the full structs from
ethtool. And when I gave up I forgot to re-implement the checks.
> But more generally, I'm a bit concerned about protecting drivers that
> don't support configuring one particular queue config. I think likely
> supported_ring_params needs to be moved earlier to the patch which
> adds per queue netdev_configs to the queue API, and probably as part
> of that patch core needs to make sure it's never asking a driver that
> doesn't support changing a netdev_queue_config to do so?
I may be missing your point, but the "supported_params" flags are just
an internal kernel thing. Based on our rich experience of drivers not
validating inputs, the "supported" flags just tell the core that a driver
will pay attention to a given member of the struct. We can add new members
without having to go over all existing drivers to add input validation.
The flag doesn't actually say anything about a particular configuration
being... well... supported. It's just that the driver promises to
interpret it.
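For illustration, a sketch of that core-side check
(ETHTOOL_RING_USE_RX_BUF_LEN is the existing ethtool supported_ring_params
flag; netdev_queue_config is from this series):
/* Only forward a user-set value if the driver declared that it will
 * interpret this member of the struct. */
static int check_ring_param_supported(u32 supported_ring_params,
				      const struct netdev_queue_config *qcfg)
{
	if (qcfg->rx_buf_len &&
	    !(supported_ring_params & ETHTOOL_RING_USE_RX_BUF_LEN))
		return -EOPNOTSUPP;
	return 0;
}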
> Some thought may be given to moving the entire configuration story
> outside of queue_mem_alloc/free and queue_start/stop altogether to new
> ndos where core can easily check if the ndo is supported otherwise
> per-queue config is not supported. Otherwise core needs to be careful
> never to attempt a config that is not supported?
The configuration is of the queues. The queue configuration belongs in
the queue APIs.
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
@ 2025-08-02 0:10 ` Jakub Kicinski
2025-08-04 12:50 ` Pavel Begunkov
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-02 0:10 UTC (permalink / raw)
To: Pavel Begunkov
Cc: netdev, io-uring, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
dtatulea, ap420073
On Mon, 28 Jul 2025 12:04:25 +0100 Pavel Begunkov wrote:
> This patch allows memory providers to pass a queue config when opening a
> queue. It'll be used in the next patch to pass a custom rx buffer length
> from zcrx. As there are many users of netdev_rx_queue_restart(), it's
> allowed to pass a NULL qcfg, in which case the function will use the
> default configuration.
This is not exactly what I anticipated, TBH; I was thinking of
extending the config stuff with another layer... Drivers will
restart their queues for all sorts of random reasons, so we need to be able
to reconstitute this config easily and serve it up via
netdev_queue_config(). This was, IIUC, also Mina's first concern.
My thinking was that the config would be constructed like this:
qcfg = init_to_defaults()
drv_def = get_driver_defaults()
for each setting:
if drv_def.X.set:
qcfg.X = drv_def.X.value
if dev.config.X.set:
qcfg.X = dev.config.X.value
if dev.config.qcfg[qid].X.set:
qcfg.X = dev.config.qcfg[qid].X.value
if dev.config.mp[qid].X.set: << this was not in my
qcfg.X = dev.config.mp[qid].X.value << RFC series
Since we don't allow MP to be replaced atomically today, we don't
actually have to place the mp overrides in the config struct and
involve the whole netdev_reconfig_start() _swap() _free() machinery.
We can just stash the config in the queue state, and "logically"
do what I described above.
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-02 0:10 ` Jakub Kicinski
@ 2025-08-04 12:50 ` Pavel Begunkov
2025-08-05 22:43 ` Jakub Kicinski
` (2 more replies)
0 siblings, 3 replies; 66+ messages in thread
From: Pavel Begunkov @ 2025-08-04 12:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, io-uring, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
dtatulea, ap420073
On 8/2/25 01:10, Jakub Kicinski wrote:
> On Mon, 28 Jul 2025 12:04:25 +0100 Pavel Begunkov wrote:
>> This patch allows memory providers to pass a queue config when opening a
>> queue. It'll be used in the next patch to pass a custom rx buffer length
>> from zcrx. As there are many users of netdev_rx_queue_restart(), it's
>> allowed to pass a NULL qcfg, in which case the function will use the
>> default configuration.
>
> This is not exactly what I anticipated, TBH; I was thinking of
> extending the config stuff with another layer... Drivers will
> restart their queues for all sorts of random reasons, so we need to be able
> to reconstitute this config easily and serve it up via
Yeah, also noticed that gap while replying to Stan.
> netdev_queue_config(). This was, IIUC, also Mina's first concern.
>
> My thinking was that the config would be constructed like this:
>
> qcfg = init_to_defaults()
> drv_def = get_driver_defaults()
> for each setting:
> if drv_def.X.set:
> qcfg.X = drv_def.X.value
> if dev.config.X.set:
> qcfg.X = dev.config.X.value
> if dev.config.qcfg[qid].X.set:
> qcfg.X = dev.config.qcfg[qid].X.value
> if dev.config.mp[qid].X.set: << this was not in my
> qcfg.X = dev.config.mp[qid].X.value << RFC series
>
> Since we don't allow MP to be replaced atomically today, we don't
> actually have to place the mp overrides in the config struct and
> involve the whole netdev_reconfig_start() _swap() _free() machinery.
> We can just stash the config in the queue state, and "logically"
> do what I described above.
I was thinking stashing it in struct pp_memory_provider_params and
applying in netdev_rx_queue_restart(). Let me try to move it
into __netdev_queue_config. Any preference between keeping just
the size vs a qcfg pointer in pp_memory_provider_params?
struct pp_memory_provider_params {
const struct memory_provider_ops *mp_ops;
u32 rx_buf_len;
};
vs
struct pp_memory_provider_params {
const struct memory_provider_ops *mp_ops;
// providers will need to allocate and keep the qcfg
// until it's completely detached from the queues.
struct netdev_queue_config *qcfg;
};
The former one would be simpler for now.
--
Pavel Begunkov
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-04 12:50 ` Pavel Begunkov
@ 2025-08-05 22:43 ` Jakub Kicinski
2025-08-06 0:05 ` Jakub Kicinski
2025-08-06 16:48 ` Mina Almasry
2 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-05 22:43 UTC (permalink / raw)
To: Pavel Begunkov
Cc: netdev, io-uring, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
dtatulea, ap420073
On Mon, 4 Aug 2025 13:50:08 +0100 Pavel Begunkov wrote:
> > Since we don't allow MP to be replaced atomically today, we don't
> > actually have to place the mp overrides in the config struct and
> > involve the whole netdev_reconfig_start() _swap() _free() machinery.
> > We can just stash the config in the queue state, and "logically"
> > do what I described above.
>
> I was thinking stashing it in struct pp_memory_provider_params and
> applying in netdev_rx_queue_restart(). Let me try to move it
> into __netdev_queue_config. Any preference between keeping just
> the size vs a qcfg pointer in pp_memory_provider_params?
>
> struct pp_memory_provider_params {
> const struct memory_provider_ops *mp_ops;
> u32 rx_buf_len;
> };
>
> vs
>
> struct pp_memory_provider_params {
> const struct memory_provider_ops *mp_ops;
> // providers will need to allocate and keep the qcfg
> // until it's completely detached from the queues.
> struct netdev_queue_config *qcfg;
> };
>
> The former one would be simpler for now.
+1, I'd stick to the former. We can adjust later if need be.
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-04 12:50 ` Pavel Begunkov
2025-08-05 22:43 ` Jakub Kicinski
@ 2025-08-06 0:05 ` Jakub Kicinski
2025-08-06 16:48 ` Mina Almasry
2 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-06 0:05 UTC (permalink / raw)
To: Pavel Begunkov
Cc: netdev, io-uring, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
dtatulea, ap420073
On Mon, 4 Aug 2025 13:50:08 +0100 Pavel Begunkov wrote:
> I was thinking stashing it in struct pp_memory_provider_params and
> applying in netdev_rx_queue_restart().
Tho, netdev_rx_queue_restart() may not be a great place, it's called
from multiple points and more will come. net_mp_open_rxq() /
net_mp_close_rxq() and friends are probably a better place to apply
and clear MP related overrides.
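For illustration, a hedged sketch of applying the override at mp open time
(function names are real, but the flow is simplified; locking and error
handling omitted):
/* Stash the provider's params (including its rx-buf-len override) in
 * the queue state, so any later restart reconstructs the same config. */
static int net_mp_open_rxq_sketch(struct net_device *dev, unsigned int qid,
				  const struct pp_memory_provider_params *p)
{
	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, qid);

	rxq->mp_params = *p;
	return netdev_rx_queue_restart(dev, qid);
}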
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-04 12:50 ` Pavel Begunkov
2025-08-05 22:43 ` Jakub Kicinski
2025-08-06 0:05 ` Jakub Kicinski
@ 2025-08-06 16:48 ` Mina Almasry
2025-08-06 18:11 ` Jakub Kicinski
2 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-08-06 16:48 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jakub Kicinski, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Mon, Aug 4, 2025 at 5:48 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/2/25 01:10, Jakub Kicinski wrote:
> > On Mon, 28 Jul 2025 12:04:25 +0100 Pavel Begunkov wrote:
> >> This patch allows memory providers to pass a queue config when opening a
> >> queue. It'll be used in the next patch to pass a custom rx buffer length
> >> from zcrx. As there are many users of netdev_rx_queue_restart(), it's
> >> allowed to pass a NULL qcfg, in which case the function will use the
> >> default configuration.
> >
> > This is not exactly what I anticipated, TBH; I was thinking of
> > extending the config stuff with another layer... Drivers will
> > restart their queues for all sorts of random reasons, so we need to be able
> > to reconstitute this config easily and serve it up via
>
> Yeah, also noticed that gap while replying to Stan.
>
> > netdev_queue_config(). This was, IIUC, also Mina's first concern.
> >
> > My thinking was that the config would be constructed like this:
> >
> > qcfg = init_to_defaults()
> > drv_def = get_driver_defaults()
> > for each setting:
> > if drv_def.X.set:
> > qcfg.X = drv_def.X.value
> > if dev.config.X.set:
> > qcfg.X = dev.config.X.value
> > if dev.config.qcfg[qid].X.set:
> > qcfg.X = dev.config.qcfg[qid].X.value
> > if dev.config.mp[qid].X.set: << this was not in my
> > qcfg.X = dev.config.mp[qid].X.value << RFC series
> >
> > Since we don't allow MP to be replaced atomically today, we don't
> > actually have to place the mp overrides in the config struct and
> > involve the whole netdev_reconfig_start() _swap() _free() machinery.
> > We can just stash the config in the queue state, and "logically"
> > do what I described above.
>
> I was thinking stashing it in struct pp_memory_provider_params and
> applying in netdev_rx_queue_restart(). Let me try to move it
> into __netdev_queue_config. Any preference between keeping just
> the size vs a qcfg pointer in pp_memory_provider_params?
>
> struct pp_memory_provider_params {
> const struct memory_provider_ops *mp_ops;
> u32 rx_buf_len;
> };
>
Is this suggesting one more place where we put rx_buf_len, so in
addition to netdev_config?
Honestly I'm in favor of de-duplicating the info as much as possible,
to reduce the headache of keeping all the copies in sync.
pp_memory_provider_params is part of netdev_rx_queue. How about we add
either all of netdev_config or just rx_buf_len there? And set the
precedent that queue configs should be in netdev_rx_queue and all
pieces that need it should grab it from there? Unless the driver needs
a copy of the param I guess.
iouring zcrx and devmem can configure netdev_rx_queue->rx_buf_len in
addition to netdev_rx_queue->mp_params in this scenario.
--
Thanks,
Mina
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-06 16:48 ` Mina Almasry
@ 2025-08-06 18:11 ` Jakub Kicinski
2025-08-06 18:30 ` Mina Almasry
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-06 18:11 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Wed, 6 Aug 2025 09:48:56 -0700 Mina Almasry wrote:
> iouring zcrx and devmem can configure netdev_rx_queue->rx_buf_len in
> addition to netdev_rx_queue->mp_params in this scenario.
Did you not read my message or are you disagreeing that the setting
should be separate and form a hierarchy?
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-06 18:11 ` Jakub Kicinski
@ 2025-08-06 18:30 ` Mina Almasry
2025-08-06 22:05 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-08-06 18:30 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Wed, Aug 6, 2025 at 11:11 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 6 Aug 2025 09:48:56 -0700 Mina Almasry wrote:
> > iouring zcrx and devmem can configure netdev_rx_queue->rx_buf_len in
> > addition to netdev_rx_queue->mp_params in this scenario.
>
> Did you not read my message or are you disagreeing that the setting
> should be separate and form a hierarchy?
Sorry, I was disagreeing. The flow above seems complicated. I'm
probably missing something that requires this complication. I was
suggesting an approach I find more straightforward. Something like:
```
netdev_config = get_driver_defaults()
qcfg = get_driver_defaults()
for each setting:
if qcfg[i].X is set:
use qcfg[i].X
else
use netdev_config.X
```
APIs that set netdev-global attributes could set netdev_config.X. APIs
that set per-queue attributes would set qcfg[i].X (after validating
that the driver supports setting this param on a queue granularity).
With this flow we don't need to duplicate each attribute like
rx-buf-len in 3 different places and have a delicate hierarchy of
serving the config. And we treat mp like any other 'X'. It's just a
setting that exists per-queue but not per netdev.
Although I don't feel strongly here. If you feel the duplication is
warranted, please do go ahead! :-D
--
Thanks,
Mina
* Re: [RFC v1 21/22] net: parametrise mp open with a queue config
2025-08-06 18:30 ` Mina Almasry
@ 2025-08-06 22:05 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-08-06 22:05 UTC (permalink / raw)
To: Mina Almasry
Cc: Pavel Begunkov, netdev, io-uring, Eric Dumazet, Willem de Bruijn,
Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
dtatulea, ap420073
On Wed, 6 Aug 2025 11:30:38 -0700 Mina Almasry wrote:
> Sorry, I was disagreeing. The flow above seems complicated. I'm
> probably missing something that requires this complication. I was
> suggesting an approach I find more straightforward.
>
> Something like:
> netdev_config = get_driver_defaults()
> qcfg = get_driver_defaults()
>
> for each setting:
> if qcfg[i].X is set:
> use qcfg[i].X
> else
> use netdev_config.X
IMO the rules on when to override/update and reset qcfg[i].X will
get much more complicated than the extra `else if` in this logic.
Plus I suspect at some point we may want to add another layer here
for e.g. a group of queues delegated to the same container interface
(netkit, veth, ipvlan etc.) So I want to establish a clear model
rather than "optimize" for the number of u32 variables.
Most code (drivers) should never be exposed to any of this, they
consume a flattened qcfg for a reason.
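For illustration, a sketch of that driver-facing side (per the series'
extended queue API; the body is illustrative):
/* The driver only ever sees the flattened, fully-resolved config. */
static int drv_queue_mem_alloc(struct net_device *dev, void *per_queue_mem,
			       int idx, struct netdev_queue_config *qcfg)
{
	u32 buf_len = qcfg->rx_buf_len;	/* layering already applied by core */

	/* ... size this queue's rings / page pool using buf_len ... */
	return 0;
}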
Thread overview: 66+ messages
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-07-28 18:11 ` Mina Almasry
2025-07-28 21:36 ` Mina Almasry
2025-08-01 23:13 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-07-29 5:00 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-07-29 5:03 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-07-28 21:44 ` Mina Almasry
2025-08-01 23:14 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
2025-07-28 21:50 ` Mina Almasry
2025-08-01 23:18 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
2025-07-28 22:26 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-07-28 22:33 ` Mina Almasry
2025-08-01 23:20 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-07-28 23:10 ` Mina Almasry
2025-08-01 23:37 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 18/22] net: wipe the setting of deactived queues Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 19/22] eth: bnxt: use queue op config validate Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
2025-08-02 0:10 ` Jakub Kicinski
2025-08-04 12:50 ` Pavel Begunkov
2025-08-05 22:43 ` Jakub Kicinski
2025-08-06 0:05 ` Jakub Kicinski
2025-08-06 16:48 ` Mina Almasry
2025-08-06 18:11 ` Jakub Kicinski
2025-08-06 18:30 ` Mina Almasry
2025-08-06 22:05 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support Pavel Begunkov
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
2025-07-28 18:18 ` Pavel Begunkov
2025-07-28 20:21 ` Stanislav Fomichev
2025-07-28 21:28 ` Pavel Begunkov
2025-07-28 22:06 ` Stanislav Fomichev
2025-07-28 22:44 ` Pavel Begunkov
2025-07-29 16:33 ` Stanislav Fomichev
2025-07-30 14:16 ` Pavel Begunkov
2025-07-30 15:50 ` Stanislav Fomichev
2025-07-31 19:34 ` Mina Almasry
2025-07-31 19:57 ` Pavel Begunkov
2025-07-31 20:05 ` Mina Almasry
2025-08-01 9:48 ` Pavel Begunkov
2025-08-01 9:58 ` Pavel Begunkov
2025-07-28 23:22 ` Mina Almasry
2025-07-29 16:41 ` Stanislav Fomichev
2025-07-29 17:01 ` Mina Almasry
2025-07-28 18:54 ` Mina Almasry
2025-07-28 19:42 ` Pavel Begunkov
2025-07-28 20:23 ` Mina Almasry
2025-07-28 20:57 ` Pavel Begunkov