* [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
@ 2025-08-18 13:57 Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order Pavel Begunkov
                   ` (24 more replies)
  0 siblings, 25 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

Pull request with netdev-only patches that add support for per-queue
configuration and large rx buffers for memory providers. The zcrx
patch using it is kept separately and can be found at [2].

Large buffers yielded significant benefits during testing, e.g.
a setup with 32KB buffers used 30% less CPU than one with 4KB buffers;
see [3] for more details.

Per queue configuration series:
[1] https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/
Branch with the zcrx patch:
[2] https://github.com/isilence/linux.git zcrx/large-buffers-v3
v2 of the series:
[3] https://lore.kernel.org/all/cover.1754657711.git.asml.silence@gmail.com/

---

v3: - rebased, excluded zcrx specific patches
    - set agg_size_fac to 1 on warning
v2: - Add MAX_PAGE_ORDER check on pp init (Patch 1)
    - Applied comments rewording (Patch 2)
    - Adjust pp.max_len based on order (Patch 8)
    - Patch up mlx5 queue callbacks after rebase (Patch 12)
    - Minor ->queue_mgmt_ops refactoring (Patch 15)
    - Rebased to account for both fill level and agg_size_fac (Patch 17)
    - Pass the provider's buffer length in struct pp_memory_provider_params
      and apply it in __netdev_queue_config(). (Patch 22)
    - Use ->supported_ring_params to validate driver support for the
      qcfg parameters being set. (Patch 23)

The following changes since commit c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9:

  Linux 6.17-rc2 (2025-08-17 15:22:10 -0700)

are available in the Git repository at:

  https://github.com/isilence/linux.git tags/net-for-6.18-queue-rx-buf-len

for you to fetch changes up to 417cf28f3bf129d1a0d1b231220aa045abac3263:

  net: validate driver supports passed qcfg params (2025-08-18 07:39:50 +0100)

Jakub Kicinski (20):
      docs: ethtool: document that rx_buf_len must control payload lengths
      net: ethtool: report max value for rx-buf-len
      net: use zero value to restore rx_buf_len to default
      net: clarify the meaning of netdev_config members
      net: add rx_buf_len to netdev config
      eth: bnxt: read the page size from the adapter struct
      eth: bnxt: set page pool page order based on rx_page_size
      eth: bnxt: support setting size of agg buffers via ethtool
      net: move netdev_config manipulation to dedicated helpers
      net: reduce indent of struct netdev_queue_mgmt_ops members
      net: allocate per-queue config structs and pass them thru the queue API
      net: pass extack to netdev_rx_queue_restart()
      net: add queue config validation callback
      eth: bnxt: always set the queue mgmt ops
      eth: bnxt: store the rx buf size per queue
      eth: bnxt: adjust the fill level of agg queues with larger buffers
      netdev: add support for setting rx-buf-len per queue
      net: wipe the setting of deactived queues
      eth: bnxt: use queue op config validate
      eth: bnxt: support per queue configuration of rx-buf-len

Pavel Begunkov (3):
      net: page_pool: sanitise allocation order
      net: let pp memory provider to specify rx buf len
      net: validate driver supports passed qcfg params

 Documentation/netlink/specs/ethtool.yaml           |   4 +
 Documentation/netlink/specs/netdev.yaml            |  15 ++
 Documentation/networking/ethtool-netlink.rst       |   7 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c          | 143 ++++++++++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h          |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h      |   2 +-
 drivers/net/ethernet/google/gve/gve_main.c         |   9 +-
 .../ethernet/marvell/octeontx2/nic/otx2_ethtool.c  |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +-
 drivers/net/netdevsim/netdev.c                     |   8 +-
 include/linux/ethtool.h                            |   3 +
 include/net/netdev_queues.h                        |  84 ++++++++--
 include/net/netdev_rx_queue.h                      |   3 +-
 include/net/netlink.h                              |  19 +++
 include/net/page_pool/types.h                      |   1 +
 include/uapi/linux/ethtool_netlink_generated.h     |   1 +
 include/uapi/linux/netdev.h                        |   2 +
 net/core/Makefile                                  |   2 +-
 net/core/dev.c                                     |  12 +-
 net/core/dev.h                                     |  15 ++
 net/core/netdev-genl-gen.c                         |  15 ++
 net/core/netdev-genl-gen.h                         |   1 +
 net/core/netdev-genl.c                             |  92 +++++++++++
 net/core/netdev_config.c                           | 183 +++++++++++++++++++++
 net/core/netdev_rx_queue.c                         |  22 ++-
 net/core/page_pool.c                               |   3 +
 net/ethtool/common.c                               |   4 +-
 net/ethtool/netlink.c                              |  14 +-
 net/ethtool/rings.c                                |  14 +-
 tools/include/uapi/linux/netdev.h                  |   2 +
 32 files changed, 631 insertions(+), 84 deletions(-)
 create mode 100644 net/core/netdev_config.c


-- 
2.49.0


* [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 23:33   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

We're going to give user space more control over rx buffer sizes, and
since we can't always rely on driver validation, let's sanitise the
order in page_pool_init() as well. Note that we only need to reject
allocations above MAX_PAGE_ORDER for normal page pools, as current
memory providers don't need to use the buddy allocator and must check
the order on init.
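
For illustration, a minimal sketch of where the check bites
(hypothetical driver code, not part of this patch):

  struct page_pool_params pp = {
          .order          = MAX_PAGE_ORDER + 1,  /* too big for the buddy allocator */
          .pool_size      = 1024,
  };
  struct page_pool *pool;

  pool = page_pool_create(&pp);
  if (IS_ERR(pool))       /* now fails early with -EINVAL */
          return PTR_ERR(pool);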

Suggested-by: Stanislav Fomichev <stfomichev@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/page_pool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 343a6cac21e3..630e34533b16 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -303,6 +303,9 @@ static int page_pool_init(struct page_pool *pool,
 		}
 
 		static_branch_inc(&page_pool_mem_providers);
+	} else if (pool->p.order > MAX_PAGE_ORDER) {
+		err = -EINVAL;
+		goto free_ptr_ring;
 	}
 
 	return 0;
-- 
2.49.0


* [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 23:50   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len Pavel Begunkov
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Document the semantics of the rx_buf_len ethtool ring param.
Clarify its meaning in the case of HDS, where the driver may
have two separate buffer pools.

The various zero-copy TCP Rx schemes we have suffer from memory
management overhead. Specifically, applications aren't too impressed
with the number of 4kB buffers they have to juggle. Zero-copy
TCP makes the most sense with larger memory transfers, so using
16kB or 32kB buffers (with the help of HW-GRO) feels more
natural.
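
A sketch of the implied pool setup, assuming a driver that keeps
separate header and payload pools (names are illustrative):

  hdr_pp.order     = 0;                       /* headers stay in small buffers */
  payload_pp.order = get_order(rx_buf_len);   /* rx_buf_len applies here */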

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/networking/ethtool-netlink.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index ab20c644af24..cae372f719d1 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -966,7 +966,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
 driver. Driver may impose additional constraints and may not support all
 attributes.
 
-
 ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
 Completion queue events (CQE) are the events posted by NIC to indicate the
 completion status of a packet when the packet is sent (like send success or
@@ -980,6 +979,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
 header / data split feature. If a received packet size is larger than this
 threshold value, header and data will be split.
 
+``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffers driver
+uses to receive packets. If the device uses different buffer pools for
+headers and payload (due to HDS, HW-GRO etc.) this setting must
+control the size of the payload buffers.
+
 CHANNELS_GET
 ============
 
-- 
2.49.0


* [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19  0:00   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default Pavel Begunkov
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Unlike most of our APIs, the rx-buf-len param does not have an associated
max value. In theory the user could set this value pretty high, but in
practice most NICs have limits due to the width of the length fields
in their descriptors.
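
E.g. a driver advertises its cap from the .get_ringparam() callback
(sketch; MYDRV_HW_MAX_RX_BUF is hypothetical, the otx2 change below
does exactly this with 32768):

  kernel_ring->rx_buf_len_max = MYDRV_HW_MAX_RX_BUF;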

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/netlink/specs/ethtool.yaml                  | 4 ++++
 Documentation/networking/ethtool-netlink.rst              | 1 +
 drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++-
 include/linux/ethtool.h                                   | 2 ++
 include/uapi/linux/ethtool_netlink_generated.h            | 1 +
 net/ethtool/rings.c                                       | 5 +++++
 6 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 1bc1bd7d33c2..a88e3c4fcc6f 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -449,6 +449,9 @@ attribute-sets:
       -
         name: hds-thresh-max
         type: u32
+      -
+        name: rx-buf-len-max
+        type: u32
 
   -
     name: mm-stat
@@ -2046,6 +2049,7 @@ operations:
             - rx-jumbo
             - tx
             - rx-buf-len
+            - rx-buf-len-max
             - tcp-data-split
             - cqe-size
             - tx-push
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index cae372f719d1..05a7f6b3f945 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -902,6 +902,7 @@ Kernel response contents:
   ``ETHTOOL_A_RINGS_RX_JUMBO``              u32     size of RX jumbo ring
   ``ETHTOOL_A_RINGS_TX``                    u32     size of TX ring
   ``ETHTOOL_A_RINGS_RX_BUF_LEN``            u32     size of buffers on the ring
+  ``ETHTOOL_A_RINGS_RX_BUF_LEN_MAX``        u32     max size of rx buffers
   ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``        u8      TCP header / data split
   ``ETHTOOL_A_RINGS_CQE_SIZE``              u32     Size of TX/RX CQE
   ``ETHTOOL_A_RINGS_TX_PUSH``               u8      flag of TX Push mode
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 998c734ff839..1c8a7ee2e459 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -377,6 +377,7 @@ static void otx2_get_ringparam(struct net_device *netdev,
 	ring->tx_max_pending = Q_COUNT(Q_SIZE_MAX);
 	ring->tx_pending = qs->sqe_cnt ? qs->sqe_cnt : Q_COUNT(Q_SIZE_4K);
 	kernel_ring->rx_buf_len = pfvf->hw.rbuf_len;
+	kernel_ring->rx_buf_len_max = 32768;
 	kernel_ring->cqe_size = pfvf->hw.xqe_size;
 }
 
@@ -399,7 +400,7 @@ static int otx2_set_ringparam(struct net_device *netdev,
 	/* Hardware supports max size of 32k for a receive buffer
 	 * and 1536 is typical ethernet frame size.
 	 */
-	if (rx_buf_len && (rx_buf_len < 1536 || rx_buf_len > 32768)) {
+	if (rx_buf_len && (rx_buf_len < 1536)) {
 		netdev_err(netdev,
 			   "Receive buffer range is 1536 - 32768");
 		return -EINVAL;
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index de5bd76a400c..9267bac16195 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -77,6 +77,7 @@ enum {
 /**
  * struct kernel_ethtool_ringparam - RX/TX ring configuration
  * @rx_buf_len: Current length of buffers on the rx ring.
+ * @rx_buf_len_max: Max length of buffers on the rx ring.
  * @tcp_data_split: Scatter packet headers and data to separate buffers
  * @tx_push: The flag of tx push mode
  * @rx_push: The flag of rx push mode
@@ -89,6 +90,7 @@ enum {
  */
 struct kernel_ethtool_ringparam {
 	u32	rx_buf_len;
+	u32	rx_buf_len_max;
 	u8	tcp_data_split;
 	u8	tx_push;
 	u8	rx_push;
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index e3b8813465d7..8b293d3499f1 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -192,6 +192,7 @@ enum {
 	ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
 	ETHTOOL_A_RINGS_HDS_THRESH,
 	ETHTOOL_A_RINGS_HDS_THRESH_MAX,
+	ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
 
 	__ETHTOOL_A_RINGS_CNT,
 	ETHTOOL_A_RINGS_MAX = (__ETHTOOL_A_RINGS_CNT - 1)
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index aeedd5ec6b8c..5e872ceab5dd 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -105,6 +105,9 @@ static int rings_fill_reply(struct sk_buff *skb,
 			  ringparam->tx_pending)))  ||
 	    (kr->rx_buf_len &&
 	     (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN, kr->rx_buf_len))) ||
+	    (kr->rx_buf_len_max &&
+	     (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
+			  kr->rx_buf_len_max))) ||
 	    (kr->tcp_data_split &&
 	     (nla_put_u8(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT,
 			 kr->tcp_data_split))) ||
@@ -281,6 +284,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 		err_attr = tb[ETHTOOL_A_RINGS_TX];
 	else if (kernel_ringparam.hds_thresh > kernel_ringparam.hds_thresh_max)
 		err_attr = tb[ETHTOOL_A_RINGS_HDS_THRESH];
+	else if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max)
+		err_attr = tb[ETHTOOL_A_RINGS_RX_BUF_LEN];
 	else
 		err_attr = NULL;
 	if (err_attr) {
-- 
2.49.0


* [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19  0:07   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members Pavel Begunkov
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Distinguish between rx_buf_len being the driver default vs user config.
Use 0 as a special value meaning "unset" / "restore the driver default".
This will be necessary later on to configure it per queue, but
the ability to restore defaults may be useful in itself.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/networking/ethtool-netlink.rst              | 2 +-
 drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 +++
 include/linux/ethtool.h                                   | 1 +
 net/ethtool/rings.c                                       | 2 +-
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 05a7f6b3f945..83c6ac72549b 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -983,7 +983,7 @@ threshold value, header and data will be split.
 ``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffers driver
 uses to receive packets. If the device uses different buffer pools for
 headers and payload (due to HDS, HW-GRO etc.) this setting must
-control the size of the payload buffers.
+control the size of the payload buffers. Setting to 0 restores driver default.
 
 CHANNELS_GET
 ============
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 1c8a7ee2e459..1d120b7825de 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -397,6 +397,9 @@ static int otx2_set_ringparam(struct net_device *netdev,
 	if (ring->rx_mini_pending || ring->rx_jumbo_pending)
 		return -EINVAL;
 
+	if (!rx_buf_len)
+		rx_buf_len = OTX2_DEFAULT_RBUF_LEN;
+
 	/* Hardware supports max size of 32k for a receive buffer
 	 * and 1536 is typical ethernet frame size.
 	 */
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 9267bac16195..e65f04a64266 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -77,6 +77,7 @@ enum {
 /**
  * struct kernel_ethtool_ringparam - RX/TX ring configuration
  * @rx_buf_len: Current length of buffers on the rx ring.
+ *		Setting to 0 means reset to driver default.
  * @rx_buf_len_max: Max length of buffers on the rx ring.
  * @tcp_data_split: Scatter packet headers and data to separate buffers
  * @tx_push: The flag of tx push mode
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 5e872ceab5dd..628546a1827b 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -139,7 +139,7 @@ const struct nla_policy ethnl_rings_set_policy[] = {
 	[ETHTOOL_A_RINGS_RX_MINI]		= { .type = NLA_U32 },
 	[ETHTOOL_A_RINGS_RX_JUMBO]		= { .type = NLA_U32 },
 	[ETHTOOL_A_RINGS_TX]			= { .type = NLA_U32 },
-	[ETHTOOL_A_RINGS_RX_BUF_LEN]            = NLA_POLICY_MIN(NLA_U32, 1),
+	[ETHTOOL_A_RINGS_RX_BUF_LEN]            = { .type = NLA_U32 },
 	[ETHTOOL_A_RINGS_TCP_DATA_SPLIT]	=
 		NLA_POLICY_MAX(NLA_U8, ETHTOOL_TCP_DATA_SPLIT_ENABLED),
 	[ETHTOOL_A_RINGS_CQE_SIZE]		= NLA_POLICY_MIN(NLA_U32, 1),
-- 
2.49.0


* [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19  1:46   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config Pavel Begunkov
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

hds_thresh and hds_config both live inside struct netdev_config
but have quite different semantics. hds_config is the user config
with ternary semantics (on/off/unset). hds_thresh is a straight-up
value, populated by the driver at init and only modified by
user space. We don't expect drivers to have to pick a special
hds_thresh value based on other configuration.

The two approaches have different advantages and downsides.
hds_thresh ("direct value") gives core easy access to current
device settings, but there's no way to express whether the value
comes from the user. It also requires the initialization by
the driver.

hds_config ("user config values") tells us what user wanted, but
doesn't give us the current value in the core.

Try to explain this a bit in the comments, so that we make a conscious
choice about which semantics we expect for new values.

Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
returns a hds_thresh value always as 0.") added the setting for the
benefit of netdevsim, which doesn't touch the value at all on get.
Again, this is just to clarify the intention and shouldn't cause any
functional change.
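
A sketch of how a driver is expected to consume the ternary user
config (mydrv_default_hds() is hypothetical):

  switch (dev->cfg->hds_config) {
  case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
          hds_on = true;
          break;
  case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
          hds_on = false;
          break;
  default:        /* unset: driver is free to decide */
          hds_on = mydrv_default_hds(bp);
  }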

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: applied clarification on relationship b/w HDS thresh and config]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/netdev_queues.h | 20 ++++++++++++++++++--
 net/ethtool/common.c        |  3 ++-
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 6e835972abd1..c8ce23e7c812 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -6,11 +6,27 @@
 
 /**
  * struct netdev_config - queue-related configuration for a netdev
- * @hds_thresh:		HDS Threshold value.
- * @hds_config:		HDS value from userspace.
  */
 struct netdev_config {
+	/* Direct value
+	 *
+	 * Driver default is expected to be fixed, and set in this struct
+	 * at init. From that point on user may change the value. There is
+	 * no explicit way to "unset" / restore driver default. Used only
+	 * when @hds_config is set.
+	 */
+	/** @hds_thresh: HDS Threshold value (ETHTOOL_A_RINGS_HDS_THRESH).
+	 */
 	u32	hds_thresh;
+
+	/* User config values
+	 *
+	 * Contain user configuration. If "set" driver must obey.
+	 * If "unset" driver is free to decide, and may change its choice
+	 * as other parameters change.
+	 */
+	/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
+	 */
 	u8	hds_config;
 };
 
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 4f58648a27ad..faa95f91efad 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -882,12 +882,13 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
 	memset(param, 0, sizeof(*param));
 	memset(kparam, 0, sizeof(*kparam));
 
+	kparam->hds_thresh = dev->cfg->hds_thresh;
+
 	param->cmd = ETHTOOL_GRINGPARAM;
 	dev->ethtool_ops->get_ringparam(dev, param, kparam, extack);
 
 	/* Driver gives us current state, we want to return current config */
 	kparam->tcp_data_split = dev->cfg->hds_config;
-	kparam->hds_thresh = dev->cfg->hds_thresh;
 }
 
 static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
-- 
2.49.0


* [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (4 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 19:32   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Add rx_buf_len to configuration maintained by the core.
Use "three-state" semantics where 0 means "driver default".

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/netdev_queues.h | 4 ++++
 net/ethtool/common.c        | 1 +
 net/ethtool/rings.c         | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index c8ce23e7c812..8c21ea9b9515 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -25,6 +25,10 @@ struct netdev_config {
 	 * If "unset" driver is free to decide, and may change its choice
 	 * as other parameters change.
 	 */
+	/** @rx_buf_len: Size of buffers on the Rx ring
+	 *		 (ETHTOOL_A_RINGS_RX_BUF_LEN).
+	 */
+	u32	rx_buf_len;
 	/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
 	 */
 	u8	hds_config;
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index faa95f91efad..44fd27480756 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -889,6 +889,7 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
 
 	/* Driver gives us current state, we want to return current config */
 	kparam->tcp_data_split = dev->cfg->hds_config;
+	kparam->rx_buf_len = dev->cfg->rx_buf_len;
 }
 
 static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 628546a1827b..6a74e7e4064e 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -41,6 +41,7 @@ static int rings_prepare_data(const struct ethnl_req_info *req_base,
 		return ret;
 
 	data->kernel_ringparam.tcp_data_split = dev->cfg->hds_config;
+	data->kernel_ringparam.rx_buf_len = dev->cfg->rx_buf_len;
 	data->kernel_ringparam.hds_thresh = dev->cfg->hds_thresh;
 
 	dev->ethtool_ops->get_ringparam(dev, &data->ringparam,
@@ -302,6 +303,7 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 		return -EINVAL;
 	}
 
+	dev->cfg_pending->rx_buf_len = kernel_ringparam.rx_buf_len;
 	dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
 	dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
 
-- 
2.49.0


* [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (5 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 19:37   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
inside struct bnxt. This will allow configuring the page size
at runtime in subsequent patches.

The MSS size calculation for older chips continues to use the constant.
I intend to support the configuration only on more recent HW; it
looks like setting this per queue won't work on older chips, and
per-queue support is the ultimate goal.

This patch should not change the current behavior, as the value
read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 27 ++++++++++---------
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  4 +--
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 2800a90fba1f..5307b33ea1c7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -900,7 +900,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
 
 static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
 {
-	return rxr->need_head_pool || PAGE_SIZE > BNXT_RX_PAGE_SIZE;
+	return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
 }
 
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -910,9 +910,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 {
 	struct page *page;
 
-	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
+	if (PAGE_SIZE > bp->rx_page_size) {
 		page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
-						BNXT_RX_PAGE_SIZE);
+						bp->rx_page_size);
 	} else {
 		page = page_pool_dev_alloc_pages(rxr->page_pool);
 		*offset = 0;
@@ -1150,9 +1150,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
 				bp->rx_dir);
-	skb = napi_build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
+	skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
 		return NULL;
@@ -1184,7 +1184,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
 				bp->rx_dir);
 
 	if (unlikely(!payload))
@@ -1198,7 +1198,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 
 	skb_mark_for_recycle(skb);
 	off = (void *)data_ptr - page_address(page);
-	skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE);
+	skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
 	memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
 	       payload + NET_IP_ALIGN);
 
@@ -1283,7 +1283,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 		if (skb) {
 			skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
 					       cons_rx_buf->offset,
-					       frag_len, BNXT_RX_PAGE_SIZE);
+					       frag_len, bp->rx_page_size);
 		} else {
 			skb_frag_t *frag = &shinfo->frags[i];
 
@@ -1308,7 +1308,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 			if (skb) {
 				skb->len -= frag_len;
 				skb->data_len -= frag_len;
-				skb->truesize -= BNXT_RX_PAGE_SIZE;
+				skb->truesize -= bp->rx_page_size;
 			}
 
 			--shinfo->nr_frags;
@@ -1323,7 +1323,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 		}
 
 		page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
-						  BNXT_RX_PAGE_SIZE);
+						  bp->rx_page_size);
 
 		total_frag_len += frag_len;
 		prod = NEXT_RX_AGG(prod);
@@ -4472,7 +4472,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
 	ring = &rxr->rx_agg_ring_struct;
 	ring->fw_ring_id = INVALID_HW_RING_ID;
 	if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
-		type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
+		type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
 			RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
 
 		bnxt_init_rxbd_pages(ring, type);
@@ -4734,7 +4734,7 @@ void bnxt_set_ring_params(struct bnxt *bp)
 	bp->rx_agg_nr_pages = 0;
 
 	if (bp->flags & BNXT_FLAG_TPA || bp->flags & BNXT_FLAG_HDS)
-		agg_factor = min_t(u32, 4, 65536 / BNXT_RX_PAGE_SIZE);
+		agg_factor = min_t(u32, 4, 65536 / bp->rx_page_size);
 
 	bp->flags &= ~BNXT_FLAG_JUMBO;
 	if (rx_space > PAGE_SIZE && !(bp->flags & BNXT_FLAG_NO_AGG_RINGS)) {
@@ -7046,7 +7046,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
 	if (ring_type == HWRM_RING_ALLOC_AGG) {
 		req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
 		req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
-		req->rx_buf_size = cpu_to_le16(BNXT_RX_PAGE_SIZE);
+		req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
 		enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
 	} else {
 		req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -16576,6 +16576,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bp = netdev_priv(dev);
 	bp->board_idx = ent->driver_data;
 	bp->msg_enable = BNXT_DEF_MSG_ENABLE;
+	bp->rx_page_size = BNXT_RX_PAGE_SIZE;
 	bnxt_set_max_func_irqs(bp, max_irqs);
 
 	if (bnxt_vf_pciid(bp->board_idx))
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index fda0d3cc6227..ac841d02d7ad 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2358,6 +2358,7 @@ struct bnxt {
 	u16			max_tpa;
 	u32			rx_buf_size;
 	u32			rx_buf_use_size;	/* useable size */
+	u16			rx_page_size;
 	u16			rx_offset;
 	u16			rx_dma_offset;
 	enum dma_data_direction	rx_dir;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 58d579dca3f1..41d3ba56ba41 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
 			u16 cons, u8 *data_ptr, unsigned int len,
 			struct xdp_buff *xdp)
 {
-	u32 buflen = BNXT_RX_PAGE_SIZE;
+	u32 buflen = bp->rx_page_size;
 	struct bnxt_sw_rx_bd *rx_buf;
 	struct pci_dev *pdev;
 	dma_addr_t mapping;
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
 
 	xdp_update_skb_shared_info(skb, num_frags,
 				   sinfo->xdp_frags_size,
-				   BNXT_RX_PAGE_SIZE * num_frags,
+				   bp->rx_page_size * num_frags,
 				   xdp_buff_is_frag_pfmemalloc(xdp));
 	return skb;
 }
-- 
2.49.0


* [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (6 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 19:43   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

If the user decides to increase the buffer size for the agg ring,
we need to ask the page pool for higher-order pages.
There is no need to use larger pages for header frags, so if
the user increases the size of the agg ring buffers, switch
to a separate header page automatically.
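
For reference, with 4K pages the order/length math below works out as:

  pp.order = get_order(32 * 1024);      /* == 3, an 8-page compound buffer */
  pp.max_len = PAGE_SIZE << pp.order;   /* == 32768 bytes */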

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: calculate adjust max_len]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5307b33ea1c7..d3d9b72ef313 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3824,11 +3824,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
 	if (BNXT_RX_PAGE_MODE(bp))
 		pp.pool_size += bp->rx_ring_size / rx_size_fac;
+
+	pp.order = get_order(bp->rx_page_size);
 	pp.nid = numa_node;
 	pp.netdev = bp->dev;
 	pp.dev = &bp->pdev->dev;
 	pp.dma_dir = bp->rx_dir;
-	pp.max_len = PAGE_SIZE;
+	pp.max_len = PAGE_SIZE << pp.order;
 	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV |
 		   PP_FLAG_ALLOW_UNREADABLE_NETMEM;
 	pp.queue_idx = rxr->bnapi->index;
@@ -3839,7 +3841,10 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	rxr->page_pool = pool;
 
 	rxr->need_head_pool = page_pool_is_unreadable(pool);
+	rxr->need_head_pool |= !!pp.order;
 	if (bnxt_separate_head_pool(rxr)) {
+		pp.order = 0;
+		pp.max_len = PAGE_SIZE;
 		pp.pool_size = min(bp->rx_ring_size / rx_size_fac, 1024);
 		pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 		pool = page_pool_create(&pp);
-- 
2.49.0


* [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (7 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 20:10   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

bnxt seems to be able to aggregate data up to 32kB without any issue.
The driver is already capable of doing this on systems with higher-order
pages. While for systems with 4k pages we historically preferred
to stick to small buffers because they are easier to allocate, the
zero-copy APIs remove the allocation problem: the ZC memory is
pre-allocated and of a fixed size.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  3 ++-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 21 ++++++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index ac841d02d7ad..56aafae568f8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -758,7 +758,8 @@ struct nqe_cn {
 #define BNXT_RX_PAGE_SHIFT PAGE_SHIFT
 #endif
 
-#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
+#define BNXT_MAX_RX_PAGE_SIZE	(1 << 15)
+#define BNXT_RX_PAGE_SIZE	(1 << BNXT_RX_PAGE_SHIFT)
 
 #define BNXT_MAX_MTU		9500
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 1b37612b1c01..2e130eeeabe5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -835,6 +835,8 @@ static void bnxt_get_ringparam(struct net_device *dev,
 	ering->rx_jumbo_pending = bp->rx_agg_ring_size;
 	ering->tx_pending = bp->tx_ring_size;
 
+	kernel_ering->rx_buf_len_max = BNXT_MAX_RX_PAGE_SIZE;
+	kernel_ering->rx_buf_len = bp->rx_page_size;
 	kernel_ering->hds_thresh_max = BNXT_HDS_THRESHOLD_MAX;
 }
 
@@ -862,6 +864,21 @@ static int bnxt_set_ringparam(struct net_device *dev,
 		return -EINVAL;
 	}
 
+	if (!kernel_ering->rx_buf_len)	/* Zero means restore default */
+		kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
+
+	if (kernel_ering->rx_buf_len != bp->rx_page_size &&
+	    !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+		NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+		return -EINVAL;
+	}
+	if (!is_power_of_2(kernel_ering->rx_buf_len) ||
+	    kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+	    kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+		NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
+		return -ERANGE;
+	}
+
 	if (netif_running(dev))
 		bnxt_close_nic(bp, false, false);
 
@@ -874,6 +891,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
 
 	bp->rx_ring_size = ering->rx_pending;
 	bp->tx_ring_size = ering->tx_pending;
+	bp->rx_page_size = kernel_ering->rx_buf_len;
 	bnxt_set_ring_params(bp);
 
 	if (netif_running(dev))
@@ -5489,7 +5507,8 @@ const struct ethtool_ops bnxt_ethtool_ops = {
 				     ETHTOOL_COALESCE_STATS_BLOCK_USECS |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
 				     ETHTOOL_COALESCE_USE_CQE,
-	.supported_ring_params	= ETHTOOL_RING_USE_TCP_DATA_SPLIT |
+	.supported_ring_params	= ETHTOOL_RING_USE_RX_BUF_LEN |
+				  ETHTOOL_RING_USE_TCP_DATA_SPLIT |
 				  ETHTOOL_RING_USE_HDS_THRS,
 	.get_link_ksettings	= bnxt_get_link_ksettings,
 	.set_link_ksettings	= bnxt_set_link_ksettings,
-- 
2.49.0


* [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (8 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 20:15   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 11/23] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

netdev_config manipulation will become slightly more complicated
soon, and we will need to call it from ethtool as well as from the
queue API. Encapsulate the logic into helper functions.
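
The resulting lifecycle, roughly (a sketch; the success-path swap
lives in the ethnl code outside the hunks below):

  err = netdev_reconfig_start(dev);     /* cfg_pending = mutable copy of cfg */
  if (err)
          goto out_unlock;

  err = ops->set(req_info, info);       /* changes land in dev->cfg_pending */
  if (err >= 0)
          swap(dev->cfg, dev->cfg_pending);       /* commit */

  __netdev_free_config(dev->cfg_pending);  /* old cfg, or the unused copy */
  dev->cfg_pending = dev->cfg;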

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/Makefile        |  2 +-
 net/core/dev.c           |  7 ++-----
 net/core/dev.h           |  5 +++++
 net/core/netdev_config.c | 43 ++++++++++++++++++++++++++++++++++++++++
 net/ethtool/netlink.c    | 14 ++++++-------
 5 files changed, 57 insertions(+), 14 deletions(-)
 create mode 100644 net/core/netdev_config.c

diff --git a/net/core/Makefile b/net/core/Makefile
index b2a76ce33932..4db487396094 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,7 +19,7 @@ obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
 
 obj-y += net-sysfs.o
 obj-y += hotdata.o
-obj-y += netdev_rx_queue.o
+obj-y += netdev_config.o netdev_rx_queue.o
 obj-$(CONFIG_PAGE_POOL) += page_pool.o page_pool_user.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a3c0f40a93f..7cd4e5eab441 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11873,10 +11873,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (!dev->ethtool)
 		goto free_all;
 
-	dev->cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
-	if (!dev->cfg)
+	if (netdev_alloc_config(dev))
 		goto free_all;
-	dev->cfg_pending = dev->cfg;
 
 	dev->num_napi_configs = maxqs;
 	napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
@@ -11947,8 +11945,7 @@ void free_netdev(struct net_device *dev)
 		return;
 	}
 
-	WARN_ON(dev->cfg != dev->cfg_pending);
-	kfree(dev->cfg);
+	netdev_free_config(dev);
 	kfree(dev->ethtool);
 	netif_free_tx_queues(dev);
 	netif_free_rx_queues(dev);
diff --git a/net/core/dev.h b/net/core/dev.h
index d6b08d435479..7041c8bd2a0f 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -92,6 +92,11 @@ extern struct rw_semaphore dev_addr_sem;
 extern struct list_head net_todo_list;
 void netdev_run_todo(void);
 
+int netdev_alloc_config(struct net_device *dev);
+void __netdev_free_config(struct netdev_config *cfg);
+void netdev_free_config(struct net_device *dev);
+int netdev_reconfig_start(struct net_device *dev);
+
 /* netdev management, shared between various uAPI entry points */
 struct netdev_name_node {
 	struct hlist_node hlist;
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
new file mode 100644
index 000000000000..270b7f10a192
--- /dev/null
+++ b/net/core/netdev_config.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/netdevice.h>
+#include <net/netdev_queues.h>
+
+#include "dev.h"
+
+int netdev_alloc_config(struct net_device *dev)
+{
+	struct netdev_config *cfg;
+
+	cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+	if (!cfg)
+		return -ENOMEM;
+
+	dev->cfg = cfg;
+	dev->cfg_pending = cfg;
+	return 0;
+}
+
+void __netdev_free_config(struct netdev_config *cfg)
+{
+	kfree(cfg);
+}
+
+void netdev_free_config(struct net_device *dev)
+{
+	WARN_ON(dev->cfg != dev->cfg_pending);
+	__netdev_free_config(dev->cfg);
+}
+
+int netdev_reconfig_start(struct net_device *dev)
+{
+	struct netdev_config *cfg;
+
+	WARN_ON(dev->cfg != dev->cfg_pending);
+	cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+	if (!cfg)
+		return -ENOMEM;
+
+	dev->cfg_pending = cfg;
+	return 0;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 2f813f25f07e..d376d3043177 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -6,6 +6,7 @@
 #include <linux/ethtool_netlink.h>
 #include <linux/phy_link_topology.h>
 #include <linux/pm_runtime.h>
+#include "../core/dev.h"
 #include "netlink.h"
 #include "module_fw.h"
 
@@ -906,12 +907,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
 
 	rtnl_lock();
 	netdev_lock_ops(dev);
-	dev->cfg_pending = kmemdup(dev->cfg, sizeof(*dev->cfg),
-				   GFP_KERNEL_ACCOUNT);
-	if (!dev->cfg_pending) {
-		ret = -ENOMEM;
-		goto out_tie_cfg;
-	}
+	ret = netdev_reconfig_start(dev);
+	if (ret)
+		goto out_unlock;
 
 	ret = ethnl_ops_begin(dev);
 	if (ret < 0)
@@ -930,9 +928,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
 out_ops:
 	ethnl_ops_complete(dev);
 out_free_cfg:
-	kfree(dev->cfg_pending);
-out_tie_cfg:
+	__netdev_free_config(dev->cfg_pending);
 	dev->cfg_pending = dev->cfg;
+out_unlock:
 	netdev_unlock_ops(dev);
 	rtnl_unlock();
 out_dev:
-- 
2.49.0


* [PATCH net-next v3 11/23] net: reduce indent of struct netdev_queue_mgmt_ops members
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (9 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Trivial change, reduce the indent. I think the original was copied
from real NDOs. It's unnecessarily deep and makes passing struct
args problematic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/netdev_queues.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 8c21ea9b9515..d73f9023c96f 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -152,18 +152,18 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  * be called for an interface which is open.
  */
 struct netdev_queue_mgmt_ops {
-	size_t			ndo_queue_mem_size;
-	int			(*ndo_queue_mem_alloc)(struct net_device *dev,
-						       void *per_queue_mem,
-						       int idx);
-	void			(*ndo_queue_mem_free)(struct net_device *dev,
-						      void *per_queue_mem);
-	int			(*ndo_queue_start)(struct net_device *dev,
-						   void *per_queue_mem,
-						   int idx);
-	int			(*ndo_queue_stop)(struct net_device *dev,
-						  void *per_queue_mem,
-						  int idx);
+	size_t	ndo_queue_mem_size;
+	int	(*ndo_queue_mem_alloc)(struct net_device *dev,
+				       void *per_queue_mem,
+				       int idx);
+	void	(*ndo_queue_mem_free)(struct net_device *dev,
+				      void *per_queue_mem);
+	int	(*ndo_queue_start)(struct net_device *dev,
+				   void *per_queue_mem,
+				   int idx);
+	int	(*ndo_queue_stop)(struct net_device *dev,
+				  void *per_queue_mem,
+				  int idx);
 };
 
 /**
-- 
2.49.0


* [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (10 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 11/23] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 21:29   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Create an array of config structs to store per-queue config.
Pass these structs through the queue API. Drivers can also retrieve
the config for a single queue by calling netdev_queue_config()
directly.
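
A driver callback then sees the per-queue config directly, e.g.
(sketch; mydrv_* names are hypothetical and the rx_buf_len field is
only wired up by later patches in the series):

  static int mydrv_queue_mem_alloc(struct net_device *dev,
                                   struct netdev_queue_config *qcfg,
                                   void *per_queue_mem, int idx)
  {
          struct mydrv_queue *q = per_queue_mem;

          /* size this queue from its own config, not device-global state */
          q->buf_len = qcfg->rx_buf_len ?: MYDRV_DEFAULT_RX_BUF_LEN;
          return mydrv_alloc_rings(dev, q, idx);
  }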

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: patch up mlx callbacks with unused qcfg]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  8 ++-
 drivers/net/ethernet/google/gve/gve_main.c    |  9 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  9 +--
 drivers/net/netdevsim/netdev.c                |  6 +-
 include/net/netdev_queues.h                   | 19 ++++++
 net/core/dev.h                                |  3 +
 net/core/netdev_config.c                      | 58 +++++++++++++++++++
 net/core/netdev_rx_queue.c                    | 11 +++-
 8 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d3d9b72ef313..48ff6f024e07 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15824,7 +15824,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
 	.get_base_stats		= bnxt_get_base_stats,
 };
 
-static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_mem_alloc(struct net_device *dev,
+				struct netdev_queue_config *qcfg,
+				void *qmem, int idx)
 {
 	struct bnxt_rx_ring_info *rxr, *clone;
 	struct bnxt *bp = netdev_priv(dev);
@@ -15992,7 +15994,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
 	dst->rx_agg_bmap = src->rx_agg_bmap;
 }
 
-static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_start(struct net_device *dev,
+			    struct netdev_queue_config *qcfg,
+			    void *qmem, int idx)
 {
 	struct bnxt *bp = netdev_priv(dev);
 	struct bnxt_rx_ring_info *rxr, *clone;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 1f411d7c4373..f40edab616d8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2580,8 +2580,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
 		gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
 }
 
-static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
-				  int idx)
+static int gve_rx_queue_mem_alloc(struct net_device *dev,
+				  struct netdev_queue_config *qcfg,
+				  void *per_q_mem, int idx)
 {
 	struct gve_priv *priv = netdev_priv(dev);
 	struct gve_rx_alloc_rings_cfg cfg = {0};
@@ -2602,7 +2603,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
 	return err;
 }
 
-static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
+static int gve_rx_queue_start(struct net_device *dev,
+			      struct netdev_queue_config *qcfg,
+			      void *per_q_mem, int idx)
 {
 	struct gve_priv *priv = netdev_priv(dev);
 	struct gve_rx_ring *gve_per_q_mem;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 21bb88c5d3dc..83264c17a4f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5541,8 +5541,9 @@ struct mlx5_qmgmt_data {
 	struct mlx5e_channel_param cparam;
 };
 
-static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq,
-				 int queue_index)
+static int mlx5e_queue_mem_alloc(struct net_device *dev,
+				 struct netdev_queue_config *qcfg,
+				 void *newq, int queue_index)
 {
 	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
 	struct mlx5e_priv *priv = netdev_priv(dev);
@@ -5603,8 +5604,8 @@ static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
 	return 0;
 }
 
-static int mlx5e_queue_start(struct net_device *dev, void *newq,
-			     int queue_index)
+static int mlx5e_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
+			     void *newq, int queue_index)
 {
 	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
 	struct mlx5e_priv *priv = netdev_priv(dev);
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0178219f0db5..985c3403ec57 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -733,7 +733,8 @@ struct nsim_queue_mem {
 };
 
 static int
-nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
+		     void *per_queue_mem, int idx)
 {
 	struct nsim_queue_mem *qmem = per_queue_mem;
 	struct netdevsim *ns = netdev_priv(dev);
@@ -782,7 +783,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
 }
 
 static int
-nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
+		 void *per_queue_mem, int idx)
 {
 	struct nsim_queue_mem *qmem = per_queue_mem;
 	struct netdevsim *ns = netdev_priv(dev);
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index d73f9023c96f..b850cff71d12 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -32,6 +32,13 @@ struct netdev_config {
 	/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
 	 */
 	u8	hds_config;
+
+	/** @qcfg: per-queue configuration */
+	struct netdev_queue_config *qcfg;
+};
+
+/* Same semantics as fields in struct netdev_config */
+struct netdev_queue_config {
 };
 
 /* See the netdev.yaml spec for definition of each statistic */
@@ -136,6 +143,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  *
  * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
  *
+ * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
+ *			defaults. Queue config structs are passed to this
+ *			helper before the user-requested settings are applied.
+ *
  * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
  *			 The new memory is written at the specified address.
  *
@@ -153,12 +164,17 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  */
 struct netdev_queue_mgmt_ops {
 	size_t	ndo_queue_mem_size;
+	void	(*ndo_queue_cfg_defaults)(struct net_device *dev,
+					  int idx,
+					  struct netdev_queue_config *qcfg);
 	int	(*ndo_queue_mem_alloc)(struct net_device *dev,
+				       struct netdev_queue_config *qcfg,
 				       void *per_queue_mem,
 				       int idx);
 	void	(*ndo_queue_mem_free)(struct net_device *dev,
 				      void *per_queue_mem);
 	int	(*ndo_queue_start)(struct net_device *dev,
+				   struct netdev_queue_config *qcfg,
 				   void *per_queue_mem,
 				   int idx);
 	int	(*ndo_queue_stop)(struct net_device *dev,
@@ -166,6 +182,9 @@ struct netdev_queue_mgmt_ops {
 				  int idx);
 };
 
+void netdev_queue_config(struct net_device *dev, int rxq,
+			 struct netdev_queue_config *qcfg);
+
 /**
  * DOC: Lockless queue stopping / waking helpers.
  *
diff --git a/net/core/dev.h b/net/core/dev.h
index 7041c8bd2a0f..a553a0f1f846 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -9,6 +9,7 @@
 #include <net/netdev_lock.h>
 
 struct net;
+struct netdev_queue_config;
 struct netlink_ext_ack;
 struct cpumask;
 
@@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
 void __netdev_free_config(struct netdev_config *cfg);
 void netdev_free_config(struct net_device *dev);
 int netdev_reconfig_start(struct net_device *dev);
+void __netdev_queue_config(struct net_device *dev, int rxq,
+			   struct netdev_queue_config *qcfg, bool pending);
 
 /* netdev management, shared between various uAPI entry points */
 struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index 270b7f10a192..bad2d53522f0 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -8,18 +8,29 @@
 int netdev_alloc_config(struct net_device *dev)
 {
 	struct netdev_config *cfg;
+	unsigned int maxqs;
 
 	cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
 	if (!cfg)
 		return -ENOMEM;
 
+	maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+	cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
+	if (!cfg->qcfg)
+		goto err_free_cfg;
+
 	dev->cfg = cfg;
 	dev->cfg_pending = cfg;
 	return 0;
+
+err_free_cfg:
+	kfree(cfg);
+	return -ENOMEM;
 }
 
 void __netdev_free_config(struct netdev_config *cfg)
 {
+	kfree(cfg->qcfg);
 	kfree(cfg);
 }
 
@@ -32,12 +43,59 @@ void netdev_free_config(struct net_device *dev)
 int netdev_reconfig_start(struct net_device *dev)
 {
 	struct netdev_config *cfg;
+	unsigned int maxqs;
 
 	WARN_ON(dev->cfg != dev->cfg_pending);
 	cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
 	if (!cfg)
 		return -ENOMEM;
 
+	maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+	cfg->qcfg = kmemdup_array(dev->cfg->qcfg, maxqs, sizeof(*cfg->qcfg),
+				  GFP_KERNEL_ACCOUNT);
+	if (!cfg->qcfg)
+		goto err_free_cfg;
+
 	dev->cfg_pending = cfg;
 	return 0;
+
+err_free_cfg:
+	kfree(cfg);
+	return -ENOMEM;
+}
+
+void __netdev_queue_config(struct net_device *dev, int rxq,
+			   struct netdev_queue_config *qcfg, bool pending)
+{
+	memset(qcfg, 0, sizeof(*qcfg));
+
+	/* Get defaults from the driver, in case user config not set */
+	if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
+		dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+}
+
+/**
+ * netdev_queue_config() - get configuration for a given queue
+ * @dev:  net_device instance
+ * @rxq:  index of the queue of interest
+ * @qcfg: queue configuration struct (output)
+ *
+ * Render the configuration for a given queue. This helper should be used
+ * by drivers which support queue configuration to retrieve config for
+ * a particular queue.
+ *
+ * @qcfg is an output parameter and is always fully initialized by this
+ * function. Some values may not be set by the user, drivers may either
+ * deal with the "unset" values in @qcfg, or provide the callback
+ * to populate defaults in queue_management_ops.
+ *
+ * Note that this helper returns pending config, as it is expected that
+ * "old" queues are retained until config is successful so they can
+ * be restored directly without asking for the config.
+ */
+void netdev_queue_config(struct net_device *dev, int rxq,
+			 struct netdev_queue_config *qcfg)
+{
+	__netdev_queue_config(dev, rxq, qcfg, true);
 }
+EXPORT_SYMBOL(netdev_queue_config);
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 3bf1151d8061..fb87ce219a8a 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -7,12 +7,14 @@
 #include <net/netdev_rx_queue.h>
 #include <net/page_pool/memory_provider.h>
 
+#include "dev.h"
 #include "page_pool_priv.h"
 
 int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 {
 	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
 	const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+	struct netdev_queue_config qcfg;
 	void *new_mem, *old_mem;
 	int err;
 
@@ -32,7 +34,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 		goto err_free_new_mem;
 	}
 
-	err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
+	netdev_queue_config(dev, rxq_idx, &qcfg);
+
+	err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
 	if (err)
 		goto err_free_old_mem;
 
@@ -45,7 +49,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 		if (err)
 			goto err_free_new_queue_mem;
 
-		err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
+		err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
 		if (err)
 			goto err_start_queue;
 	} else {
@@ -60,6 +64,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 	return 0;
 
 err_start_queue:
+	__netdev_queue_config(dev, rxq_idx, &qcfg, false);
 	/* Restarting the queue with old_mem should be successful as we haven't
 	 * changed any of the queue configuration, and there is not much we can
 	 * do to recover from a failure here.
@@ -67,7 +72,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 	 * WARN if we fail to recover the old rx queue, and at least free
 	 * old_mem so we don't also leak that.
 	 */
-	if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
+	if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
 		WARN(1,
 		     "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
 		     rxq_idx);
-- 
2.49.0



* [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart()
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (11 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 21:30   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 14/23] net: add queue config validation callback Pavel Begunkov
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Pass extack to netdev_rx_queue_restart(). A subsequent change will need it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 drivers/net/netdevsim/netdev.c            | 2 +-
 include/net/netdev_rx_queue.h             | 3 ++-
 net/core/netdev_rx_queue.c                | 7 ++++---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 48ff6f024e07..4cb92267251d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11544,7 +11544,7 @@ static void bnxt_irq_affinity_notify(struct irq_affinity_notify *notify,
 
 	netdev_lock(irq->bp->dev);
 	if (netif_running(irq->bp->dev)) {
-		err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr);
+		err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr, NULL);
 		if (err)
 			netdev_err(irq->bp->dev,
 				   "RX queue restart failed: err=%d\n", err);
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 985c3403ec57..919088822159 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -869,7 +869,7 @@ nsim_qreset_write(struct file *file, const char __user *data,
 	}
 
 	ns->rq_reset_mode = mode;
-	ret = netdev_rx_queue_restart(ns->netdev, queue);
+	ret = netdev_rx_queue_restart(ns->netdev, queue, NULL);
 	ns->rq_reset_mode = 0;
 	if (ret)
 		goto exit_unlock;
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..a7def1f94823 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -56,6 +56,7 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
 	return index;
 }
 
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq,
+			    struct netlink_ext_ack *extack);
 
 #endif
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index fb87ce219a8a..420b956a40e4 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -10,7 +10,8 @@
 #include "dev.h"
 #include "page_pool_priv.h"
 
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
+			    struct netlink_ext_ack *extack)
 {
 	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
 	const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
@@ -134,7 +135,7 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 #endif
 
 	rxq->mp_params = *p;
-	ret = netdev_rx_queue_restart(dev, rxq_idx);
+	ret = netdev_rx_queue_restart(dev, rxq_idx, extack);
 	if (ret) {
 		rxq->mp_params.mp_ops = NULL;
 		rxq->mp_params.mp_priv = NULL;
@@ -177,7 +178,7 @@ void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
 
 	rxq->mp_params.mp_ops = NULL;
 	rxq->mp_params.mp_priv = NULL;
-	err = netdev_rx_queue_restart(dev, ifq_idx);
+	err = netdev_rx_queue_restart(dev, ifq_idx, NULL);
 	WARN_ON(err && err != -ENETDOWN);
 }
 
-- 
2.49.0



* [PATCH net-next v3 14/23] net: add queue config validation callback
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (12 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 21:54   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

I imagine (tm) that as the number of per-queue configuration
options grows, some of them may conflict for certain drivers.
While drivers can obviously do all the validation locally,
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).

Add a centralized callback for validating the queue config
to the queue ops. The callback gets invoked before each queue restart
and when ring params are modified.

For NIC-wide changes the callback gets invoked for each active
(or to-be-active) queue, and additionally with a negative queue
index for the NIC-wide defaults. The NIC-wide check is needed in
case all queues have an override active when the NIC-wide setting
is changed to an unsupported one. Alternatively we could check
the settings when new queues are enabled (in the channel API),
but accepting invalid config is a bad idea. Users may expect
that resetting a queue override will always work.

The "trick" of passing a negative index is a bit ugly; we may
want to revisit it if it causes confusion or bugs. Existing drivers
don't care about the index, so it "just works".
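
To sketch the shape of the callback (illustrative only -- a
hypothetical driver, and the field checked here is only added to
struct netdev_queue_config later in the series):

	static int foo_queue_cfg_validate(struct net_device *dev, int idx,
					  struct netdev_queue_config *qcfg,
					  struct netlink_ext_ack *extack)
	{
		/* idx < 0 carries the NIC-wide defaults, idx >= 0 a
		 * specific queue. The check doesn't depend on the index,
		 * which is why existing drivers can ignore it.
		 */
		if (!is_power_of_2(qcfg->rx_buf_len)) {
			NL_SET_ERR_MSG_MOD(extack,
					   "rx-buf-len is not a power of 2");
			return -ERANGE;
		}
		return 0;
	}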

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/netdev_queues.h | 12 ++++++++++++
 net/core/dev.h              |  2 ++
 net/core/netdev_config.c    | 20 ++++++++++++++++++++
 net/core/netdev_rx_queue.c  |  6 ++++++
 net/ethtool/rings.c         |  5 +++++
 5 files changed, 45 insertions(+)

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index b850cff71d12..d0cc475ec51e 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -147,6 +147,14 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  *			defaults. Queue config structs are passed to this
  *			helper before the user-requested settings are applied.
  *
+ * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
+ *			Called when configuration affecting a queue may be
+ *			changing, either due to NIC-wide config, or config
+ *			scoped to the queue at a specified index.
+ *			When NIC-wide config is changed the callback will
+ *			be invoked for all queues, and in addition to that
+ *			with a negative queue index for the base settings.
+ *
  * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
  *			 The new memory is written at the specified address.
  *
@@ -167,6 +175,10 @@ struct netdev_queue_mgmt_ops {
 	void	(*ndo_queue_cfg_defaults)(struct net_device *dev,
 					  int idx,
 					  struct netdev_queue_config *qcfg);
+	int	(*ndo_queue_cfg_validate)(struct net_device *dev,
+					  int idx,
+					  struct netdev_queue_config *qcfg,
+					  struct netlink_ext_ack *extack);
 	int	(*ndo_queue_mem_alloc)(struct net_device *dev,
 				       struct netdev_queue_config *qcfg,
 				       void *per_queue_mem,
diff --git a/net/core/dev.h b/net/core/dev.h
index a553a0f1f846..523d50e6f88d 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
 int netdev_reconfig_start(struct net_device *dev);
 void __netdev_queue_config(struct net_device *dev, int rxq,
 			   struct netdev_queue_config *qcfg, bool pending);
+int netdev_queue_config_revalidate(struct net_device *dev,
+				   struct netlink_ext_ack *extack);
 
 /* netdev management, shared between various uAPI entry points */
 struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index bad2d53522f0..fc700b77e4eb 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
 	__netdev_queue_config(dev, rxq, qcfg, true);
 }
 EXPORT_SYMBOL(netdev_queue_config);
+
+int netdev_queue_config_revalidate(struct net_device *dev,
+				   struct netlink_ext_ack *extack)
+{
+	const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+	struct netdev_queue_config qcfg;
+	int i, err;
+
+	if (!qops || !qops->ndo_queue_cfg_validate)
+		return 0;
+
+	for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
+		netdev_queue_config(dev, i, &qcfg);
+		err = qops->ndo_queue_cfg_validate(dev, i, &qcfg, extack);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 420b956a40e4..39834b196e95 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -37,6 +37,12 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
 
 	netdev_queue_config(dev, rxq_idx, &qcfg);
 
+	if (qops->ndo_queue_cfg_validate) {
+		err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
+		if (err)
+			goto err_free_old_mem;
+	}
+
 	err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
 	if (err)
 		goto err_free_old_mem;
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 6a74e7e4064e..7884d10c090f 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -4,6 +4,7 @@
 
 #include "netlink.h"
 #include "common.h"
+#include "../core/dev.h"
 
 struct rings_req_info {
 	struct ethnl_req_info		base;
@@ -307,6 +308,10 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 	dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
 	dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
 
+	ret = netdev_queue_config_revalidate(dev, info->extack);
+	if (ret)
+		return ret;
+
 	ret = dev->ethtool_ops->set_ringparam(dev, &ringparam,
 					      &kernel_ringparam, info->extack);
 	return ret < 0 ? ret : 1;
-- 
2.49.0



* [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (13 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 14/23] net: add queue config validation callback Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 21:57   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue Pavel Begunkov
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Core provides a centralized callback for validating per-queue settings,
but the callback is part of the queue management ops. Having the ops
conditionally set complicates the parts of the driver which could
otherwise lean on the core to feed it the correct settings.

Always set the queue ops, but provide no restart-related callbacks if
queue ops are not supported by the device. This should maintain the
current behavior; the check in netdev_rx_queue_restart() looks at both
the op struct and the individual ops.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: reflow mgmt ops assignment]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4cb92267251d..467e8a0745e1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16131,6 +16131,9 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
 	.ndo_queue_stop		= bnxt_queue_stop,
 };
 
+static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+};
+
 static void bnxt_remove_one(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
@@ -16784,6 +16787,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	if (BNXT_SUPPORTS_NTUPLE_VNIC(bp))
 		bp->rss_cap |= BNXT_RSS_CAP_MULTI_RSS_CTX;
+
+	dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
 	if (BNXT_SUPPORTS_QUEUE_API(bp))
 		dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
 	dev->request_ops_lock = true;
-- 
2.49.0



* [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (14 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-25  6:24   ` Somnath Kotur
  2025-08-18 13:57 ` [PATCH net-next v3 17/23] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

In normal operation only a subset of queues is configured for
zero-copy. Since zero-copy is the main use case for larger buffer
sizes, we need to configure the sizes per queue. For example, an
application may bind large buffers to a few dedicated queues while
the remaining queues keep serving regular traffic with the default
size.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 46 ++++++++++---------
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  6 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |  2 +-
 4 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 467e8a0745e1..50f663777843 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -900,7 +900,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
 
 static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
 {
-	return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
+	return rxr->need_head_pool || PAGE_SIZE > rxr->rx_page_size;
 }
 
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -910,9 +910,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 {
 	struct page *page;
 
-	if (PAGE_SIZE > bp->rx_page_size) {
+	if (PAGE_SIZE > rxr->rx_page_size) {
 		page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
-						bp->rx_page_size);
+						rxr->rx_page_size);
 	} else {
 		page = page_pool_dev_alloc_pages(rxr->page_pool);
 		*offset = 0;
@@ -1150,9 +1150,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
 				bp->rx_dir);
-	skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
+	skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
 		return NULL;
@@ -1184,7 +1184,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
 				bp->rx_dir);
 
 	if (unlikely(!payload))
@@ -1198,7 +1198,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 
 	skb_mark_for_recycle(skb);
 	off = (void *)data_ptr - page_address(page);
-	skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
+	skb_add_rx_frag(skb, 0, page, off, len, rxr->rx_page_size);
 	memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
 	       payload + NET_IP_ALIGN);
 
@@ -1283,7 +1283,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 		if (skb) {
 			skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
 					       cons_rx_buf->offset,
-					       frag_len, bp->rx_page_size);
+					       frag_len, rxr->rx_page_size);
 		} else {
 			skb_frag_t *frag = &shinfo->frags[i];
 
@@ -1308,7 +1308,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 			if (skb) {
 				skb->len -= frag_len;
 				skb->data_len -= frag_len;
-				skb->truesize -= bp->rx_page_size;
+				skb->truesize -= rxr->rx_page_size;
 			}
 
 			--shinfo->nr_frags;
@@ -1323,7 +1323,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
 		}
 
 		page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
-						  bp->rx_page_size);
+						  rxr->rx_page_size);
 
 		total_frag_len += frag_len;
 		prod = NEXT_RX_AGG(prod);
@@ -2276,8 +2276,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 			if (!skb)
 				goto oom_next_rx;
 		} else {
-			skb = bnxt_xdp_build_skb(bp, skb, agg_bufs,
-						 rxr->page_pool, &xdp);
+			skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr, &xdp);
 			if (!skb) {
 				/* we should be able to free the old skb here */
 				bnxt_xdp_buff_frags_free(rxr, &xdp);
@@ -3825,7 +3824,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	if (BNXT_RX_PAGE_MODE(bp))
 		pp.pool_size += bp->rx_ring_size / rx_size_fac;
 
-	pp.order = get_order(bp->rx_page_size);
+	pp.order = get_order(rxr->rx_page_size);
 	pp.nid = numa_node;
 	pp.netdev = bp->dev;
 	pp.dev = &bp->pdev->dev;
@@ -4318,6 +4317,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
 		if (!rxr)
 			goto skip_rx;
 
+		rxr->rx_page_size = bp->rx_page_size;
+
 		ring = &rxr->rx_ring_struct;
 		rmem = &ring->ring_mem;
 		rmem->nr_pages = bp->rx_nr_pages;
@@ -4477,7 +4478,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
 	ring = &rxr->rx_agg_ring_struct;
 	ring->fw_ring_id = INVALID_HW_RING_ID;
 	if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
-		type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
+		type = ((u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
 			RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
 
 		bnxt_init_rxbd_pages(ring, type);
@@ -7042,6 +7043,7 @@ static void bnxt_hwrm_ring_grp_free(struct bnxt *bp)
 
 static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
 				       struct hwrm_ring_alloc_input *req,
+				       struct bnxt_rx_ring_info *rxr,
 				       struct bnxt_ring_struct *ring)
 {
 	struct bnxt_ring_grp_info *grp_info = &bp->grp_info[ring->grp_idx];
@@ -7051,7 +7053,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
 	if (ring_type == HWRM_RING_ALLOC_AGG) {
 		req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
 		req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
-		req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
+		req->rx_buf_size = cpu_to_le16(rxr->rx_page_size);
 		enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
 	} else {
 		req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -7065,6 +7067,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
 }
 
 static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
+				    struct bnxt_rx_ring_info *rxr,
 				    struct bnxt_ring_struct *ring,
 				    u32 ring_type, u32 map_index)
 {
@@ -7121,7 +7124,8 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
 			      cpu_to_le32(bp->rx_ring_mask + 1) :
 			      cpu_to_le32(bp->rx_agg_ring_mask + 1);
 		if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
-			bnxt_set_rx_ring_params_p5(bp, ring_type, req, ring);
+			bnxt_set_rx_ring_params_p5(bp, ring_type, req,
+						   rxr, ring);
 		break;
 	case HWRM_RING_ALLOC_CMPL:
 		req->ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
@@ -7269,7 +7273,7 @@ static int bnxt_hwrm_rx_ring_alloc(struct bnxt *bp,
 	u32 map_idx = bnapi->index;
 	int rc;
 
-	rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+	rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
 	if (rc)
 		return rc;
 
@@ -7289,7 +7293,7 @@ static int bnxt_hwrm_rx_agg_ring_alloc(struct bnxt *bp,
 	int rc;
 
 	map_idx = grp_idx + bp->rx_nr_rings;
-	rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+	rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
 	if (rc)
 		return rc;
 
@@ -7313,7 +7317,7 @@ static int bnxt_hwrm_cp_ring_alloc_p5(struct bnxt *bp,
 
 	ring = &cpr->cp_ring_struct;
 	ring->handle = BNXT_SET_NQ_HDL(cpr);
-	rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+	rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
 	if (rc)
 		return rc;
 	bnxt_set_db(bp, &cpr->cp_db, type, map_idx, ring->fw_ring_id);
@@ -7328,7 +7332,7 @@ static int bnxt_hwrm_tx_ring_alloc(struct bnxt *bp,
 	const u32 type = HWRM_RING_ALLOC_TX;
 	int rc;
 
-	rc = hwrm_ring_alloc_send_msg(bp, ring, type, tx_idx);
+	rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, tx_idx);
 	if (rc)
 		return rc;
 	bnxt_set_db(bp, &txr->tx_db, type, tx_idx, ring->fw_ring_id);
@@ -7354,7 +7358,7 @@ static int bnxt_hwrm_ring_alloc(struct bnxt *bp)
 
 		vector = bp->irq_tbl[map_idx].vector;
 		disable_irq_nosync(vector);
-		rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+		rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
 		if (rc) {
 			enable_irq(vector);
 			goto err_out;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 56aafae568f8..4f9d4c71c0e2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
 
 	unsigned long		*rx_agg_bmap;
 	u16			rx_agg_bmap_size;
+	u16			rx_page_size;
 	bool                    need_head_pool;
 
 	dma_addr_t		rx_desc_mapping[MAX_RX_PAGES];
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 41d3ba56ba41..19dda0201c69 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
 			u16 cons, u8 *data_ptr, unsigned int len,
 			struct xdp_buff *xdp)
 {
-	u32 buflen = bp->rx_page_size;
+	u32 buflen = rxr->rx_page_size;
 	struct bnxt_sw_rx_bd *rx_buf;
 	struct pci_dev *pdev;
 	dma_addr_t mapping;
@@ -461,7 +461,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 
 struct sk_buff *
 bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
-		   struct page_pool *pool, struct xdp_buff *xdp)
+		   struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
 
 	xdp_update_skb_shared_info(skb, num_frags,
 				   sinfo->xdp_frags_size,
-				   bp->rx_page_size * num_frags,
+				   rxr->rx_page_size * num_frags,
 				   xdp_buff_is_frag_pfmemalloc(xdp));
 	return skb;
 }
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
index 220285e190fc..8933a0dec09a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
@@ -32,6 +32,6 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
 void bnxt_xdp_buff_frags_free(struct bnxt_rx_ring_info *rxr,
 			      struct xdp_buff *xdp);
 struct sk_buff *bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb,
-				   u8 num_frags, struct page_pool *pool,
+				   u8 num_frags, struct bnxt_rx_ring_info *rxr,
 				   struct xdp_buff *xdp);
 #endif
-- 
2.49.0



* [PATCH net-next v3 17/23] eth: bnxt: adjust the fill level of agg queues with larger buffers
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (15 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

The driver tries to provision more agg buffers than header buffers
since multiple agg segments can reuse the same header. The calculation
/ heuristic tries to provide enough pages for 65k of data for each header
(or 4 frags per header if the result is too big). This calculation is
currently global to the adapter. If we increase the buffer sizes 8x,
we don't want 8x the amount of memory sitting on the rings.
Luckily we don't have to fill the rings completely; adjust
the fill level dynamically in case a particular queue has buffers
larger than the global size.
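
To make the math concrete (ring size value made up): with an
adapter-wide rx_page_size of 4K, an agg ring size of 2048 entries,
and one queue overridden to 32K buffers, that queue's fill level
becomes 2048 / (32768 / 4096) = 256 entries, i.e. both rings pin
roughly the same 8MB. The page pool size is additionally scaled
down by agg_size_fac in the same way.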

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: rebase on top of agg_size_fac, assert agg_size_fac]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 50f663777843..b47b95631a33 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3811,16 +3811,34 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
 	}
 }
 
+static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
+				       struct bnxt_rx_ring_info *rxr)
+{
+	/* User may have chosen larger than default rx_page_size,
+	 * we keep the ring sizes uniform and also want uniform amount
+	 * of bytes consumed per ring, so cap how much of the rings we fill.
+	 */
+	int fill_level = bp->rx_agg_ring_size;
+
+	if (rxr->rx_page_size > bp->rx_page_size)
+		fill_level /= rxr->rx_page_size / bp->rx_page_size;
+
+	return fill_level;
+}
+
 static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 				   struct bnxt_rx_ring_info *rxr,
 				   int numa_node)
 {
-	const unsigned int agg_size_fac = PAGE_SIZE / BNXT_RX_PAGE_SIZE;
+	unsigned int agg_size_fac = rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
 	const unsigned int rx_size_fac = PAGE_SIZE / SZ_4K;
 	struct page_pool_params pp = { 0 };
 	struct page_pool *pool;
 
-	pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
+	if (WARN_ON_ONCE(agg_size_fac == 0))
+		agg_size_fac = 1;
+
+	pp.pool_size = bnxt_rx_agg_ring_fill_level(bp, rxr) / agg_size_fac;
 	if (BNXT_RX_PAGE_MODE(bp))
 		pp.pool_size += bp->rx_ring_size / rx_size_fac;
 
@@ -4396,11 +4414,13 @@ static void bnxt_alloc_one_rx_ring_netmem(struct bnxt *bp,
 					  struct bnxt_rx_ring_info *rxr,
 					  int ring_nr)
 {
+	int fill_level, i;
 	u32 prod;
-	int i;
+
+	fill_level = bnxt_rx_agg_ring_fill_level(bp, rxr);
 
 	prod = rxr->rx_agg_prod;
-	for (i = 0; i < bp->rx_agg_ring_size; i++) {
+	for (i = 0; i < fill_level; i++) {
 		if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
 			netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
 				    ring_nr, i, bp->rx_ring_size);
-- 
2.49.0



* [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (16 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 17/23] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 22:36   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 19/23] net: wipe the setting of deactived queues Pavel Begunkov
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Zero-copy APIs increase the cost of buffer management. They also extend
this cost to user space applications, which may be used to dealing with
much larger buffers. Allow setting rx-buf-len per queue; devices with
HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1,
but that's not a power of 2).

The implementation adds a new option to the netdev netlink, rather
than to ethtool. The NIC-wide setting lives in ethtool ringparams, so
one could argue that we should be extending the ethtool API.
OTOH the netdev API is where we already have queue-get, and it's how
zero-copy applications bind memory providers.
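
For example, the new op can be driven with the YNL CLI along these
lines (illustrative invocation; ifindex and size are arbitrary):

  $ ./tools/net/ynl/cli.py \
        --spec Documentation/netlink/specs/netdev.yaml \
        --do queue-set \
        --json '{"ifindex": 2, "type": "rx", "id": 0, "rx-buf-len": 32768}'

The current value can be read back via the existing queue-get op,
which now also reports rx-buf-len when set.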

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/netlink/specs/netdev.yaml | 15 ++++
 include/net/netdev_queues.h             |  5 ++
 include/net/netlink.h                   | 19 +++++
 include/uapi/linux/netdev.h             |  2 +
 net/core/netdev-genl-gen.c              | 15 ++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 92 +++++++++++++++++++++++++
 net/core/netdev_config.c                | 16 +++++
 tools/include/uapi/linux/netdev.h       |  2 +
 9 files changed, 167 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c035dc0f64fd..498c4bcafdbd 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -338,6 +338,10 @@ attribute-sets:
         doc: XSK information for this queue, if any.
         type: nest
         nested-attributes: xsk-info
+      -
+        name: rx-buf-len
+        doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
+        type: u32
   -
     name: qstats
     doc: |
@@ -771,6 +775,17 @@ operations:
         reply:
           attributes:
             - id
+    -
+      name: queue-set
+      doc: Set per-queue configurable options.
+      attribute-set: queue
+      do:
+        request:
+          attributes:
+            - ifindex
+            - type
+            - id
+            - rx-buf-len
 
 kernel-family:
   headers: ["net/netdev_netlink.h"]
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index d0cc475ec51e..b69b1d519dcb 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -39,6 +39,7 @@ struct netdev_config {
 
 /* Same semantics as fields in struct netdev_config */
 struct netdev_queue_config {
+	u32	rx_buf_len;
 };
 
 /* See the netdev.yaml spec for definition of each statistic */
@@ -141,6 +142,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
 /**
  * struct netdev_queue_mgmt_ops - netdev ops for queue management
  *
+ * @supported_ring_params: ring params supported per queue (ETHTOOL_RING_USE_*).
+ *
  * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
  *
  * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
@@ -171,6 +174,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  * be called for an interface which is open.
  */
 struct netdev_queue_mgmt_ops {
+	u32     supported_ring_params;
+
 	size_t	ndo_queue_mem_size;
 	void	(*ndo_queue_cfg_defaults)(struct net_device *dev,
 					  int idx,
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 1a8356ca4b78..29989ad81ddd 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -2200,6 +2200,25 @@ static inline struct nla_bitfield32 nla_get_bitfield32(const struct nlattr *nla)
 	return tmp;
 }
 
+/**
+ * nla_update_u32() - update u32 value from NLA_U32 attribute
+ * @dst:  value to update
+ * @attr: netlink attribute with new value or null
+ *
+ * Copy the u32 value from NLA_U32 netlink attribute @attr into variable
+ * pointed to by @dst; do nothing if @attr is null.
+ *
+ * Return: true if this function changed the value of @dst, otherwise false.
+ */
+static inline bool nla_update_u32(u32 *dst, const struct nlattr *attr)
+{
+	u32 old_val = *dst;
+
+	if (attr)
+		*dst = nla_get_u32(attr);
+	return *dst != old_val;
+}
+
 /**
  * nla_memdup - duplicate attribute memory (kmemdup)
  * @src: netlink attribute to duplicate from
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 48eb49aa03d4..820f89b67a72 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -158,6 +158,7 @@ enum {
 	NETDEV_A_QUEUE_DMABUF,
 	NETDEV_A_QUEUE_IO_URING,
 	NETDEV_A_QUEUE_XSK,
+	NETDEV_A_QUEUE_RX_BUF_LEN,
 
 	__NETDEV_A_QUEUE_MAX,
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -226,6 +227,7 @@ enum {
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
 	NETDEV_CMD_BIND_TX,
+	NETDEV_CMD_QUEUE_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index e9a2a6f26cb7..d053306a3af8 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -106,6 +106,14 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
 	[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
 };
 
+/* NETDEV_CMD_QUEUE_SET - do */
+static const struct nla_policy netdev_queue_set_nl_policy[NETDEV_A_QUEUE_RX_BUF_LEN + 1] = {
+	[NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+	[NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
+	[NETDEV_A_QUEUE_ID] = { .type = NLA_U32, },
+	[NETDEV_A_QUEUE_RX_BUF_LEN] = { .type = NLA_U32, },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -204,6 +212,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_DMABUF_FD,
 		.flags		= GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= NETDEV_CMD_QUEUE_SET,
+		.doit		= netdev_nl_queue_set_doit,
+		.policy		= netdev_queue_set_nl_policy,
+		.maxattr	= NETDEV_A_QUEUE_RX_BUF_LEN,
+		.flags		= GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index cf3fad74511f..b7f5e5d9fca9 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 6314eb7bdf69..abb128e45fcf 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -386,6 +386,30 @@ static int nla_put_napi_id(struct sk_buff *skb, const struct napi_struct *napi)
 	return 0;
 }
 
+static int
+netdev_nl_queue_fill_cfg(struct sk_buff *rsp, struct net_device *netdev,
+			 u32 q_idx, u32 q_type)
+{
+	struct netdev_queue_config *qcfg;
+
+	if (!netdev_need_ops_lock(netdev))
+		return 0;
+
+	qcfg = &netdev->cfg->qcfg[q_idx];
+	switch (q_type) {
+	case NETDEV_QUEUE_TYPE_RX:
+		if (qcfg->rx_buf_len &&
+		    nla_put_u32(rsp, NETDEV_A_QUEUE_RX_BUF_LEN,
+				qcfg->rx_buf_len))
+			return -EMSGSIZE;
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
@@ -433,6 +457,9 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 		break;
 	}
 
+	if (netdev_nl_queue_fill_cfg(rsp, netdev, q_idx, q_type))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
@@ -572,6 +599,71 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr * const *tb = info->attrs;
+	struct netdev_queue_config *qcfg;
+	u32 q_id, q_type, ifindex;
+	struct net_device *netdev;
+	bool mod;
+	int ret;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_TYPE) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_IFINDEX))
+		return -EINVAL;
+
+	q_id = nla_get_u32(tb[NETDEV_A_QUEUE_ID]);
+	q_type = nla_get_u32(tb[NETDEV_A_QUEUE_TYPE]);
+	ifindex = nla_get_u32(tb[NETDEV_A_QUEUE_IFINDEX]);
+
+	if (q_type != NETDEV_QUEUE_TYPE_RX) {
+		/* Only Rx params exist right now */
+		NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_TYPE]);
+		return -EINVAL;
+	}
+
+	ret = 0;
+	netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex);
+	if (!netdev || !netif_device_present(netdev))
+		ret = -ENODEV;
+	else if (!netdev->queue_mgmt_ops)
+		ret = -EOPNOTSUPP;
+	if (ret) {
+		NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_IFINDEX]);
+		goto exit_unlock;
+	}
+
+	ret = netdev_nl_queue_validate(netdev, q_id, q_type);
+	if (ret) {
+		NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_ID]);
+		goto exit_unlock;
+	}
+
+	ret = netdev_reconfig_start(netdev);
+	if (ret)
+		goto exit_unlock;
+
+	qcfg = &netdev->cfg_pending->qcfg[q_id];
+	mod = nla_update_u32(&qcfg->rx_buf_len, tb[NETDEV_A_QUEUE_RX_BUF_LEN]);
+	if (!mod)
+		goto exit_free_cfg;
+
+	ret = netdev_rx_queue_restart(netdev, q_id, info->extack);
+	if (ret)
+		goto exit_free_cfg;
+
+	swap(netdev->cfg, netdev->cfg_pending);
+
+exit_free_cfg:
+	__netdev_free_config(netdev->cfg_pending);
+	netdev->cfg_pending = netdev->cfg;
+exit_unlock:
+	if (netdev)
+		netdev_unlock(netdev);
+	return ret;
+}
+
 #define NETDEV_STAT_NOT_SET		(~0ULL)
 
 static void netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index fc700b77e4eb..ede02b77470e 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -67,11 +67,27 @@ int netdev_reconfig_start(struct net_device *dev)
 void __netdev_queue_config(struct net_device *dev, int rxq,
 			   struct netdev_queue_config *qcfg, bool pending)
 {
+	const struct netdev_config *cfg;
+
+	cfg = pending ? dev->cfg_pending : dev->cfg;
+
 	memset(qcfg, 0, sizeof(*qcfg));
 
 	/* Get defaults from the driver, in case user config not set */
 	if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
 		dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+
+	/* Set config based on device-level settings */
+	if (cfg->rx_buf_len)
+		qcfg->rx_buf_len = cfg->rx_buf_len;
+
+	/* Set config dedicated to this queue */
+	if (rxq >= 0) {
+		const struct netdev_queue_config *user_cfg = &cfg->qcfg[rxq];
+
+		if (user_cfg->rx_buf_len)
+			qcfg->rx_buf_len = user_cfg->rx_buf_len;
+	}
 }
 
 /**
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 48eb49aa03d4..820f89b67a72 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -158,6 +158,7 @@ enum {
 	NETDEV_A_QUEUE_DMABUF,
 	NETDEV_A_QUEUE_IO_URING,
 	NETDEV_A_QUEUE_XSK,
+	NETDEV_A_QUEUE_RX_BUF_LEN,
 
 	__NETDEV_A_QUEUE_MAX,
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -226,6 +227,7 @@ enum {
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
 	NETDEV_CMD_BIND_TX,
+	NETDEV_CMD_QUEUE_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.49.0



* [PATCH net-next v3 19/23] net: wipe the setting of deactived queues
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (17 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-19 22:49   ` Mina Almasry
  2025-08-18 13:57 ` [PATCH net-next v3 20/23] eth: bnxt: use queue op config validate Pavel Begunkov
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Clear out all settings of deactivated queues when the user changes
the number of channels. We already perform similar cleanup
for shapers.
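
For instance, if queue 3 carries an rx-buf-len override and the user
shrinks the channel count (e.g. "ethtool -L eth0 combined 2"), the
override is wiped together with the queue; growing the count back
later starts the queue from defaults rather than a stale setting.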

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/dev.c           |  5 +++++
 net/core/dev.h           |  2 ++
 net/core/netdev_config.c | 13 +++++++++++++
 3 files changed, 20 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7cd4e5eab441..457ba1d111e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3188,6 +3188,8 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 		if (dev->num_tc)
 			netif_setup_tc(dev, txq);
 
+		netdev_queue_config_update_cnt(dev, txq,
+					       dev->real_num_rx_queues);
 		net_shaper_set_real_num_tx_queues(dev, txq);
 
 		dev_qdisc_change_real_num_tx(dev, txq);
@@ -3233,6 +3235,9 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
 						  rxq);
 		if (rc)
 			return rc;
+
+		netdev_queue_config_update_cnt(dev, dev->real_num_tx_queues,
+					       rxq);
 	}
 
 	dev->real_num_rx_queues = rxq;
diff --git a/net/core/dev.h b/net/core/dev.h
index 523d50e6f88d..c1cc54e38fe4 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -101,6 +101,8 @@ void __netdev_queue_config(struct net_device *dev, int rxq,
 			   struct netdev_queue_config *qcfg, bool pending);
 int netdev_queue_config_revalidate(struct net_device *dev,
 				   struct netlink_ext_ack *extack);
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+				    unsigned int rxq);
 
 /* netdev management, shared between various uAPI entry points */
 struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index ede02b77470e..c5ae39e76f40 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -64,6 +64,19 @@ int netdev_reconfig_start(struct net_device *dev)
 	return -ENOMEM;
 }
 
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+				    unsigned int rxq)
+{
+	size_t len;
+
+	if (rxq < dev->real_num_rx_queues) {
+		len = (dev->real_num_rx_queues - rxq) * sizeof(*dev->cfg->qcfg);
+
+		memset(&dev->cfg->qcfg[rxq], 0, len);
+		memset(&dev->cfg_pending->qcfg[rxq], 0, len);
+	}
+}
+
 void __netdev_queue_config(struct net_device *dev, int rxq,
 			   struct netdev_queue_config *qcfg, bool pending)
 {
-- 
2.49.0



* [PATCH net-next v3 20/23] eth: bnxt: use queue op config validate
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (18 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 19/23] net: wipe the setting of deactived queues Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 21/23] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Move the rx-buf-len config validation to the queue ops.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 40 +++++++++++++++++++
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ------
 2 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b47b95631a33..b02205f1f010 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16147,8 +16147,46 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
 	return 0;
 }
 
+static int
+bnxt_queue_cfg_validate(struct net_device *dev, int idx,
+			struct netdev_queue_config *qcfg,
+			struct netlink_ext_ack *extack)
+{
+	struct bnxt *bp = netdev_priv(dev);
+
+	/* Older chips need MSS calc so rx_buf_len is not supported,
+	 * but we don't set queue ops for them so we should never get here.
+	 */
+	if (qcfg->rx_buf_len != bp->rx_page_size &&
+	    !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+		NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(qcfg->rx_buf_len)) {
+		NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
+		return -ERANGE;
+	}
+	if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+	    qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+		NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
+		return -ERANGE;
+	}
+	return 0;
+}
+
+static void
+bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
+			struct netdev_queue_config *qcfg)
+{
+	qcfg->rx_buf_len	= BNXT_RX_PAGE_SIZE;
+}
+
 static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
 	.ndo_queue_mem_size	= sizeof(struct bnxt_rx_ring_info),
+
+	.ndo_queue_cfg_defaults	= bnxt_queue_cfg_defaults,
+	.ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
 	.ndo_queue_mem_alloc	= bnxt_queue_mem_alloc,
 	.ndo_queue_mem_free	= bnxt_queue_mem_free,
 	.ndo_queue_start	= bnxt_queue_start,
@@ -16156,6 +16194,8 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
 };
 
 static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+	.ndo_queue_cfg_defaults	= bnxt_queue_cfg_defaults,
+	.ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
 };
 
 static void bnxt_remove_one(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 2e130eeeabe5..65b8eabdcd24 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -867,18 +867,6 @@ static int bnxt_set_ringparam(struct net_device *dev,
 	if (!kernel_ering->rx_buf_len)	/* Zero means restore default */
 		kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
 
-	if (kernel_ering->rx_buf_len != bp->rx_page_size &&
-	    !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
-		NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
-		return -EINVAL;
-	}
-	if (!is_power_of_2(kernel_ering->rx_buf_len) ||
-	    kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
-	    kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
-		NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
-		return -ERANGE;
-	}
-
 	if (netif_running(dev))
 		bnxt_close_nic(bp, false, false);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH net-next v3 21/23] eth: bnxt: support per queue configuration of rx-buf-len
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (19 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 20/23] eth: bnxt: use queue op config validate Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 22/23] net: let pp memory provider to specify rx buf len Pavel Begunkov
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

From: Jakub Kicinski <kuba@kernel.org>

Now that rx_buf_len is stored and validated per queue, allow it to
be set differently for different queues. Instead of copying the
device-level setting for each queue, ask the core for the config
via netdev_queue_config().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b02205f1f010..5490f956f577 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4313,6 +4313,7 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
 
 	for (i = 0; i < bp->cp_nr_rings; i++) {
 		struct bnxt_napi *bnapi = bp->bnapi[i];
+		struct netdev_queue_config qcfg;
 		struct bnxt_ring_mem_info *rmem;
 		struct bnxt_cp_ring_info *cpr;
 		struct bnxt_rx_ring_info *rxr;
@@ -4335,7 +4336,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
 		if (!rxr)
 			goto skip_rx;
 
-		rxr->rx_page_size = bp->rx_page_size;
+		netdev_queue_config(bp->dev, i,	&qcfg);
+		rxr->rx_page_size = qcfg.rx_buf_len;
 
 		ring = &rxr->rx_ring_struct;
 		rmem = &ring->ring_mem;
@@ -15870,6 +15872,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev,
 	clone->rx_agg_prod = 0;
 	clone->rx_sw_agg_prod = 0;
 	clone->rx_next_cons = 0;
+	clone->rx_page_size = qcfg->rx_buf_len;
 	clone->need_head_pool = false;
 
 	rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
@@ -15976,6 +15979,8 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
 	src_ring = &src->rx_ring_struct;
 	src_rmem = &src_ring->ring_mem;
 
+	dst->rx_page_size = src->rx_page_size;
+
 	WARN_ON(dst_rmem->nr_pages != src_rmem->nr_pages);
 	WARN_ON(dst_rmem->page_size != src_rmem->page_size);
 	WARN_ON(dst_rmem->flags != src_rmem->flags);
@@ -16183,6 +16188,7 @@ bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
 }
 
 static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
+	.supported_ring_params	= ETHTOOL_RING_USE_RX_BUF_LEN,
 	.ndo_queue_mem_size	= sizeof(struct bnxt_rx_ring_info),
 
 	.ndo_queue_cfg_defaults	= bnxt_queue_cfg_defaults,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH net-next v3 22/23] net: let pp memory provider to specify rx buf len
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (20 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 21/23] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:57 ` [PATCH net-next v3 23/23] net: validate driver supports passed qcfg params Pavel Begunkov
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

Allow memory providers to configure rx queues with a specific receive
buffer length. Pass it in struct pp_memory_provider_params, which is
copied into the queue, and make __netdev_queue_config() check whether
it's set and apply it to the configuration. This way the configured
length will persist across queue restarts, and will be automatically
removed once the memory provider is detached.
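
A minimal sketch of the consumer side, assuming a simplified binding
flow (mp_bind_example is made up for illustration; the real zcrx/devmem
binding paths do more setup and locking, and extack is omitted):

```c
static int mp_bind_example(struct net_device *dev, unsigned int rxq_idx,
			   const struct memory_provider_ops *ops, void *priv)
{
	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);

	rxq->mp_params.mp_ops = ops;
	rxq->mp_params.mp_priv = priv;
	rxq->mp_params.rx_buf_len = SZ_32K;	/* provider-chosen length */

	/* __netdev_queue_config() now prefers mp_params.rx_buf_len, so
	 * the restarted queue comes up with 32K buffers; detaching the
	 * provider and restarting again reverts to the regular config */
	return netdev_rx_queue_restart(dev, rxq_idx, NULL);
}
```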

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/page_pool/types.h |  1 +
 net/core/netdev_config.c      | 15 +++++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 1509a536cb85..be74e4aec7b5 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -161,6 +161,7 @@ struct memory_provider_ops;
 struct pp_memory_provider_params {
 	void *mp_priv;
 	const struct memory_provider_ops *mp_ops;
+	u32 rx_buf_len;
 };
 
 struct page_pool {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index c5ae39e76f40..2c9b06f94e01 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -2,6 +2,7 @@
 
 #include <linux/netdevice.h>
 #include <net/netdev_queues.h>
+#include <net/netdev_rx_queue.h>
 
 #include "dev.h"
 
@@ -77,7 +78,7 @@ void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
 	}
 }
 
-void __netdev_queue_config(struct net_device *dev, int rxq,
+void __netdev_queue_config(struct net_device *dev, int rxq_idx,
 			   struct netdev_queue_config *qcfg, bool pending)
 {
 	const struct netdev_config *cfg;
@@ -88,18 +89,24 @@ void __netdev_queue_config(struct net_device *dev, int rxq,
 
 	/* Get defaults from the driver, in case user config not set */
 	if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
-		dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+		dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq_idx, qcfg);
 
 	/* Set config based on device-level settings */
 	if (cfg->rx_buf_len)
 		qcfg->rx_buf_len = cfg->rx_buf_len;
 
 	/* Set config dedicated to this queue */
-	if (rxq >= 0) {
-		const struct netdev_queue_config *user_cfg = &cfg->qcfg[rxq];
+	if (rxq_idx >= 0) {
+		const struct netdev_queue_config *user_cfg;
+		struct netdev_rx_queue *rxq;
 
+		user_cfg = &cfg->qcfg[rxq_idx];
 		if (user_cfg->rx_buf_len)
 			qcfg->rx_buf_len = user_cfg->rx_buf_len;
+
+		rxq = __netif_get_rx_queue(dev, rxq_idx);
+		if (rxq->mp_params.mp_ops && rxq->mp_params.rx_buf_len)
+			qcfg->rx_buf_len = rxq->mp_params.rx_buf_len;
 	}
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH net-next v3 23/23] net: validate driver supports passed qcfg params
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (21 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 22/23] net: let pp memory provider to specify rx buf len Pavel Begunkov
@ 2025-08-18 13:57 ` Pavel Begunkov
  2025-08-18 13:59 ` [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
  2025-08-20  2:31 ` Jakub Kicinski
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:57 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: asml.silence, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

When we pass a qcfg to a driver, make sure the driver supports the
parameters that are set by checking them against
->supported_ring_params.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/dev.h             |  3 +++
 net/core/netdev_config.c   | 26 ++++++++++++++++++++++++++
 net/core/netdev_rx_queue.c |  8 +++-----
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.h b/net/core/dev.h
index c1cc54e38fe4..c53b8045d685 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -103,6 +103,9 @@ int netdev_queue_config_revalidate(struct net_device *dev,
 				   struct netlink_ext_ack *extack);
 void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
 				    unsigned int rxq);
+int netdev_queue_config_validate(struct net_device *dev, int rxq_idx,
+				  struct netdev_queue_config *qcfg,
+				  struct netlink_ext_ack *extack);
 
 /* netdev management, shared between various uAPI entry points */
 struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index 2c9b06f94e01..ffe997893cd1 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/netdevice.h>
+#include <linux/ethtool.h>
 #include <net/netdev_queues.h>
 #include <net/netdev_rx_queue.h>
 
@@ -136,6 +137,31 @@ void netdev_queue_config(struct net_device *dev, int rxq,
 }
 EXPORT_SYMBOL(netdev_queue_config);
 
+int netdev_queue_config_validate(struct net_device *dev, int rxq_idx,
+				  struct netdev_queue_config *qcfg,
+				  struct netlink_ext_ack *extack)
+{
+	const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+	int err;
+
+	if (WARN_ON_ONCE(!qops))
+		return -EINVAL;
+
+	if (!(qops->supported_ring_params & ETHTOOL_RING_USE_RX_BUF_LEN) &&
+	    qcfg->rx_buf_len &&
+	    qcfg->rx_buf_len != dev->cfg_pending->rx_buf_len) {
+		NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+		return -EINVAL;
+	}
+
+	if (qops->ndo_queue_cfg_validate) {
+		err = qops->ndo_queue_cfg_validate(dev, rxq_idx, qcfg, extack);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
 int netdev_queue_config_revalidate(struct net_device *dev,
 				   struct netlink_ext_ack *extack)
 {
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 39834b196e95..d583a9ead9c4 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -37,11 +37,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
 
 	netdev_queue_config(dev, rxq_idx, &qcfg);
 
-	if (qops->ndo_queue_cfg_validate) {
-		err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
-		if (err)
-			goto err_free_old_mem;
-	}
+	err = netdev_queue_config_validate(dev, rxq_idx, &qcfg, extack);
+	if (err)
+		goto err_free_old_mem;
 
 	err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
 	if (err)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (22 preceding siblings ...)
  2025-08-18 13:57 ` [PATCH net-next v3 23/23] net: validate driver supports passed qcfg params Pavel Begunkov
@ 2025-08-18 13:59 ` Pavel Begunkov
  2025-08-20  2:31 ` Jakub Kicinski
  24 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-18 13:59 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: Eric Dumazet, Willem de Bruijn, Paolo Abeni, andrew+netdev, horms,
	davem, sdf, almasrymina, dw, michael.chan, dtatulea, ap420073,
	linux-kernel, io-uring

On 8/18/25 14:57, Pavel Begunkov wrote:
> Pull request with netdev only patches that add support for per queue
> configuration and large rx buffers for memory providers. The zcrx
> patch using it is separately and can be found at [2].

I'm sending it out as a v6.17-rc2 based pull request since I'll also
need it in another tree for zcrx. The patch count is over the limit;
however, most of the patches are just taken from Jakub's series, and
it'll likely be easier this way for cross-tree work. Please let me
know if that's acceptable or whether I need to somehow split or trim
it down.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order
  2025-08-18 13:57 ` [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order Pavel Begunkov
@ 2025-08-18 23:33   ` Mina Almasry
  2025-08-19 15:53     ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-18 23:33 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> We're going to give more control over rx buffer sizes to user space, and
> since we can't always rely on driver validation, let's sanitise it in
> page_pool_init() as well. Note that we only need to reject over
> MAX_PAGE_ORDER allocations for normal page pools, as current memory
> providers don't need to use the buddy allocator and must check the order
> on init.
>
> Suggested-by: Stanislav Fomichev <stfomichev@gmail.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

I think I noticed an unrelated bug in this code and we need this fix?

```
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 343a6cac21e3..ba70569bd4b0 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -287,8 +287,10 @@ static int page_pool_init(struct page_pool *pool,
        }

        if (pool->mp_ops) {
-               if (!pool->dma_map || !pool->dma_sync)
-                       return -EOPNOTSUPP;
+               if (!pool->dma_map || !pool->dma_sync) {
+                       err = -EOPNOTSUPP;
+                       goto free_ptr_ring;
+               }

                if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
                        err = -EFAULT;
```

I'll send a separate fix.


--
Thanks,
Mina

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths
  2025-08-18 13:57 ` [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
@ 2025-08-18 23:50   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-18 23:50 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
>
> The various zero-copy TCP Rx schemes we have suffer from memory

nit: 'we have suffer' sounds weird, probably meant just 'suffer'.

> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len
  2025-08-18 13:57 ` [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len Pavel Begunkov
@ 2025-08-19  0:00   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19  0:00 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Unlike most of our APIs the rx-buf-len param does not have an associated
> max value. In theory user could set this value pretty high, but in
> practice most NICs have limits due to the width of the length fields
> in the descriptors.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

The other driver that supports rx_buf_len is hns3, I think, and there
you could probably set max_rx_buf_len to RX_BUF_LEN_4K? It only
supports 2K and 4K.

Other than that, looks good to me,

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default
  2025-08-18 13:57 ` [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default Pavel Begunkov
@ 2025-08-19  0:07   ` Mina Almasry
  2025-08-19 15:52     ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19  0:07 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Distinguish between rx_buf_len being driver default vs user config.
> Use 0 as a special value meaning "unset" or "restore driver default".
> This will be necessary later on to configure it per-queue, but
> the ability to restore defaults may be useful in itself.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

I wonder if it should be extended to the other driver using
rx_buf_len, hns3. For that, I think the default buf size would be
HNS3_DEFAULT_RX_BUF_LEN.

Other than that, seems fine to me,

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members
  2025-08-18 13:57 ` [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members Pavel Begunkov
@ 2025-08-19  1:46   ` Mina Almasry
  2025-08-20 12:04     ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19  1:46 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> hds_thresh and hds_config are both inside struct netdev_config
> but have quite different semantics. hds_config is the user config
> with ternary semantics (on/off/unset). hds_thresh is a straight
> up value, populated by the driver at init and only modified by
> user space. We don't expect the drivers to have to pick a special
> hds_thresh value based on other configuration.
>
> The two approaches have different advantages and downsides.
> hds_thresh ("direct value") gives core easy access to current
> device settings, but there's no way to express whether the value
> comes from the user. It also requires the initialization by
> the driver.
>
> hds_config ("user config values") tells us what user wanted, but
> doesn't give us the current value in the core.
>
> Try to explain this a bit in the comments, so at we make a conscious
> choice for new values which semantics we expect.
>
> Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
> Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
> returns a hds_thresh value always as 0.") added the setting for the
> benefit of netdevsim which doesn't touch the value at all on get.
> Again, this is just to clarify the intention, shouldn't cause any
> functional change.
>

TBH I can't say that moving the init to before
dev->ethtool_ops->get_ringparam(dev, param, kparam, extack) made me
understand the semantics better. If you do a respin, maybe a comment
above the kparam->hds_thresh assignment saying what you mean would help
the next reader understand.
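
Something like this, with approximate names (I haven't re-checked the
exact code in ethtool_ringparam_get_cfg()):

```c
/* seed from the core config before calling the driver, so drivers
 * that don't report hds_thresh on get (e.g. netdevsim) still return
 * the value stored in the core */
kparam->hds_thresh = dev->cfg->hds_thresh;

dev->ethtool_ops->get_ringparam(dev, param, kparam, extack);
```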

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [pavel: applied clarification on relationship b/w HDS thresh and config]
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default
  2025-08-19  0:07   ` Mina Almasry
@ 2025-08-19 15:52     ` Pavel Begunkov
  2025-08-19 19:27       ` Mina Almasry
  0 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-19 15:52 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 01:07, Mina Almasry wrote:
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> Distinguish between rx_buf_len being driver default vs user config.
>> Use 0 as a special value meaning "unset" or "restore driver default".
>> This will be necessary later on to configure it per-queue, but
>> the ability to restore defaults may be useful in itself.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> I wonder if it should be extended to the other driver using
> rx_buf_len, hns3. For that, I think the default buf size would be
> HNS3_DEFAULT_RX_BUF_LEN.

I'd rather avoid growing the series even more; let's follow up on
that in a separate patch on top, that should be just fine. And
thanks for the review.

> Other than that, seems fine to me,
> 
> Reviewed-by: Mina Almasry <almasrymina@google.com>

With the said above, do you want me to retain the review tag?

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order
  2025-08-18 23:33   ` Mina Almasry
@ 2025-08-19 15:53     ` Pavel Begunkov
  0 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-19 15:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 00:33, Mina Almasry wrote:
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> We're going to give more control over rx buffer sizes to user space, and
>> since we can't always rely on driver validation, let's sanitise it in
>> page_pool_init() as well. Note that we only need to reject over
>> MAX_PAGE_ORDER allocations for normal page pools, as current memory
>> providers don't need to use the buddy allocator and must check the order
>> on init.
>>
>> Suggested-by: Stanislav Fomichev <stfomichev@gmail.com>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> 
> I think I noticed an unrelated bug in this code and we need this fix?

Good catch

> 
> ```
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 343a6cac21e3..ba70569bd4b0 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -287,8 +287,10 @@ static int page_pool_init(struct page_pool *pool,
>          }
> 
>          if (pool->mp_ops) {
> -               if (!pool->dma_map || !pool->dma_sync)
> -                       return -EOPNOTSUPP;
> +               if (!pool->dma_map || !pool->dma_sync) {
> +                       err = -EOPNOTSUPP;
> +                       goto free_ptr_ring;
> +               }
> 
>                  if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
>                          err = -EFAULT;
> ```


-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default
  2025-08-19 15:52     ` Pavel Begunkov
@ 2025-08-19 19:27       ` Mina Almasry
  2025-08-20 11:53         ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 19:27 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Tue, Aug 19, 2025 at 8:51 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/19/25 01:07, Mina Almasry wrote:
> > On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> From: Jakub Kicinski <kuba@kernel.org>
> >>
> >> Distinguish between rx_buf_len being driver default vs user config.
> >> Use 0 as a special value meaning "unset" or "restore driver default".
> >> This will be necessary later on to configure it per-queue, but
> >> the ability to restore defaults may be useful in itself.
> >>
> >> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >
> > I wonder if it should be extended to the other driver using
> > rx_buf_len, hns3. For that, I think the default buf size would be
> > HNS3_DEFAULT_RX_BUF_LEN.
>
> I'd rather avoid growing the series even more, let's follow up on
> that in a separate patch on top, that should be just fine. And
> thanks for the review
>
> > Other than that, seems fine to me,
> >
> > Reviewed-by: Mina Almasry <almasrymina@google.com>
>
> With the said above, do you want me to retain the review tag?
>

I initially thought adding my reviewed-by would be fine, but on closer
look, doesn't this series break rx_buf_len setting for hns3? AFAICT,
in patch 3 you're adding a check to ethnl_set_rings where it'll be an
error if rx_buf_len > rx_buf_len_max, and I'm guessing that if the
driver never sets rx_buf_len_max, it'll be zero-initialized and that
check will always fail? Or did I miss something?
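
Roughly the pattern I'm worried about, with approximate names (I
haven't re-checked the exact patch 3 code):

```c
/* in ethnl_set_rings(), sketched from memory */
if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max) {
	/* hns3 never fills rx_buf_len_max, so it stays 0 and any
	 * non-zero rx_buf_len from the user would be rejected */
	return -EINVAL;
}
```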

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config
  2025-08-18 13:57 ` [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config Pavel Begunkov
@ 2025-08-19 19:32   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 19:32 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Add rx_buf_len to configuration maintained by the core.
> Use "three-state" semantics where 0 means "driver default".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct
  2025-08-18 13:57 ` [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
@ 2025-08-19 19:37   ` Mina Almasry
  2025-08-20 13:43     ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 19:37 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
> inside struct bnxt. This will allow configuring the page size
> at runtime in subsequent patches.
>
> The MSS size calculation for older chip continues to use the constant.
> I'm intending to support the configuration only on more recent HW,
> looks like on older chips setting this per queue won't work,
> and that's the ultimate goal.
>
> This patch should not change the current behavior as value
> read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

nit: AFAIU BNXT_RX_PAGE_SIZE should be unused after this? You could
delete the definition in bnxt.h if so.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size
  2025-08-18 13:57 ` [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
@ 2025-08-19 19:43   ` Mina Almasry
  2025-08-20 13:51     ` Pavel Begunkov
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 19:43 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> If user decides to increase the buffer size for agg ring
> we need to ask the page pool for higher order pages.
> There is no need to use larger pages for header frags,
> if user increase the size of agg ring buffers switch
> to separate header page automatically.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [pavel: calculate adjust max_len]
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 5307b33ea1c7..d3d9b72ef313 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -3824,11 +3824,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>         pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
>         if (BNXT_RX_PAGE_MODE(bp))
>                 pp.pool_size += bp->rx_ring_size / rx_size_fac;
> +
> +       pp.order = get_order(bp->rx_page_size);
>         pp.nid = numa_node;
>         pp.netdev = bp->dev;
>         pp.dev = &bp->pdev->dev;
>         pp.dma_dir = bp->rx_dir;
> -       pp.max_len = PAGE_SIZE;
> +       pp.max_len = PAGE_SIZE << pp.order;

nit: I assume this could be `pp.max_len = bp->rx_page_size;` if you
wanted, since bnxt is not actually using the full compound page in the
case that bp->rx_page_size is not a power of 2. Though it doesn't
matter much, either way:

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool
  2025-08-18 13:57 ` [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
@ 2025-08-19 20:10   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 20:10 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> bnxt seems to be able to aggregate data up to 32kB without any issue.
> The driver is already capable of doing this for systems with higher
> order pages. While for systems with 4k pages we historically preferred
> to stick to small buffers because they are easier to allocate, the
> zero-copy APIs remove the allocation problem. The ZC mem is
> pre-allocated and fixed size.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  3 ++-
>  .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 21 ++++++++++++++++++-
>  2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index ac841d02d7ad..56aafae568f8 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -758,7 +758,8 @@ struct nqe_cn {
>  #define BNXT_RX_PAGE_SHIFT PAGE_SHIFT
>  #endif
>
> -#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
> +#define BNXT_MAX_RX_PAGE_SIZE  (1 << 15)
> +#define BNXT_RX_PAGE_SIZE      (1 << BNXT_RX_PAGE_SHIFT)
>
>  #define BNXT_MAX_MTU           9500
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index 1b37612b1c01..2e130eeeabe5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -835,6 +835,8 @@ static void bnxt_get_ringparam(struct net_device *dev,
>         ering->rx_jumbo_pending = bp->rx_agg_ring_size;
>         ering->tx_pending = bp->tx_ring_size;
>
> +       kernel_ering->rx_buf_len_max = BNXT_MAX_RX_PAGE_SIZE;
> +       kernel_ering->rx_buf_len = bp->rx_page_size;
>         kernel_ering->hds_thresh_max = BNXT_HDS_THRESHOLD_MAX;
>  }
>
> @@ -862,6 +864,21 @@ static int bnxt_set_ringparam(struct net_device *dev,
>                 return -EINVAL;
>         }
>
> +       if (!kernel_ering->rx_buf_len)  /* Zero means restore default */
> +               kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
> +

I wonder if things should be refactored a bit so that not every driver
needs to do this special handling of 0, with the core doing it instead.
I notice patch 4 does the same thing for otx2. But this is fine too.

hns3 changes are missing, but assuming Jakub knows what he's doing
with hns3, the changes here look good to me.

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers
  2025-08-18 13:57 ` [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
@ 2025-08-19 20:15   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 20:15 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> netdev_config manipulation will become slightly more complicated
> soon and we will need to call if from ethtool as well as queue API.
> Encapsulate the logic into helper functions.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  net/core/Makefile        |  2 +-
>  net/core/dev.c           |  7 ++-----
>  net/core/dev.h           |  5 +++++
>  net/core/netdev_config.c | 43 ++++++++++++++++++++++++++++++++++++++++
>  net/ethtool/netlink.c    | 14 ++++++-------
>  5 files changed, 57 insertions(+), 14 deletions(-)
>  create mode 100644 net/core/netdev_config.c
>
> diff --git a/net/core/Makefile b/net/core/Makefile
> index b2a76ce33932..4db487396094 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -19,7 +19,7 @@ obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
>
>  obj-y += net-sysfs.o
>  obj-y += hotdata.o
> -obj-y += netdev_rx_queue.o
> +obj-y += netdev_config.o netdev_rx_queue.o
>  obj-$(CONFIG_PAGE_POOL) += page_pool.o page_pool_user.o
>  obj-$(CONFIG_PROC_FS) += net-procfs.o
>  obj-$(CONFIG_NET_PKTGEN) += pktgen.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a3c0f40a93f..7cd4e5eab441 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -11873,10 +11873,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>         if (!dev->ethtool)
>                 goto free_all;
>
> -       dev->cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> -       if (!dev->cfg)
> +       if (netdev_alloc_config(dev))
>                 goto free_all;
> -       dev->cfg_pending = dev->cfg;
>
>         dev->num_napi_configs = maxqs;
>         napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
> @@ -11947,8 +11945,7 @@ void free_netdev(struct net_device *dev)
>                 return;
>         }
>
> -       WARN_ON(dev->cfg != dev->cfg_pending);
> -       kfree(dev->cfg);
> +       netdev_free_config(dev);
>         kfree(dev->ethtool);
>         netif_free_tx_queues(dev);
>         netif_free_rx_queues(dev);
> diff --git a/net/core/dev.h b/net/core/dev.h
> index d6b08d435479..7041c8bd2a0f 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -92,6 +92,11 @@ extern struct rw_semaphore dev_addr_sem;
>  extern struct list_head net_todo_list;
>  void netdev_run_todo(void);
>
> +int netdev_alloc_config(struct net_device *dev);
> +void __netdev_free_config(struct netdev_config *cfg);
> +void netdev_free_config(struct net_device *dev);
> +int netdev_reconfig_start(struct net_device *dev);
> +
>  /* netdev management, shared between various uAPI entry points */
>  struct netdev_name_node {
>         struct hlist_node hlist;
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> new file mode 100644
> index 000000000000..270b7f10a192
> --- /dev/null
> +++ b/net/core/netdev_config.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/netdevice.h>
> +#include <net/netdev_queues.h>
> +
> +#include "dev.h"
> +
> +int netdev_alloc_config(struct net_device *dev)
> +{
> +       struct netdev_config *cfg;
> +
> +       cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> +       if (!cfg)
> +               return -ENOMEM;
> +
> +       dev->cfg = cfg;
> +       dev->cfg_pending = cfg;
> +       return 0;
> +}
> +
> +void __netdev_free_config(struct netdev_config *cfg)
> +{
> +       kfree(cfg);
> +}
> +
> +void netdev_free_config(struct net_device *dev)
> +{
> +       WARN_ON(dev->cfg != dev->cfg_pending);
> +       __netdev_free_config(dev->cfg);
> +}
> +
> +int netdev_reconfig_start(struct net_device *dev)
> +{
> +       struct netdev_config *cfg;
> +
> +       WARN_ON(dev->cfg != dev->cfg_pending);
> +       cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> +       if (!cfg)
> +               return -ENOMEM;
> +
> +       dev->cfg_pending = cfg;
> +       return 0;

There are a couple of small behavior changes in this code. (a) the
WARN_ON is new, and (b) this helper retains dev->cfg_pending on error
while the old code would clear it. But both seem fine to me, so,

Reviewed-by: Mina Almasry <almasrymina@google.com>

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API
  2025-08-18 13:57 ` [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
@ 2025-08-19 21:29   ` Mina Almasry
  2025-08-20  1:32     ` Mina Almasry
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 21:29 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Create an array of config structs to store per-queue config.
> Pass these structs in the queue API. Drivers can also retrieve
> the config for a single queue calling netdev_queue_config()
> directly.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [pavel: patch up mlx callbacks with unused qcfg]
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  8 ++-
>  drivers/net/ethernet/google/gve/gve_main.c    |  9 ++-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  9 +--
>  drivers/net/netdevsim/netdev.c                |  6 +-
>  include/net/netdev_queues.h                   | 19 ++++++
>  net/core/dev.h                                |  3 +
>  net/core/netdev_config.c                      | 58 +++++++++++++++++++
>  net/core/netdev_rx_queue.c                    | 11 +++-
>  8 files changed, 109 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index d3d9b72ef313..48ff6f024e07 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -15824,7 +15824,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
>         .get_base_stats         = bnxt_get_base_stats,
>  };
>
> -static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
> +static int bnxt_queue_mem_alloc(struct net_device *dev,
> +                               struct netdev_queue_config *qcfg,
> +                               void *qmem, int idx)
>  {
>         struct bnxt_rx_ring_info *rxr, *clone;
>         struct bnxt *bp = netdev_priv(dev);
> @@ -15992,7 +15994,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
>         dst->rx_agg_bmap = src->rx_agg_bmap;
>  }
>
> -static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
> +static int bnxt_queue_start(struct net_device *dev,
> +                           struct netdev_queue_config *qcfg,
> +                           void *qmem, int idx)
>  {
>         struct bnxt *bp = netdev_priv(dev);
>         struct bnxt_rx_ring_info *rxr, *clone;
> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> index 1f411d7c4373..f40edab616d8 100644
> --- a/drivers/net/ethernet/google/gve/gve_main.c
> +++ b/drivers/net/ethernet/google/gve/gve_main.c
> @@ -2580,8 +2580,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
>                 gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
>  }
>
> -static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
> -                                 int idx)
> +static int gve_rx_queue_mem_alloc(struct net_device *dev,
> +                                 struct netdev_queue_config *qcfg,
> +                                 void *per_q_mem, int idx)
>  {
>         struct gve_priv *priv = netdev_priv(dev);
>         struct gve_rx_alloc_rings_cfg cfg = {0};
> @@ -2602,7 +2603,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
>         return err;
>  }
>
> -static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
> +static int gve_rx_queue_start(struct net_device *dev,
> +                             struct netdev_queue_config *qcfg,
> +                             void *per_q_mem, int idx)
>  {
>         struct gve_priv *priv = netdev_priv(dev);
>         struct gve_rx_ring *gve_per_q_mem;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 21bb88c5d3dc..83264c17a4f7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -5541,8 +5541,9 @@ struct mlx5_qmgmt_data {
>         struct mlx5e_channel_param cparam;
>  };
>
> -static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq,
> -                                int queue_index)
> +static int mlx5e_queue_mem_alloc(struct net_device *dev,
> +                                struct netdev_queue_config *qcfg,
> +                                void *newq, int queue_index)
>  {
>         struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
>         struct mlx5e_priv *priv = netdev_priv(dev);
> @@ -5603,8 +5604,8 @@ static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
>         return 0;
>  }
>
> -static int mlx5e_queue_start(struct net_device *dev, void *newq,
> -                            int queue_index)
> +static int mlx5e_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
> +                            void *newq, int queue_index)
>  {
>         struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
>         struct mlx5e_priv *priv = netdev_priv(dev);
> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 0178219f0db5..985c3403ec57 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -733,7 +733,8 @@ struct nsim_queue_mem {
>  };
>
>  static int
> -nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
> +nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
> +                    void *per_queue_mem, int idx)
>  {
>         struct nsim_queue_mem *qmem = per_queue_mem;
>         struct netdevsim *ns = netdev_priv(dev);
> @@ -782,7 +783,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
>  }
>
>  static int
> -nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
> +nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
> +                void *per_queue_mem, int idx)
>  {
>         struct nsim_queue_mem *qmem = per_queue_mem;
>         struct netdevsim *ns = netdev_priv(dev);
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index d73f9023c96f..b850cff71d12 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -32,6 +32,13 @@ struct netdev_config {
>         /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
>          */
>         u8      hds_config;
> +
> +       /** @qcfg: per-queue configuration */
> +       struct netdev_queue_config *qcfg;
> +};
> +
> +/* Same semantics as fields in struct netdev_config */
> +struct netdev_queue_config {
>  };

I was very confused why this is empty until I looked at patch 18 :-D

>
>  /* See the netdev.yaml spec for definition of each statistic */
> @@ -136,6 +143,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   *
>   * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
>   *
> + * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
> + *                     defaults. Queue config structs are passed to this
> + *                     helper before the user-requested settings are applied.
> + *
>   * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
>   *                      The new memory is written at the specified address.
>   *
> @@ -153,12 +164,17 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   */
>  struct netdev_queue_mgmt_ops {
>         size_t  ndo_queue_mem_size;
> +       void    (*ndo_queue_cfg_defaults)(struct net_device *dev,
> +                                         int idx,
> +                                         struct netdev_queue_config *qcfg);
>         int     (*ndo_queue_mem_alloc)(struct net_device *dev,
> +                                      struct netdev_queue_config *qcfg,
>                                        void *per_queue_mem,
>                                        int idx);
>         void    (*ndo_queue_mem_free)(struct net_device *dev,
>                                       void *per_queue_mem);
>         int     (*ndo_queue_start)(struct net_device *dev,
> +                                  struct netdev_queue_config *qcfg,
>                                    void *per_queue_mem,
>                                    int idx);
>         int     (*ndo_queue_stop)(struct net_device *dev,
> @@ -166,6 +182,9 @@ struct netdev_queue_mgmt_ops {
>                                   int idx);
>  };
>
> +void netdev_queue_config(struct net_device *dev, int rxq,
> +                        struct netdev_queue_config *qcfg);
> +
>  /**
>   * DOC: Lockless queue stopping / waking helpers.
>   *
> diff --git a/net/core/dev.h b/net/core/dev.h
> index 7041c8bd2a0f..a553a0f1f846 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -9,6 +9,7 @@
>  #include <net/netdev_lock.h>
>
>  struct net;
> +struct netdev_queue_config;
>  struct netlink_ext_ack;
>  struct cpumask;
>
> @@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
>  void __netdev_free_config(struct netdev_config *cfg);
>  void netdev_free_config(struct net_device *dev);
>  int netdev_reconfig_start(struct net_device *dev);
> +void __netdev_queue_config(struct net_device *dev, int rxq,
> +                          struct netdev_queue_config *qcfg, bool pending);
>
>  /* netdev management, shared between various uAPI entry points */
>  struct netdev_name_node {
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index 270b7f10a192..bad2d53522f0 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -8,18 +8,29 @@
>  int netdev_alloc_config(struct net_device *dev)
>  {
>         struct netdev_config *cfg;
> +       unsigned int maxqs;
>
>         cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
>         if (!cfg)
>                 return -ENOMEM;
>
> +       maxqs = max(dev->num_rx_queues, dev->num_tx_queues);

I honestly did not think about tx queues at all for the queue API thus
far. The ndos do specify that the API applies to rx queues, and maybe
drivers implemented them assuming the calls are indeed for rx queues
only. Are you intentionally extending the queue API to cover tx
queues? Or maybe you're allocating configs for the tx queues to be
used in some future work?

Other places in this patch series use num_rx_queues directly. Feels
like this should do the same.

> +       cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
> +       if (!cfg->qcfg)
> +               goto err_free_cfg;
> +
>         dev->cfg = cfg;
>         dev->cfg_pending = cfg;
>         return 0;
> +
> +err_free_cfg:
> +       kfree(cfg);
> +       return -ENOMEM;
>  }
>
>  void __netdev_free_config(struct netdev_config *cfg)
>  {
> +       kfree(cfg->qcfg);
>         kfree(cfg);
>  }
>
> @@ -32,12 +43,59 @@ void netdev_free_config(struct net_device *dev)
>  int netdev_reconfig_start(struct net_device *dev)
>  {
>         struct netdev_config *cfg;
> +       unsigned int maxqs;
>
>         WARN_ON(dev->cfg != dev->cfg_pending);
>         cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
>         if (!cfg)
>                 return -ENOMEM;
>
> +       maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
> +       cfg->qcfg = kmemdup_array(dev->cfg->qcfg, maxqs, sizeof(*cfg->qcfg),
> +                                 GFP_KERNEL_ACCOUNT);
> +       if (!cfg->qcfg)
> +               goto err_free_cfg;
> +
>         dev->cfg_pending = cfg;
>         return 0;
> +
> +err_free_cfg:
> +       kfree(cfg);
> +       return -ENOMEM;
> +}
> +
> +void __netdev_queue_config(struct net_device *dev, int rxq,
> +                          struct netdev_queue_config *qcfg, bool pending)
> +{
> +       memset(qcfg, 0, sizeof(*qcfg));
> +

This memset to 0 is wrong for queue configs like hds_thresh, where 0
is a value, not 'restore default'.

Either netdev_queue_config needs a comment saying 'only values where 0
means restore-default are allowed in this struct', or this function
needs to handle 0-as-value configs correctly.

But I wonder if the memset(0) is wrong in general. Isn't this helper
trying to grab the _current_ configuration? So shouldn't qcfg be
seeded with the appropriate values from dev->cfg->qcfg[rxq]? This
function reads like it's trying to get the default configuration, but
in a way that doesn't handle hds_thresh-style semantics correctly?
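
A minimal illustration of the first concern, assuming a hypothetical
per-queue field with hds_thresh-like semantics where 0 is a legal user
value (render_thresh is made up, it just mirrors the
__netdev_queue_config() pattern):

```c
static u32 render_thresh(u32 driver_default, u32 user_val)
{
	u32 val = 0;		/* memset(qcfg, 0, ...) */

	val = driver_default;	/* ndo_queue_cfg_defaults() */
	if (user_val)		/* the "0 means unset" test */
		val = user_val;
	return val;		/* a user-chosen 0 is silently lost */
}
```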

> +       /* Get defaults from the driver, in case user config not set */
> +       if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
> +               dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);

If this function is trying to get the _current_ cfg, then overwriting
with the queue defaults doesn't seem correct?

> +}
> +
> +/**
> + * netdev_queue_config() - get configuration for a given queue
> + * @dev:  net_device instance
> + * @rxq:  index of the queue of interest
> + * @qcfg: queue configuration struct (output)
> + *
> + * Render the configuration for a given queue. This helper should be used
> + * by drivers which support queue configuration to retrieve config for
> + * a particular queue.
> + *

So the helper should be used by drivers that support queue
configuration, which are a subset of the drivers that support the
queue mgmt API. I don't see in this patch a signal from the driver
that it supports queue configuration, nor the core acting on such a
signal. Even the added ndo_queue_cfg_defaults is simply not called if
it's not implemented? Is that intended to come in the later patch with
netdev_queue_mgmt_ops->supported_ring_params?

> + * @qcfg is an output parameter and is always fully initialized by this
> + * function. Some values may not be set by the user, drivers may either
> + * deal with the "unset" values in @qcfg, or provide the callback
> + * to populate defaults in queue_management_ops.
> + *
> + * Note that this helper returns pending config, as it is expected that
> + * "old" queues are retained until config is successful so they can
> + * be restored directly without asking for the config.
> + */
> +void netdev_queue_config(struct net_device *dev, int rxq,
> +                        struct netdev_queue_config *qcfg)
> +{
> +       __netdev_queue_config(dev, rxq, qcfg, true);
>  }
> +EXPORT_SYMBOL(netdev_queue_config);
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index 3bf1151d8061..fb87ce219a8a 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -7,12 +7,14 @@
>  #include <net/netdev_rx_queue.h>
>  #include <net/page_pool/memory_provider.h>
>
> +#include "dev.h"
>  #include "page_pool_priv.h"
>
>  int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>  {
>         struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
>         const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> +       struct netdev_queue_config qcfg;
>         void *new_mem, *old_mem;
>         int err;
>
> @@ -32,7 +34,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>                 goto err_free_new_mem;
>         }
>
> -       err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
> +       netdev_queue_config(dev, rxq_idx, &qcfg);
> +
> +       err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
>         if (err)
>                 goto err_free_old_mem;
>
> @@ -45,7 +49,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>                 if (err)
>                         goto err_free_new_queue_mem;
>
> -               err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
> +               err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
>                 if (err)
>                         goto err_start_queue;
>         } else {
> @@ -60,6 +64,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>         return 0;
>
>  err_start_queue:
> +       __netdev_queue_config(dev, rxq_idx, &qcfg, false);
>         /* Restarting the queue with old_mem should be successful as we haven't
>          * changed any of the queue configuration, and there is not much we can
>          * do to recover from a failure here.
> @@ -67,7 +72,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>          * WARN if we fail to recover the old rx queue, and at least free
>          * old_mem so we don't also leak that.
>          */
> -       if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
> +       if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
>                 WARN(1,
>                      "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
>                      rxq_idx);
> --
> 2.49.0
>


--
Thanks,
Mina


* Re: [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart()
  2025-08-18 13:57 ` [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
@ 2025-08-19 21:30   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 21:30 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Pass extack to netdev_rx_queue_restart(). Subsequent change will need it.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 14/23] net: add queue config validation callback
  2025-08-18 13:57 ` [PATCH net-next v3 14/23] net: add queue config validation callback Pavel Begunkov
@ 2025-08-19 21:54   ` Mina Almasry
  2025-08-20  1:31     ` Mina Almasry
  0 siblings, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 21:54 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> I imagine (tm) that as the number of per-queue configuration
> options grows some of them may conflict for certain drivers.
> While the drivers can obviously do all the validation locally
> doing so is fairly inconvenient as the config is fed to drivers
> piecemeal via different ops (for different params and NIC-wide
> vs per-queue).
>
> Add a centralized callback for validating the queue config
> in queue ops. The callback gets invoked before each queue restart
> and when ring params are modified.
>
> For NIC-wide changes the callback gets invoked for each active
> (or active to-be) queue, and additionally with a negative queue
> index for NIC-wide defaults. The NIC-wide check is needed in
> case all queues have an override active when NIC-wide setting
> is changed to an unsupported one. Alternatively we could check
> the settings when new queues are enabled (in the channel API),
> but accepting invalid config is a bad idea. Users may expect
> that resetting a queue override will always work.
>
> The "trick" of passing a negative index is a bit ugly, we may
> want to revisit if it causes confusion and bugs. Existing drivers
> don't care about the index so it "just works".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/net/netdev_queues.h | 12 ++++++++++++
>  net/core/dev.h              |  2 ++
>  net/core/netdev_config.c    | 20 ++++++++++++++++++++
>  net/core/netdev_rx_queue.c  |  6 ++++++
>  net/ethtool/rings.c         |  5 +++++
>  5 files changed, 45 insertions(+)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index b850cff71d12..d0cc475ec51e 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -147,6 +147,14 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   *                     defaults. Queue config structs are passed to this
>   *                     helper before the user-requested settings are applied.
>   *
> + * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
> + *                     Called when configuration affecting a queue may be
> + *                     changing, either due to NIC-wide config, or config
> + *                     scoped to the queue at a specified index.
> + *                     When NIC-wide config is changed the callback will
> + *                     be invoked for all queues, and in addition to that
> + *                     with a negative queue index for the base settings.
> + *
>   * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
>   *                      The new memory is written at the specified address.
>   *
> @@ -167,6 +175,10 @@ struct netdev_queue_mgmt_ops {
>         void    (*ndo_queue_cfg_defaults)(struct net_device *dev,
>                                           int idx,
>                                           struct netdev_queue_config *qcfg);
> +       int     (*ndo_queue_cfg_validate)(struct net_device *dev,
> +                                         int idx,
> +                                         struct netdev_queue_config *qcfg,
> +                                         struct netlink_ext_ack *extack);
>         int     (*ndo_queue_mem_alloc)(struct net_device *dev,
>                                        struct netdev_queue_config *qcfg,
>                                        void *per_queue_mem,
> diff --git a/net/core/dev.h b/net/core/dev.h
> index a553a0f1f846..523d50e6f88d 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
>  int netdev_reconfig_start(struct net_device *dev);
>  void __netdev_queue_config(struct net_device *dev, int rxq,
>                            struct netdev_queue_config *qcfg, bool pending);
> +int netdev_queue_config_revalidate(struct net_device *dev,
> +                                  struct netlink_ext_ack *extack);
>
>  /* netdev management, shared between various uAPI entry points */
>  struct netdev_name_node {
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index bad2d53522f0..fc700b77e4eb 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
>         __netdev_queue_config(dev, rxq, qcfg, true);
>  }
>  EXPORT_SYMBOL(netdev_queue_config);
> +
> +int netdev_queue_config_revalidate(struct net_device *dev,
> +                                  struct netlink_ext_ack *extack)
> +{
> +       const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> +       struct netdev_queue_config qcfg;
> +       int i, err;
> +
> +       if (!qops || !qops->ndo_queue_cfg_validate)
> +               return 0;
> +
> +       for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
> +               netdev_queue_config(dev, i, &qcfg);

This function as written feels pretty useless tbh. There is no config
passed in by the caller, so the function calls netdev_queue_config,
which grabs the current-or-default config (I'm not sure which tbh),
and then validates that it is applicable. But of course the current or
default configs can be applied, right?

I thought there would be a refactor in a future patch that makes this
function useful, but I don't see one.

The qcfg being applied needs to be passed in by the caller of this
function, no? That would make sense to me (the caller is wondering
whether this new config is applicable).
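
Something along these lines is what I'd have expected; the signature is
my guess at what would make sense, not anything from the series:

	int netdev_queue_config_revalidate(struct net_device *dev, int rxq,
					   struct netdev_queue_config *new_qcfg,
					   struct netlink_ext_ack *extack)
	{
		const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;

		if (!qops || !qops->ndo_queue_cfg_validate)
			return 0;

		/* Validate the config the caller wants to apply, not
		 * whatever happens to be in place already.
		 */
		return qops->ndo_queue_cfg_validate(dev, rxq, new_qcfg, extack);
	}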

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops
  2025-08-18 13:57 ` [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
@ 2025-08-19 21:57   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 21:57 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Core provides a centralized callback for validating per-queue settings
> but the callback is part of the queue management ops. Having the ops
> conditionally set complicates the parts of the driver which could
> otherwise lean on the core to feed it the correct settings.
>

On first look, tbh doing this feels like adding boilerplate code
where core could treat !dev->queue_mgmt_ops as the same thing as
(dev->queue_mgmt_ops &&
!dev->queue_mgmt_ops->ndo_queue_cfg_validate).

But if this is the direction you want to go, the patch itself looks fine to me:

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue
  2025-08-18 13:57 ` [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
@ 2025-08-19 22:36   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 22:36 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:57 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Zero-copy APIs increase the cost of buffer management. They also extend
> this cost to user space applications which may be used to dealing with
> much larger buffers. Allow setting rx-buf-len per queue, devices with
> HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1
> but that's not a power of 2..)
>
> The implementation adds a new option to the netdev netlink, rather
> than ethtool. The NIC-wide setting lives in ethtool ringparams so
> one could argue that we should be extending the ethtool API.
> OTOH netdev API is where we already have queue-get, and it's how
> zero-copy applications bind memory providers.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  Documentation/netlink/specs/netdev.yaml | 15 ++++
>  include/net/netdev_queues.h             |  5 ++
>  include/net/netlink.h                   | 19 +++++
>  include/uapi/linux/netdev.h             |  2 +
>  net/core/netdev-genl-gen.c              | 15 ++++
>  net/core/netdev-genl-gen.h              |  1 +
>  net/core/netdev-genl.c                  | 92 +++++++++++++++++++++++++
>  net/core/netdev_config.c                | 16 +++++
>  tools/include/uapi/linux/netdev.h       |  2 +
>  9 files changed, 167 insertions(+)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c035dc0f64fd..498c4bcafdbd 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -338,6 +338,10 @@ attribute-sets:
>          doc: XSK information for this queue, if any.
>          type: nest
>          nested-attributes: xsk-info
> +      -
> +        name: rx-buf-len
> +        doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
> +        type: u32
>    -
>      name: qstats
>      doc: |
> @@ -771,6 +775,17 @@ operations:
>          reply:
>            attributes:
>              - id
> +    -
> +      name: queue-set
> +      doc: Set per-queue configurable options.
> +      attribute-set: queue
> +      do:
> +        request:
> +          attributes:
> +            - ifindex
> +            - type
> +            - id
> +            - rx-buf-len
>
>  kernel-family:
>    headers: ["net/netdev_netlink.h"]
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index d0cc475ec51e..b69b1d519dcb 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -39,6 +39,7 @@ struct netdev_config {
>
>  /* Same semantics as fields in struct netdev_config */
>  struct netdev_queue_config {
> +       u32     rx_buf_len;
>  };
>
>  /* See the netdev.yaml spec for definition of each statistic */
> @@ -141,6 +142,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>  /**
>   * struct netdev_queue_mgmt_ops - netdev ops for queue management
>   *
> + * @supported_ring_params: ring params supported per queue (ETHTOOL_RING_USE_*).
> + *

Not necessarily a problem, but note that if you depend on
ETHTOOL_RING_USE_*, then queue configs need to also be ethtool
configs, which means they also need to be NIC-wide configs. Maybe
that's a plus in your eyes. I wonder if we're ever going to be in a
situation where some config makes sense per queue, but not per NIC.
Memory providers are already sort of in that category.

>   * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
>   *
>   * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
> @@ -171,6 +174,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   * be called for an interface which is open.
>   */
>  struct netdev_queue_mgmt_ops {
> +       u32     supported_ring_params;
> +
>         size_t  ndo_queue_mem_size;
>         void    (*ndo_queue_cfg_defaults)(struct net_device *dev,
>                                           int idx,
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 1a8356ca4b78..29989ad81ddd 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -2200,6 +2200,25 @@ static inline struct nla_bitfield32 nla_get_bitfield32(const struct nlattr *nla)
>         return tmp;
>  }
>
> +/**
> + * nla_update_u32() - update u32 value from NLA_U32 attribute
> + * @dst:  value to update
> + * @attr: netlink attribute with new value or null
> + *
> + * Copy the u32 value from NLA_U32 netlink attribute @attr into variable
> + * pointed to by @dst; do nothing if @attr is null.
> + *
> + * Return: true if this function changed the value of @dst, otherwise false.
> + */
> +static inline bool nla_update_u32(u32 *dst, const struct nlattr *attr)
> +{
> +       u32 old_val = *dst;
> +
> +       if (attr)
> +               *dst = nla_get_u32(attr);
> +       return *dst != old_val;
> +}
> +
>  /**
>   * nla_memdup - duplicate attribute memory (kmemdup)
>   * @src: netlink attribute to duplicate from
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 48eb49aa03d4..820f89b67a72 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -158,6 +158,7 @@ enum {
>         NETDEV_A_QUEUE_DMABUF,
>         NETDEV_A_QUEUE_IO_URING,
>         NETDEV_A_QUEUE_XSK,
> +       NETDEV_A_QUEUE_RX_BUF_LEN,
>
>         __NETDEV_A_QUEUE_MAX,
>         NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
> @@ -226,6 +227,7 @@ enum {
>         NETDEV_CMD_BIND_RX,
>         NETDEV_CMD_NAPI_SET,
>         NETDEV_CMD_BIND_TX,
> +       NETDEV_CMD_QUEUE_SET,
>
>         __NETDEV_CMD_MAX,
>         NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index e9a2a6f26cb7..d053306a3af8 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -106,6 +106,14 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
>         [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
>  };
>
> +/* NETDEV_CMD_QUEUE_SET - do */
> +static const struct nla_policy netdev_queue_set_nl_policy[NETDEV_A_QUEUE_RX_BUF_LEN + 1] = {
> +       [NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> +       [NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
> +       [NETDEV_A_QUEUE_ID] = { .type = NLA_U32, },
> +       [NETDEV_A_QUEUE_RX_BUF_LEN] = { .type = NLA_U32, },
> +};
> +
>  /* Ops table for netdev */
>  static const struct genl_split_ops netdev_nl_ops[] = {
>         {
> @@ -204,6 +212,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
>                 .maxattr        = NETDEV_A_DMABUF_FD,
>                 .flags          = GENL_CMD_CAP_DO,
>         },
> +       {
> +               .cmd            = NETDEV_CMD_QUEUE_SET,
> +               .doit           = netdev_nl_queue_set_doit,
> +               .policy         = netdev_queue_set_nl_policy,
> +               .maxattr        = NETDEV_A_QUEUE_RX_BUF_LEN,
> +               .flags          = GENL_CMD_CAP_DO,
> +       },
>  };
>
>  static const struct genl_multicast_group netdev_nl_mcgrps[] = {
> diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
> index cf3fad74511f..b7f5e5d9fca9 100644
> --- a/net/core/netdev-genl-gen.h
> +++ b/net/core/netdev-genl-gen.h
> @@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
>  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
>  int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
>  int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
> +int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info);
>
>  enum {
>         NETDEV_NLGRP_MGMT,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 6314eb7bdf69..abb128e45fcf 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -386,6 +386,30 @@ static int nla_put_napi_id(struct sk_buff *skb, const struct napi_struct *napi)
>         return 0;
>  }
>
> +static int
> +netdev_nl_queue_fill_cfg(struct sk_buff *rsp, struct net_device *netdev,
> +                        u32 q_idx, u32 q_type)
> +{
> +       struct netdev_queue_config *qcfg;
> +
> +       if (!netdev_need_ops_lock(netdev))
> +               return 0;
> +

Why are we checking this? I don't see this function doing any locking.

> +       qcfg = &netdev->cfg->qcfg[q_idx];
> +       switch (q_type) {
> +       case NETDEV_QUEUE_TYPE_RX:
> +               if (qcfg->rx_buf_len &&
> +                   nla_put_u32(rsp, NETDEV_A_QUEUE_RX_BUF_LEN,
> +                               qcfg->rx_buf_len))
> +                       return -EMSGSIZE;
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return 0;
> +}
> +
>  static int
>  netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
>                          u32 q_idx, u32 q_type, const struct genl_info *info)
> @@ -433,6 +457,9 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
>                 break;
>         }
>
> +       if (netdev_nl_queue_fill_cfg(rsp, netdev, q_idx, q_type))
> +               goto nla_put_failure;
> +
>         genlmsg_end(rsp, hdr);
>
>         return 0;
> @@ -572,6 +599,71 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>         return err;
>  }
>
> +int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info)
> +{
> +       struct nlattr * const *tb = info->attrs;
> +       struct netdev_queue_config *qcfg;
> +       u32 q_id, q_type, ifindex;
> +       struct net_device *netdev;
> +       bool mod;
> +       int ret;
> +
> +       if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_ID) ||
> +           GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_TYPE) ||
> +           GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_IFINDEX))
> +               return -EINVAL;
> +
> +       q_id = nla_get_u32(tb[NETDEV_A_QUEUE_ID]);
> +       q_type = nla_get_u32(tb[NETDEV_A_QUEUE_TYPE]);
> +       ifindex = nla_get_u32(tb[NETDEV_A_QUEUE_IFINDEX]);
> +
> +       if (q_type != NETDEV_QUEUE_TYPE_RX) {
> +               /* Only Rx params exist right now */
> +               NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_TYPE]);
> +               return -EINVAL;
> +       }
> +
> +       ret = 0;
> +       netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex);
> +       if (!netdev || !netif_device_present(netdev))
> +               ret = -ENODEV;
> +       else if (!netdev->queue_mgmt_ops)
> +               ret = -EOPNOTSUPP;
> +       if (ret) {
> +               NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_IFINDEX]);
> +               goto exit_unlock;
> +       }
> +
> +       ret = netdev_nl_queue_validate(netdev, q_id, q_type);
> +       if (ret) {
> +               NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_ID]);
> +               goto exit_unlock;
> +       }
> +
> +       ret = netdev_reconfig_start(netdev);
> +       if (ret)
> +               goto exit_unlock;
> +
> +       qcfg = &netdev->cfg_pending->qcfg[q_id];
> +       mod = nla_update_u32(&qcfg->rx_buf_len, tb[NETDEV_A_QUEUE_RX_BUF_LEN]);

Don't you need to check queue_mgmt_ops->supported_ring_params first to
know that this driver actually supports modifying rx_buf_len?
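
Something along these lines before touching the pending config, maybe
(a sketch only; supported_ring_params and ETHTOOL_RING_USE_RX_BUF_LEN
show up in the later patches, so I'm assuming the names from there):

	if (tb[NETDEV_A_QUEUE_RX_BUF_LEN] &&
	    !(netdev->queue_mgmt_ops->supported_ring_params &
	      ETHTOOL_RING_USE_RX_BUF_LEN)) {
		NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_RX_BUF_LEN]);
		ret = -EOPNOTSUPP;
		goto exit_unlock;
	}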

> +       if (!mod)
> +               goto exit_free_cfg;
> +
> +       ret = netdev_rx_queue_restart(netdev, q_id, info->extack);
> +       if (ret)
> +               goto exit_free_cfg;
> +
> +       swap(netdev->cfg, netdev->cfg_pending);
> +
> +exit_free_cfg:
> +       __netdev_free_config(netdev->cfg_pending);
> +       netdev->cfg_pending = netdev->cfg;
> +exit_unlock:
> +       if (netdev)
> +               netdev_unlock(netdev);
> +       return ret;
> +}
> +
>  #define NETDEV_STAT_NOT_SET            (~0ULL)
>
>  static void netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index fc700b77e4eb..ede02b77470e 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -67,11 +67,27 @@ int netdev_reconfig_start(struct net_device *dev)
>  void __netdev_queue_config(struct net_device *dev, int rxq,
>                            struct netdev_queue_config *qcfg, bool pending)
>  {
> +       const struct netdev_config *cfg;
> +
> +       cfg = pending ? dev->cfg_pending : dev->cfg;
> +
>         memset(qcfg, 0, sizeof(*qcfg));
>
>         /* Get defaults from the driver, in case user config not set */
>         if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
>                 dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
> +
> +       /* Set config based on device-level settings */
> +       if (cfg->rx_buf_len)
> +               qcfg->rx_buf_len = cfg->rx_buf_len;
> +
> +       /* Set config dedicated to this queue */
> +       if (rxq >= 0) {
> +               const struct netdev_queue_config *user_cfg = &cfg->qcfg[rxq];
> +
> +               if (user_cfg->rx_buf_len)
> +                       qcfg->rx_buf_len = user_cfg->rx_buf_len;
> +       }
>  }
>
>  /**
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 48eb49aa03d4..820f89b67a72 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -158,6 +158,7 @@ enum {
>         NETDEV_A_QUEUE_DMABUF,
>         NETDEV_A_QUEUE_IO_URING,
>         NETDEV_A_QUEUE_XSK,
> +       NETDEV_A_QUEUE_RX_BUF_LEN,
>
>         __NETDEV_A_QUEUE_MAX,
>         NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
> @@ -226,6 +227,7 @@ enum {
>         NETDEV_CMD_BIND_RX,
>         NETDEV_CMD_NAPI_SET,
>         NETDEV_CMD_BIND_TX,
> +       NETDEV_CMD_QUEUE_SET,
>
>         __NETDEV_CMD_MAX,
>         NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> --
> 2.49.0
>


-- 
Thanks,
Mina


* Re: [PATCH net-next v3 19/23] net: wipe the setting of deactived queues
  2025-08-18 13:57 ` [PATCH net-next v3 19/23] net: wipe the setting of deactived queues Pavel Begunkov
@ 2025-08-19 22:49   ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-19 22:49 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 6:57 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> Clear out all settings of deactivated queues when the user changes
> the number of channels. We already perform similar cleanup
> for shapers.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 14/23] net: add queue config validation callback
  2025-08-19 21:54   ` Mina Almasry
@ 2025-08-20  1:31     ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-20  1:31 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Tue, Aug 19, 2025 at 2:54 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >
> > From: Jakub Kicinski <kuba@kernel.org>
> >
> > I imagine (tm) that as the number of per-queue configuration
> > options grows some of them may conflict for certain drivers.
> > While the drivers can obviously do all the validation locally
> > doing so is fairly inconvenient as the config is fed to drivers
> > piecemeal via different ops (for different params and NIC-wide
> > vs per-queue).
> >
> > Add a centralized callback for validating the queue config
> > in queue ops. The callback gets invoked before each queue restart
> > and when ring params are modified.
> >
> > For NIC-wide changes the callback gets invoked for each active
> > (or active to-be) queue, and additionally with a negative queue
> > index for NIC-wide defaults. The NIC-wide check is needed in
> > case all queues have an override active when NIC-wide setting
> > is changed to an unsupported one. Alternatively we could check
> > the settings when new queues are enabled (in the channel API),
> > but accepting invalid config is a bad idea. Users may expect
> > that resetting a queue override will always work.
> >
> > The "trick" of passing a negative index is a bit ugly, we may
> > want to revisit if it causes confusion and bugs. Existing drivers
> > don't care about the index so it "just works".
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > ---
> >  include/net/netdev_queues.h | 12 ++++++++++++
> >  net/core/dev.h              |  2 ++
> >  net/core/netdev_config.c    | 20 ++++++++++++++++++++
> >  net/core/netdev_rx_queue.c  |  6 ++++++
> >  net/ethtool/rings.c         |  5 +++++
> >  5 files changed, 45 insertions(+)
> >
> > diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> > index b850cff71d12..d0cc475ec51e 100644
> > --- a/include/net/netdev_queues.h
> > +++ b/include/net/netdev_queues.h
> > @@ -147,6 +147,14 @@ void netdev_stat_queue_sum(struct net_device *netdev,
> >   *                     defaults. Queue config structs are passed to this
> >   *                     helper before the user-requested settings are applied.
> >   *
> > + * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
> > + *                     Called when configuration affecting a queue may be
> > + *                     changing, either due to NIC-wide config, or config
> > + *                     scoped to the queue at a specified index.
> > + *                     When NIC-wide config is changed the callback will
> > + *                     be invoked for all queues, and in addition to that
> > + *                     with a negative queue index for the base settings.
> > + *
> >   * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
> >   *                      The new memory is written at the specified address.
> >   *
> > @@ -167,6 +175,10 @@ struct netdev_queue_mgmt_ops {
> >         void    (*ndo_queue_cfg_defaults)(struct net_device *dev,
> >                                           int idx,
> >                                           struct netdev_queue_config *qcfg);
> > +       int     (*ndo_queue_cfg_validate)(struct net_device *dev,
> > +                                         int idx,
> > +                                         struct netdev_queue_config *qcfg,
> > +                                         struct netlink_ext_ack *extack);
> >         int     (*ndo_queue_mem_alloc)(struct net_device *dev,
> >                                        struct netdev_queue_config *qcfg,
> >                                        void *per_queue_mem,
> > diff --git a/net/core/dev.h b/net/core/dev.h
> > index a553a0f1f846..523d50e6f88d 100644
> > --- a/net/core/dev.h
> > +++ b/net/core/dev.h
> > @@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
> >  int netdev_reconfig_start(struct net_device *dev);
> >  void __netdev_queue_config(struct net_device *dev, int rxq,
> >                            struct netdev_queue_config *qcfg, bool pending);
> > +int netdev_queue_config_revalidate(struct net_device *dev,
> > +                                  struct netlink_ext_ack *extack);
> >
> >  /* netdev management, shared between various uAPI entry points */
> >  struct netdev_name_node {
> > diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> > index bad2d53522f0..fc700b77e4eb 100644
> > --- a/net/core/netdev_config.c
> > +++ b/net/core/netdev_config.c
> > @@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
> >         __netdev_queue_config(dev, rxq, qcfg, true);
> >  }
> >  EXPORT_SYMBOL(netdev_queue_config);
> > +
> > +int netdev_queue_config_revalidate(struct net_device *dev,
> > +                                  struct netlink_ext_ack *extack)
> > +{
> > +       const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> > +       struct netdev_queue_config qcfg;
> > +       int i, err;
> > +
> > +       if (!qops || !qops->ndo_queue_cfg_validate)
> > +               return 0;
> > +
> > +       for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
> > +               netdev_queue_config(dev, i, &qcfg);
>
> This function as written feels pretty useless tbh. There is no config
> passed in by the caller, so the function calls netdev_queue_config,
> which grabs the current-or-default config (I'm not sure which tbh),
> and then validates that it is applicable. But of course the current or
> default configs can be applied, right?
>
> I thought there would be a refactor in a future patch that makes this
> function useful, but I don't see one.
>
> The qcfg being applied needs to be passed in by the caller of this
> function, no? That would make sense to me (the caller is wondering
> whether this new config is applicable).
>

OK, I misunderstood how this works on first read. netdev_queue_config
returns the pending config, not the current one, and that is what's
being validated. I'll give this a closer look.



-- 
Thanks,
Mina


* Re: [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API
  2025-08-19 21:29   ` Mina Almasry
@ 2025-08-20  1:32     ` Mina Almasry
  0 siblings, 0 replies; 58+ messages in thread
From: Mina Almasry @ 2025-08-20  1:32 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Tue, Aug 19, 2025 at 2:29 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >
> > From: Jakub Kicinski <kuba@kernel.org>
> >
> > Create an array of config structs to store per-queue config.
> > Pass these structs in the queue API. Drivers can also retrieve
> > the config for a single queue calling netdev_queue_config()
> > directly.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > [pavel: patch up mlx callbacks with unused qcfg]
> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  8 ++-
> >  drivers/net/ethernet/google/gve/gve_main.c    |  9 ++-
> >  .../net/ethernet/mellanox/mlx5/core/en_main.c |  9 +--
> >  drivers/net/netdevsim/netdev.c                |  6 +-
> >  include/net/netdev_queues.h                   | 19 ++++++
> >  net/core/dev.h                                |  3 +
> >  net/core/netdev_config.c                      | 58 +++++++++++++++++++
> >  net/core/netdev_rx_queue.c                    | 11 +++-
> >  8 files changed, 109 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index d3d9b72ef313..48ff6f024e07 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -15824,7 +15824,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
> >         .get_base_stats         = bnxt_get_base_stats,
> >  };
> >
> > -static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
> > +static int bnxt_queue_mem_alloc(struct net_device *dev,
> > +                               struct netdev_queue_config *qcfg,
> > +                               void *qmem, int idx)
> >  {
> >         struct bnxt_rx_ring_info *rxr, *clone;
> >         struct bnxt *bp = netdev_priv(dev);
> > @@ -15992,7 +15994,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
> >         dst->rx_agg_bmap = src->rx_agg_bmap;
> >  }
> >
> > -static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
> > +static int bnxt_queue_start(struct net_device *dev,
> > +                           struct netdev_queue_config *qcfg,
> > +                           void *qmem, int idx)
> >  {
> >         struct bnxt *bp = netdev_priv(dev);
> >         struct bnxt_rx_ring_info *rxr, *clone;
> > diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> > index 1f411d7c4373..f40edab616d8 100644
> > --- a/drivers/net/ethernet/google/gve/gve_main.c
> > +++ b/drivers/net/ethernet/google/gve/gve_main.c
> > @@ -2580,8 +2580,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
> >                 gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
> >  }
> >
> > -static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
> > -                                 int idx)
> > +static int gve_rx_queue_mem_alloc(struct net_device *dev,
> > +                                 struct netdev_queue_config *qcfg,
> > +                                 void *per_q_mem, int idx)
> >  {
> >         struct gve_priv *priv = netdev_priv(dev);
> >         struct gve_rx_alloc_rings_cfg cfg = {0};
> > @@ -2602,7 +2603,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
> >         return err;
> >  }
> >
> > -static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
> > +static int gve_rx_queue_start(struct net_device *dev,
> > +                             struct netdev_queue_config *qcfg,
> > +                             void *per_q_mem, int idx)
> >  {
> >         struct gve_priv *priv = netdev_priv(dev);
> >         struct gve_rx_ring *gve_per_q_mem;
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 21bb88c5d3dc..83264c17a4f7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -5541,8 +5541,9 @@ struct mlx5_qmgmt_data {
> >         struct mlx5e_channel_param cparam;
> >  };
> >
> > -static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq,
> > -                                int queue_index)
> > +static int mlx5e_queue_mem_alloc(struct net_device *dev,
> > +                                struct netdev_queue_config *qcfg,
> > +                                void *newq, int queue_index)
> >  {
> >         struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
> >         struct mlx5e_priv *priv = netdev_priv(dev);
> > @@ -5603,8 +5604,8 @@ static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
> >         return 0;
> >  }
> >
> > -static int mlx5e_queue_start(struct net_device *dev, void *newq,
> > -                            int queue_index)
> > +static int mlx5e_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
> > +                            void *newq, int queue_index)
> >  {
> >         struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
> >         struct mlx5e_priv *priv = netdev_priv(dev);
> > diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> > index 0178219f0db5..985c3403ec57 100644
> > --- a/drivers/net/netdevsim/netdev.c
> > +++ b/drivers/net/netdevsim/netdev.c
> > @@ -733,7 +733,8 @@ struct nsim_queue_mem {
> >  };
> >
> >  static int
> > -nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
> > +nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
> > +                    void *per_queue_mem, int idx)
> >  {
> >         struct nsim_queue_mem *qmem = per_queue_mem;
> >         struct netdevsim *ns = netdev_priv(dev);
> > @@ -782,7 +783,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
> >  }
> >
> >  static int
> > -nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
> > +nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
> > +                void *per_queue_mem, int idx)
> >  {
> >         struct nsim_queue_mem *qmem = per_queue_mem;
> >         struct netdevsim *ns = netdev_priv(dev);
> > diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> > index d73f9023c96f..b850cff71d12 100644
> > --- a/include/net/netdev_queues.h
> > +++ b/include/net/netdev_queues.h
> > @@ -32,6 +32,13 @@ struct netdev_config {
> >         /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
> >          */
> >         u8      hds_config;
> > +
> > +       /** @qcfg: per-queue configuration */
> > +       struct netdev_queue_config *qcfg;
> > +};
> > +
> > +/* Same semantics as fields in struct netdev_config */
> > +struct netdev_queue_config {
> >  };
>
> I was very confused why this is empty until I looked at patch 18 :-D
>
> >
> >  /* See the netdev.yaml spec for definition of each statistic */
> > @@ -136,6 +143,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
> >   *
> >   * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
> >   *
> > + * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
> > + *                     defaults. Queue config structs are passed to this
> > + *                     helper before the user-requested settings are applied.
> > + *
> >   * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
> >   *                      The new memory is written at the specified address.
> >   *
> > @@ -153,12 +164,17 @@ void netdev_stat_queue_sum(struct net_device *netdev,
> >   */
> >  struct netdev_queue_mgmt_ops {
> >         size_t  ndo_queue_mem_size;
> > +       void    (*ndo_queue_cfg_defaults)(struct net_device *dev,
> > +                                         int idx,
> > +                                         struct netdev_queue_config *qcfg);
> >         int     (*ndo_queue_mem_alloc)(struct net_device *dev,
> > +                                      struct netdev_queue_config *qcfg,
> >                                        void *per_queue_mem,
> >                                        int idx);
> >         void    (*ndo_queue_mem_free)(struct net_device *dev,
> >                                       void *per_queue_mem);
> >         int     (*ndo_queue_start)(struct net_device *dev,
> > +                                  struct netdev_queue_config *qcfg,
> >                                    void *per_queue_mem,
> >                                    int idx);
> >         int     (*ndo_queue_stop)(struct net_device *dev,
> > @@ -166,6 +182,9 @@ struct netdev_queue_mgmt_ops {
> >                                   int idx);
> >  };
> >
> > +void netdev_queue_config(struct net_device *dev, int rxq,
> > +                        struct netdev_queue_config *qcfg);
> > +
> >  /**
> >   * DOC: Lockless queue stopping / waking helpers.
> >   *
> > diff --git a/net/core/dev.h b/net/core/dev.h
> > index 7041c8bd2a0f..a553a0f1f846 100644
> > --- a/net/core/dev.h
> > +++ b/net/core/dev.h
> > @@ -9,6 +9,7 @@
> >  #include <net/netdev_lock.h>
> >
> >  struct net;
> > +struct netdev_queue_config;
> >  struct netlink_ext_ack;
> >  struct cpumask;
> >
> > @@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
> >  void __netdev_free_config(struct netdev_config *cfg);
> >  void netdev_free_config(struct net_device *dev);
> >  int netdev_reconfig_start(struct net_device *dev);
> > +void __netdev_queue_config(struct net_device *dev, int rxq,
> > +                          struct netdev_queue_config *qcfg, bool pending);
> >
> >  /* netdev management, shared between various uAPI entry points */
> >  struct netdev_name_node {
> > diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> > index 270b7f10a192..bad2d53522f0 100644
> > --- a/net/core/netdev_config.c
> > +++ b/net/core/netdev_config.c
> > @@ -8,18 +8,29 @@
> >  int netdev_alloc_config(struct net_device *dev)
> >  {
> >         struct netdev_config *cfg;
> > +       unsigned int maxqs;
> >
> >         cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> >         if (!cfg)
> >                 return -ENOMEM;
> >
> > +       maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
>
> I honestly did not think about tx queues at all for the queue api thus
> far. The ndos do specify that api applies to rx queues, and maybe the
> driver only implemented them assuming the calls are indeed for rx
> queues. Are you intentionally extending the queue API to support tx
> queues? Or maybe you're allocating configs for the tx queues for some
> future use?
>
> Other places in this patch series use num_rx_queues directly. Feels
> like this should do the same.
>
> > +       cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
> > +       if (!cfg->qcfg)
> > +               goto err_free_cfg;
> > +
> >         dev->cfg = cfg;
> >         dev->cfg_pending = cfg;
> >         return 0;
> > +
> > +err_free_cfg:
> > +       kfree(cfg);
> > +       return -ENOMEM;
> >  }
> >
> >  void __netdev_free_config(struct netdev_config *cfg)
> >  {
> > +       kfree(cfg->qcfg);
> >         kfree(cfg);
> >  }
> >
> > @@ -32,12 +43,59 @@ void netdev_free_config(struct net_device *dev)
> >  int netdev_reconfig_start(struct net_device *dev)
> >  {
> >         struct netdev_config *cfg;
> > +       unsigned int maxqs;
> >
> >         WARN_ON(dev->cfg != dev->cfg_pending);
> >         cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> >         if (!cfg)
> >                 return -ENOMEM;
> >
> > +       maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
> > +       cfg->qcfg = kmemdup_array(dev->cfg->qcfg, maxqs, sizeof(*cfg->qcfg),
> > +                                 GFP_KERNEL_ACCOUNT);
> > +       if (!cfg->qcfg)
> > +               goto err_free_cfg;
> > +
> >         dev->cfg_pending = cfg;
> >         return 0;
> > +
> > +err_free_cfg:
> > +       kfree(cfg);
> > +       return -ENOMEM;
> > +}
> > +
> > +void __netdev_queue_config(struct net_device *dev, int rxq,
> > +                          struct netdev_queue_config *qcfg, bool pending)
> > +{
> > +       memset(qcfg, 0, sizeof(*qcfg));
> > +
>
> This memset to 0 is wrong for queue configs like hds_thresh where 0 is
> a valid value, not 'restore default'.
>
> Either netdev_queue_config needs a comment saying 'only values where
> 0 means restore-default are allowed in this struct', or this function
> needs to handle 0-as-value configs correctly.
>
> But I wonder if the memset(0) is wrong in general. Isn't this helper
> trying to grab the _current_ configuration? If so, qcfg should be
> seeded with the appropriate values from dev->qcfgs[rxq]. As written,
> the function reads like it's trying to get the default configuration,
> but in a way that doesn't handle hds_thresh-style semantics correctly?
>

Never mind this comment; a closer review of patch 18 actually answered
this question. You are indeed grabbing the configuration from
dev->qcfgs[rxq], you're just not doing it here because struct
netdev_queue_config is still empty.

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
                   ` (23 preceding siblings ...)
  2025-08-18 13:59 ` [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
@ 2025-08-20  2:31 ` Jakub Kicinski
  2025-08-20 13:39   ` Pavel Begunkov
  24 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2025-08-20  2:31 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Mon, 18 Aug 2025 14:57:16 +0100 Pavel Begunkov wrote:
> Jakub Kicinski (20):

I think we need to revisit how we operate.
When we started the ZC work w/ io-uring I suggested a permanent shared
branch. That's perhaps overkill. What I did not expect is that you
would not even CC netdev@ on changes to io_uring/zcrx.*

I don't mean to assert any sort of ownership of that code, but you're
not meeting basic collaboration standards for the kernel. This needs 
to change first.
-- 
pw-bot: defer


* Re: [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default
  2025-08-19 19:27       ` Mina Almasry
@ 2025-08-20 11:53         ` Pavel Begunkov
  0 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-20 11:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 20:27, Mina Almasry wrote:
> On Tue, Aug 19, 2025 at 8:51 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 8/19/25 01:07, Mina Almasry wrote:
>>> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>
>>>> Distinguish between rx_buf_len being driver default vs user config.
>>>> Use 0 as a special value meaning "unset" or "restore driver default".
>>>> This will be necessary later on to configure it per-queue, but
>>>> the ability to restore defaults may be useful in itself.
>>>>
>>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>
>>> I wonder if it should be extended to the other driver using
>>> rx_buf_len, hns3. For that, I think the default buf size would be
>>> HNS3_DEFAULT_RX_BUF_LEN.
>>
>> I'd rather avoid growing the series even more, let's follow up on
>> that in a separate patch on top, that should be just fine. And
>> thanks for the review
>>
>>> Other than that, seems fine to me,
>>>
>>> Reviewed-by: Mina Almasry <almasrymina@google.com>
>>
>> With the said above, do you want me to retain the review tag?
>>
> 
> I initially thought adding my reviewed-by would be fine, but on closer
> look, doesn't this series break rx_buf_len setting for hns3? AFAICT so
> far, in patch 3 you're adding a check to ethnl_set_rings where it'll
> be an error if rx_buf_len > rx_buf_len_max, and I'm guessing that if
> the driver never sets rx_buf_len_max it'll be zero-initialized and
> that check will always fail? Or did I miss something?
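>
> I.e. if the check is roughly like this (my paraphrase, not the literal
> hunk from patch 3):
>
> 	if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max) {
> 		NL_SET_ERR_MSG_ATTR(info->extack,
> 				    tb[ETHTOOL_A_RINGS_RX_BUF_LEN],
> 				    "rx_buf_len exceeds device maximum");
> 		return -EINVAL;
> 	}
>
> then a driver that never fills in rx_buf_len_max can never accept any
> user-supplied value.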

Good point, it'll need to be fixed then. I'll take a closer look.

-- 
Pavel Begunkov



* Re: [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members
  2025-08-19  1:46   ` Mina Almasry
@ 2025-08-20 12:04     ` Pavel Begunkov
  0 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-20 12:04 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 02:46, Mina Almasry wrote:
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> hds_thresh and hds_config are both inside struct netdev_config
>> but have quite different semantics. hds_config is the user config
>> with ternary semantics (on/off/unset). hds_thresh is a straight
>> up value, populated by the driver at init and only modified by
>> user space. We don't expect the drivers to have to pick a special
>> hds_thresh value based on other configuration.
>>
>> The two approaches have different advantages and downsides.
>> hds_thresh ("direct value") gives core easy access to current
>> device settings, but there's no way to express whether the value
>> comes from the user. It also requires the initialization by
>> the driver.
>>
>> hds_config ("user config values") tells us what user wanted, but
>> doesn't give us the current value in the core.
>>
>> Try to explain this a bit in the comments, so at we make a conscious
>> choice for new values which semantics we expect.
>>
>> Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
>> Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
>> returns a hds_thresh value always as 0.") added the setting for the
>> benefit of netdevsim which doesn't touch the value at all on get.
>> Again, this is just to clarify the intention, shouldn't cause any
>> functional change.
>>
> 
> TBH I can't say that moving the init to before
> dev->ethtool_ops->get_ringparam(dev, param, kparam, extack) made me
> understand the semantics better.

I agree, it didn't do it for me either ...

> If you do a respin, maybe a comment above the kparam->hds_thresh
> assignment saying what you mean would help the next reader
> understand.

... and since the move doesn't have a strong semantic meaning, I
can't think of a good comment to put on top of the assignment.
hds_thresh is already described in struct netdev_config, and that
seems like a better place for such documentation. Thoughts?

-- 
Pavel Begunkov



* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-20  2:31 ` Jakub Kicinski
@ 2025-08-20 13:39   ` Pavel Begunkov
  2025-08-20 13:59     ` Mina Almasry
  2025-08-21  1:37     ` Jakub Kicinski
  0 siblings, 2 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-20 13:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/20/25 03:31, Jakub Kicinski wrote:
> On Mon, 18 Aug 2025 14:57:16 +0100 Pavel Begunkov wrote:
>> Jakub Kicinski (20):
> 
> I think we need to revisit how we operate.
> When we started the ZC work w/ io-uring I suggested a permanent shared
> branch. That's perhaps an overkill. What I did not expect is that you
> will not even CC netdev@ on changes to io_uring/zcrx.*
> 
> I don't mean to assert any sort of ownership of that code, but you're
> not meeting basic collaboration standards for the kernel. This needs
> to change first.

Those are quite the allegations. Basic collaboration standards don't
include spamming people with unrelated changes via an already busy list.
I cc'ed netdev on patches that meaningfully change how io_uring interacts
(including indirectly) with netdev and/or might be of interest, which
goes beyond the usual standard expected of a project using infrastructure
provided by a subsystem. There are pieces that don't touch netdev, like
how io_uring pins pages, accounts memory, sets up rings, etc. In the
very same way, generic io_uring patches are not normally posted to
netdev, and netdev patches are not redirected to mm just because they
contain kmalloc calls, even though that's not even the standard used
here.

If there is some specific way you want us to work, I'd appreciate a
clear indication of it, because the message you mentioned was answered
and I never heard any objection, or anything else really.

-- 
Pavel Begunkov



* Re: [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct
  2025-08-19 19:37   ` Mina Almasry
@ 2025-08-20 13:43     ` Pavel Begunkov
  0 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-20 13:43 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 20:37, Mina Almasry wrote:
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
>> inside struct bnxt. This will allow configuring the page size
>> at runtime in subsequent patches.
>>
>> The MSS size calculation for older chip continues to use the constant.
>> I'm intending to support the configuration only on more recent HW,
>> looks like on older chips setting this per queue won't work,
>> and that's the ultimate goal.
>>
>> This patch should not change the current behavior as value
>> read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> 
> nit: AFAIU BNXT_RX_PAGE_SIZE should be unused after this? You could
> delete the definition in bnxt.h if so.

It's still used in a couple of places, notably as the minimum and/or
default size.
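
To illustrate the pattern under discussion, a minimal sketch; the field
width and the init site are assumptions for this example, not the
actual driver hunks:

	/* the compile-time constant becomes a runtime default */
	struct bnxt {
		/* ... */
		u16	rx_page_size;	/* read where BNXT_RX_PAGE_SIZE was used */
	};

	/* at init; later patches make the value configurable */
	bp->rx_page_size = BNXT_RX_PAGE_SIZE;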

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size
  2025-08-19 19:43   ` Mina Almasry
@ 2025-08-20 13:51     ` Pavel Begunkov
  2025-08-25  6:09       ` Somnath Kotur
  0 siblings, 1 reply; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-20 13:51 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/19/25 20:43, Mina Almasry wrote:
> On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> If the user decides to increase the buffer size for the agg ring,
>> we need to ask the page pool for higher-order pages.
>> There is no need to use larger pages for header frags,
>> so if the user increases the size of agg ring buffers, switch
>> to a separate header page automatically.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> [pavel: calculate adjust max_len]
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> index 5307b33ea1c7..d3d9b72ef313 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> @@ -3824,11 +3824,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>>          pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
>>          if (BNXT_RX_PAGE_MODE(bp))
>>                  pp.pool_size += bp->rx_ring_size / rx_size_fac;
>> +
>> +       pp.order = get_order(bp->rx_page_size);
>>          pp.nid = numa_node;
>>          pp.netdev = bp->dev;
>>          pp.dev = &bp->pdev->dev;
>>          pp.dma_dir = bp->rx_dir;
>> -       pp.max_len = PAGE_SIZE;
>> +       pp.max_len = PAGE_SIZE << pp.order;
> 
> nit: I assume this could be `pp.max_len = bp->rx_page_size;` if you
> wanted, since bnxt is not actually using the full compound page in the
> case that bp->rx_page_size is not a power of 2. Though it doesn't
> matter much either way:

Yeah, I thought it's cleaner to derive it from the order in case
something about rx_page_size changes again, since it was already
overlooked once, and it's pow2 anyway.
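
For reference, the arithmetic being discussed as a standalone sketch,
with example numbers (4K pages assumed):

	/* get_order() rounds a byte size up to a power-of-two page order */
	unsigned int order   = get_order(bp->rx_page_size);
	unsigned int max_len = PAGE_SIZE << order;

	/* e.g. rx_page_size = 32768 with PAGE_SIZE = 4096:
	 * order = 3, max_len = 4096 << 3 = 32768, so the pool hands out
	 * order-3 compound pages and the whole page is usable -- which is
	 * why deriving max_len from the order is safe for pow2 sizes.
	 */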

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-20 13:39   ` Pavel Begunkov
@ 2025-08-20 13:59     ` Mina Almasry
  2025-08-21  1:26       ` Jakub Kicinski
  2025-08-21  1:37     ` Jakub Kicinski
  1 sibling, 1 reply; 58+ messages in thread
From: Mina Almasry @ 2025-08-20 13:59 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Wed, Aug 20, 2025 at 6:38 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/20/25 03:31, Jakub Kicinski wrote:
> > On Mon, 18 Aug 2025 14:57:16 +0100 Pavel Begunkov wrote:
> >> Jakub Kicinski (20):
> >
> > I think we need to revisit how we operate.
> > When we started the ZC work w/ io-uring I suggested a permanent shared
> > branch. That's perhaps an overkill. What I did not expect is that you
> > will not even CC netdev@ on changes to io_uring/zcrx.*
> >
> > I don't mean to assert any sort of ownership of that code, but you're
> > not meeting basic collaboration standards for the kernel. This needs
> > to change first.
>
> You're throwing quite the allegations. Basic collaboration standards don't
> include spamming people with unrelated changes via an already busy list.
> I cc'ed netdev on patches that meaningfully change how io_uring interacts
> (incl. indirectly) with netdev and/or might be of interest, which is
> beyond the usual standard expected of a project using infrastructure
> provided by a subsystem. There are pieces that don't touch netdev at all,
> like how io_uring pins pages, accounts memory, sets up rings, etc. In the
> very same way, generic io_uring patches are not normally posted to
> netdev, and netdev patches are not redirected to mm just because they
> contain kmalloc calls, even though that's not even the standard applied here.
>
> If there's some way you want this to work, I'd appreciate a clear
> indication of it: the message you mentioned was answered, and I never
> heard any objection, or anything else really.
>

We could use tags in the MAINTAINERS file similar to these:

F: include/linux/*fence.h
F: include/linux/dma-buf.h
F: include/linux/dma-resv.h
K: \bdma_(?:buf|fence|resv)\b

We could make sure anything touching io_uring/zcrx.* and anything using
netmem_ref/net_iov goes to netdev. Roughly, how about adding something
like this to the general networking entry?

F: io_uring/zcrx.*
K: \bnet(mem_ref|_iov)\b

I had suggested this before but never had time to propose the actual
changes; in the back of my mind I was a bit wary of spamming the
maintainers, but it seems that is less of a concern than the patches
not getting to netdev.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-20 13:59     ` Mina Almasry
@ 2025-08-21  1:26       ` Jakub Kicinski
  0 siblings, 0 replies; 58+ messages in thread
From: Jakub Kicinski @ 2025-08-21  1:26 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Pavel Begunkov, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Wed, 20 Aug 2025 06:59:51 -0700 Mina Almasry wrote:
> We could make sure anything touching io_uring/zcrx.* and anything using
> netmem_ref/net_iov goes to netdev. Roughly, how about adding something
> like this to the general networking entry?
> 
> F: io_uring/zcrx.*
> K: \bnet(mem_ref|_iov)\b

Right, I think the clearest would be to add a new entry for this and
copy the real metadata (Jens as the maintainer, his tree, etc.). If we
just add the match to netdev, it will look like the patches flow via
net-next. No strong preference, though. As long as get_maintainer
suggests CCing netdev I'll be happy.
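
For concreteness, a sketch of what such a standalone entry could look
like; the entry name, status, and the extra netdev list are illustrative
assumptions, and the real entry would also copy Jens's T: tree line:

	IOURING ZC RX (ZCRX)
	M:	Jens Axboe <axboe@kernel.dk>
	L:	io-uring@vger.kernel.org
	L:	netdev@vger.kernel.org
	S:	Maintained
	F:	io_uring/zcrx.*
	K:	\bnet(mem_ref|_iov)\b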

> I had suggested this before but never had time to propose the actual
> changes; in the back of my mind I was a bit wary of spamming the
> maintainers, but it seems that is less of a concern than the patches
> not getting to netdev.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-20 13:39   ` Pavel Begunkov
  2025-08-20 13:59     ` Mina Almasry
@ 2025-08-21  1:37     ` Jakub Kicinski
  2025-08-21 15:04       ` Pavel Begunkov
  1 sibling, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2025-08-21  1:37 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On Wed, 20 Aug 2025 14:39:51 +0100 Pavel Begunkov wrote:
> On 8/20/25 03:31, Jakub Kicinski wrote:
> > On Mon, 18 Aug 2025 14:57:16 +0100 Pavel Begunkov wrote:  
> >> Jakub Kicinski (20):  
> > 
> > I think we need to revisit how we operate.
> > When we started the ZC work w/ io-uring I suggested a permanent shared
> > branch. That's perhaps an overkill. What I did not expect is that you
> > will not even CC netdev@ on changes to io_uring/zcrx.*
> > 
> > I don't mean to assert any sort of ownership of that code, but you're
> > not meeting basic collaboration standards for the kernel. This needs
> > to change first.  
> 
> You're throwing quite allegations. Basic collaboration standards don't
> include spamming people with unrelated changes via an already busy list.
> I cc'ed netdev on patches that meaningfully change how it interacts
> (incl indirectly) with netdev and/or might be of interest, which is
> beyond of the usual standard expected of a project using infrastructure
> provided by a subsystem.

To me io_uring is a fancy syscall layer. It's good at its job, sure,
but saying that netdev provides infrastructure to a syscall layer is
laughable.

> There are pieces that don't touch netdev, like
> how io_uring pins pages, accounts memory, sets up rings, etc. In the
> very same way generic io_uring patches are not normally posted to
> netdev, and netdev patches are not redirected to mm because there
> are kmalloc calls, even though, it's not even the standard used here.

I'm asking you to CC netdev, and people who work on ZC, like Mina.
The normal reaction to someone asking to be CCed on patches is "Sure."
I don't understand what you're afraid of.

> If you have some way you want to work, I'd appreciate a clear
> indication of that, because that message you mentioned was answered
> and I've never heard any objection, or anything else really.

It honestly didn't cross my mind that you'd only CC netdev on patches
that touch code under net/. I'd have let you know sooner, but it's hard
to reply to messages one doesn't see. I found out from talking to David
Wei that there's a whole bunch of ZC work that landed in io_uring.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers
  2025-08-21  1:37     ` Jakub Kicinski
@ 2025-08-21 15:04       ` Pavel Begunkov
  0 siblings, 0 replies; 58+ messages in thread
From: Pavel Begunkov @ 2025-08-21 15:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Eric Dumazet, Willem de Bruijn, Paolo Abeni,
	andrew+netdev, horms, davem, sdf, almasrymina, dw, michael.chan,
	dtatulea, ap420073, linux-kernel, io-uring

On 8/21/25 02:37, Jakub Kicinski wrote:
> On Wed, 20 Aug 2025 14:39:51 +0100 Pavel Begunkov wrote:
>> On 8/20/25 03:31, Jakub Kicinski wrote:
>>> On Mon, 18 Aug 2025 14:57:16 +0100 Pavel Begunkov wrote:
>>>> Jakub Kicinski (20):
>>>
>>> I think we need to revisit how we operate.
>>> When we started the ZC work w/ io-uring I suggested a permanent shared
>>> branch. That's perhaps an overkill. What I did not expect is that you
>>> will not even CC netdev@ on changes to io_uring/zcrx.*
>>>
>>> I don't mean to assert any sort of ownership of that code, but you're
>>> not meeting basic collaboration standards for the kernel. This needs
>>> to change first.
>>
>> You're throwing quite the allegations. Basic collaboration standards don't
>> include spamming people with unrelated changes via an already busy list.
>> I cc'ed netdev on patches that meaningfully change how io_uring interacts
>> (incl. indirectly) with netdev and/or might be of interest, which is
>> beyond the usual standard expected of a project using infrastructure
>> provided by a subsystem.
> 
> To me io_uring is a fancy syscall layer. It's good at its job, sure,
> but saying that netdev provides infrastructure to a syscall layer is
> laughable.

?

>> There are pieces that don't touch netdev at all, like
>> how io_uring pins pages, accounts memory, sets up rings, etc. In the
>> very same way, generic io_uring patches are not normally posted to
>> netdev, and netdev patches are not redirected to mm just because they
>> contain kmalloc calls, even though that's not even the standard applied here.
> 
> I'm asking you to CC netdev, and people who work on ZC, like Mina.
> The normal reaction to someone asking to be CCed on patches is "Sure."
> I don't understand what you're afraid of.

The normal reaction is to ask to be CCed, not to attempt slander, as
you just did. That's not appreciated. All that, cherry-topped with a
signal that you're not going to take my work until I learn how to
read your mind.

https://lore.kernel.org/all/bcf5a9e8-5014-44cc-85a0-2974e3039cb6@gmail.com/

When you brought this topic up before, I fully outlined what I believe
would be a good workflow, and since there was no answer, I've been
sticking to it. And let me note, you didn't directly and clearly
ask me to CC netdev. I'm pretty sure ignoring messages and
smearing are not in the spirit of the "basic collaboration standards",
whatever those are.

>> If there's some way you want this to work, I'd appreciate a clear
>> indication of it: the message you mentioned was answered, and I never
>> heard any objection, or anything else really.
> 
> It honestly didn't cross my mind that you'd only CC netdev on patches
> that touch code under net/. I'd have let you know sooner, but it's hard

If you refer to the directory, that's clearly not true.

> to reply to messages one doesn't see. I found out from talking to David
> Wei that there's a whole bunch of ZC work that landed in io_uring.

The linked thread above indicates the opposite.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size
  2025-08-20 13:51     ` Pavel Begunkov
@ 2025-08-25  6:09       ` Somnath Kotur
  0 siblings, 0 replies; 58+ messages in thread
From: Somnath Kotur @ 2025-08-25  6:09 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Mina Almasry, Jakub Kicinski, netdev, Eric Dumazet,
	Willem de Bruijn, Paolo Abeni, andrew+netdev, horms, davem, sdf,
	dw, michael.chan, dtatulea, ap420073, linux-kernel, io-uring

On Wed, Aug 20, 2025 at 7:23 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/19/25 20:43, Mina Almasry wrote:
> > On Mon, Aug 18, 2025 at 6:56 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> From: Jakub Kicinski <kuba@kernel.org>
> >>
> >> If the user decides to increase the buffer size for the agg ring,
> >> we need to ask the page pool for higher-order pages.
> >> There is no need to use larger pages for header frags,
> >> so if the user increases the size of agg ring buffers, switch
> >> to a separate header page automatically.
> >>
> >> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> >> [pavel: calculate adjust max_len]
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>   drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++++-
> >>   1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> index 5307b33ea1c7..d3d9b72ef313 100644
> >> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> @@ -3824,11 +3824,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >>          pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
> >>          if (BNXT_RX_PAGE_MODE(bp))
> >>                  pp.pool_size += bp->rx_ring_size / rx_size_fac;
> >> +
> >> +       pp.order = get_order(bp->rx_page_size);
> >>          pp.nid = numa_node;
> >>          pp.netdev = bp->dev;
> >>          pp.dev = &bp->pdev->dev;
> >>          pp.dma_dir = bp->rx_dir;
> >> -       pp.max_len = PAGE_SIZE;
> >> +       pp.max_len = PAGE_SIZE << pp.order;
> >
> > nit: I assume this could be `pp.max_len = bp->rx_page_size;` if you
> > wanted, since bnxt is not actually using the full compound page in the
> > case that bp->rx_page_size is not a power of 2. Though it doesn't
> > matter much either way:
>
> Yeah, I thought it's cleaner to derive it from the order in case
> something about rx_page_size changes again, since it was already
> overlooked once, and it's pow2 anyway.
>
> --
> Pavel Begunkov
>
>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue
  2025-08-18 13:57 ` [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue Pavel Begunkov
@ 2025-08-25  6:24   ` Somnath Kotur
  0 siblings, 0 replies; 58+ messages in thread
From: Somnath Kotur @ 2025-08-25  6:24 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jakub Kicinski, netdev, Eric Dumazet, Willem de Bruijn,
	Paolo Abeni, andrew+netdev, horms, davem, sdf, almasrymina, dw,
	michael.chan, dtatulea, ap420073, linux-kernel, io-uring

On Mon, Aug 18, 2025 at 7:45 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> In normal operation, only a subset of queues is configured for
> zero-copy. Since zero-copy is the main use case for larger buffer
> sizes, we need to configure the sizes per queue.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 46 ++++++++++---------
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  6 +--
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |  2 +-
>  4 files changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 467e8a0745e1..50f663777843 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -900,7 +900,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
>
>  static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
>  {
> -       return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
> +       return rxr->need_head_pool || PAGE_SIZE > rxr->rx_page_size;
>  }
>
>  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> @@ -910,9 +910,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
>  {
>         struct page *page;
>
> -       if (PAGE_SIZE > bp->rx_page_size) {
> +       if (PAGE_SIZE > rxr->rx_page_size) {
>                 page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
> -                                               bp->rx_page_size);
> +                                               rxr->rx_page_size);
>         } else {
>                 page = page_pool_dev_alloc_pages(rxr->page_pool);
>                 *offset = 0;
> @@ -1150,9 +1150,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
>                 return NULL;
>         }
>         dma_addr -= bp->rx_dma_offset;
> -       dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
> +       dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
>                                 bp->rx_dir);
> -       skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
> +       skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
>         if (!skb) {
>                 page_pool_recycle_direct(rxr->page_pool, page);
>                 return NULL;
> @@ -1184,7 +1184,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
>                 return NULL;
>         }
>         dma_addr -= bp->rx_dma_offset;
> -       dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
> +       dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
>                                 bp->rx_dir);
>
>         if (unlikely(!payload))
> @@ -1198,7 +1198,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
>
>         skb_mark_for_recycle(skb);
>         off = (void *)data_ptr - page_address(page);
> -       skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
> +       skb_add_rx_frag(skb, 0, page, off, len, rxr->rx_page_size);
>         memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
>                payload + NET_IP_ALIGN);
>
> @@ -1283,7 +1283,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
>                 if (skb) {
>                         skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
>                                                cons_rx_buf->offset,
> -                                              frag_len, bp->rx_page_size);
> +                                              frag_len, rxr->rx_page_size);
>                 } else {
>                         skb_frag_t *frag = &shinfo->frags[i];
>
> @@ -1308,7 +1308,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
>                         if (skb) {
>                                 skb->len -= frag_len;
>                                 skb->data_len -= frag_len;
> -                               skb->truesize -= bp->rx_page_size;
> +                               skb->truesize -= rxr->rx_page_size;
>                         }
>
>                         --shinfo->nr_frags;
> @@ -1323,7 +1323,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
>                 }
>
>                 page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
> -                                                 bp->rx_page_size);
> +                                                 rxr->rx_page_size);
>
>                 total_frag_len += frag_len;
>                 prod = NEXT_RX_AGG(prod);
> @@ -2276,8 +2276,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
>                         if (!skb)
>                                 goto oom_next_rx;
>                 } else {
> -                       skb = bnxt_xdp_build_skb(bp, skb, agg_bufs,
> -                                                rxr->page_pool, &xdp);
> +                       skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr, &xdp);
>                         if (!skb) {
>                                 /* we should be able to free the old skb here */
>                                 bnxt_xdp_buff_frags_free(rxr, &xdp);
> @@ -3825,7 +3824,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>         if (BNXT_RX_PAGE_MODE(bp))
>                 pp.pool_size += bp->rx_ring_size / rx_size_fac;
>
> -       pp.order = get_order(bp->rx_page_size);
> +       pp.order = get_order(rxr->rx_page_size);
>         pp.nid = numa_node;
>         pp.netdev = bp->dev;
>         pp.dev = &bp->pdev->dev;
> @@ -4318,6 +4317,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
>                 if (!rxr)
>                         goto skip_rx;
>
> +               rxr->rx_page_size = bp->rx_page_size;
> +
>                 ring = &rxr->rx_ring_struct;
>                 rmem = &ring->ring_mem;
>                 rmem->nr_pages = bp->rx_nr_pages;
> @@ -4477,7 +4478,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
>         ring = &rxr->rx_agg_ring_struct;
>         ring->fw_ring_id = INVALID_HW_RING_ID;
>         if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
> -               type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
> +               type = ((u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
>                         RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
>
>                 bnxt_init_rxbd_pages(ring, type);
> @@ -7042,6 +7043,7 @@ static void bnxt_hwrm_ring_grp_free(struct bnxt *bp)
>
>  static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
>                                        struct hwrm_ring_alloc_input *req,
> +                                      struct bnxt_rx_ring_info *rxr,
>                                        struct bnxt_ring_struct *ring)
>  {
>         struct bnxt_ring_grp_info *grp_info = &bp->grp_info[ring->grp_idx];
> @@ -7051,7 +7053,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
>         if (ring_type == HWRM_RING_ALLOC_AGG) {
>                 req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
>                 req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
> -               req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
> +               req->rx_buf_size = cpu_to_le16(rxr->rx_page_size);
>                 enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
>         } else {
>                 req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
> @@ -7065,6 +7067,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
>  }
>
>  static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
> +                                   struct bnxt_rx_ring_info *rxr,
>                                     struct bnxt_ring_struct *ring,
>                                     u32 ring_type, u32 map_index)
>  {
> @@ -7121,7 +7124,8 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
>                               cpu_to_le32(bp->rx_ring_mask + 1) :
>                               cpu_to_le32(bp->rx_agg_ring_mask + 1);
>                 if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
> -                       bnxt_set_rx_ring_params_p5(bp, ring_type, req, ring);
> +                       bnxt_set_rx_ring_params_p5(bp, ring_type, req,
> +                                                  rxr, ring);
>                 break;
>         case HWRM_RING_ALLOC_CMPL:
>                 req->ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
> @@ -7269,7 +7273,7 @@ static int bnxt_hwrm_rx_ring_alloc(struct bnxt *bp,
>         u32 map_idx = bnapi->index;
>         int rc;
>
> -       rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
> +       rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
>         if (rc)
>                 return rc;
>
> @@ -7289,7 +7293,7 @@ static int bnxt_hwrm_rx_agg_ring_alloc(struct bnxt *bp,
>         int rc;
>
>         map_idx = grp_idx + bp->rx_nr_rings;
> -       rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
> +       rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
>         if (rc)
>                 return rc;
>
> @@ -7313,7 +7317,7 @@ static int bnxt_hwrm_cp_ring_alloc_p5(struct bnxt *bp,
>
>         ring = &cpr->cp_ring_struct;
>         ring->handle = BNXT_SET_NQ_HDL(cpr);
> -       rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
> +       rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
>         if (rc)
>                 return rc;
>         bnxt_set_db(bp, &cpr->cp_db, type, map_idx, ring->fw_ring_id);
> @@ -7328,7 +7332,7 @@ static int bnxt_hwrm_tx_ring_alloc(struct bnxt *bp,
>         const u32 type = HWRM_RING_ALLOC_TX;
>         int rc;
>
> -       rc = hwrm_ring_alloc_send_msg(bp, ring, type, tx_idx);
> +       rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, tx_idx);
>         if (rc)
>                 return rc;
>         bnxt_set_db(bp, &txr->tx_db, type, tx_idx, ring->fw_ring_id);
> @@ -7354,7 +7358,7 @@ static int bnxt_hwrm_ring_alloc(struct bnxt *bp)
>
>                 vector = bp->irq_tbl[map_idx].vector;
>                 disable_irq_nosync(vector);
> -               rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
> +               rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
>                 if (rc) {
>                         enable_irq(vector);
>                         goto err_out;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 56aafae568f8..4f9d4c71c0e2 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
>
>         unsigned long           *rx_agg_bmap;
>         u16                     rx_agg_bmap_size;
> +       u16                     rx_page_size;
>         bool                    need_head_pool;
>
>         dma_addr_t              rx_desc_mapping[MAX_RX_PAGES];
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
> index 41d3ba56ba41..19dda0201c69 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
> @@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
>                         u16 cons, u8 *data_ptr, unsigned int len,
>                         struct xdp_buff *xdp)
>  {
> -       u32 buflen = bp->rx_page_size;
> +       u32 buflen = rxr->rx_page_size;
>         struct bnxt_sw_rx_bd *rx_buf;
>         struct pci_dev *pdev;
>         dma_addr_t mapping;
> @@ -461,7 +461,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>
>  struct sk_buff *
>  bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
> -                  struct page_pool *pool, struct xdp_buff *xdp)
> +                  struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp)
>  {
>         struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>
> @@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
>
>         xdp_update_skb_shared_info(skb, num_frags,
>                                    sinfo->xdp_frags_size,
> -                                  bp->rx_page_size * num_frags,
> +                                  rxr->rx_page_size * num_frags,
>                                    xdp_buff_is_frag_pfmemalloc(xdp));
>         return skb;
>  }
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
> index 220285e190fc..8933a0dec09a 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
> @@ -32,6 +32,6 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
>  void bnxt_xdp_buff_frags_free(struct bnxt_rx_ring_info *rxr,
>                               struct xdp_buff *xdp);
>  struct sk_buff *bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb,
> -                                  u8 num_frags, struct page_pool *pool,
> +                                  u8 num_frags, struct bnxt_rx_ring_info *rxr,
>                                    struct xdp_buff *xdp);
>  #endif
> --
> 2.49.0
>
>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
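
Condensed, the per-queue pattern the diff applies throughout (all three
pieces are taken from the hunks above):

	/* each RX ring carries its own buffer size ... */
	struct bnxt_rx_ring_info {
		/* ... */
		u16	rx_page_size;
	};

	/* ... seeded from the device-wide default at ring init ... */
	rxr->rx_page_size = bp->rx_page_size;

	/* ... and per-ring paths read the ring's copy, e.g.: */
	pp.order = get_order(rxr->rx_page_size);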

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread (newest message: 2025-08-25  6:24 UTC)

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-18 13:57 [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 01/23] net: page_pool: sanitise allocation order Pavel Begunkov
2025-08-18 23:33   ` Mina Almasry
2025-08-19 15:53     ` Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 02/23] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-08-18 23:50   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 03/23] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-08-19  0:00   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 04/23] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-08-19  0:07   ` Mina Almasry
2025-08-19 15:52     ` Pavel Begunkov
2025-08-19 19:27       ` Mina Almasry
2025-08-20 11:53         ` Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 05/23] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-08-19  1:46   ` Mina Almasry
2025-08-20 12:04     ` Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 06/23] net: add rx_buf_len to netdev config Pavel Begunkov
2025-08-19 19:32   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 07/23] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-08-19 19:37   ` Mina Almasry
2025-08-20 13:43     ` Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 08/23] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-08-19 19:43   ` Mina Almasry
2025-08-20 13:51     ` Pavel Begunkov
2025-08-25  6:09       ` Somnath Kotur
2025-08-18 13:57 ` [PATCH net-next v3 09/23] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-08-19 20:10   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 10/23] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-08-19 20:15   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 11/23] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 12/23] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-08-19 21:29   ` Mina Almasry
2025-08-20  1:32     ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 13/23] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-08-19 21:30   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 14/23] net: add queue config validation callback Pavel Begunkov
2025-08-19 21:54   ` Mina Almasry
2025-08-20  1:31     ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 15/23] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-08-19 21:57   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 16/23] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-08-25  6:24   ` Somnath Kotur
2025-08-18 13:57 ` [PATCH net-next v3 17/23] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 18/23] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-08-19 22:36   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 19/23] net: wipe the setting of deactived queues Pavel Begunkov
2025-08-19 22:49   ` Mina Almasry
2025-08-18 13:57 ` [PATCH net-next v3 20/23] eth: bnxt: use queue op config validate Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 21/23] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 22/23] net: let pp memory provider to specify rx buf len Pavel Begunkov
2025-08-18 13:57 ` [PATCH net-next v3 23/23] net: validate driver supports passed qcfg params Pavel Begunkov
2025-08-18 13:59 ` [PATCH net-next v3 00/23][pull request] Queue configs and large buffer providers Pavel Begunkov
2025-08-20  2:31 ` Jakub Kicinski
2025-08-20 13:39   ` Pavel Begunkov
2025-08-20 13:59     ` Mina Almasry
2025-08-21  1:26       ` Jakub Kicinski
2025-08-21  1:37     ` Jakub Kicinski
2025-08-21 15:04       ` Pavel Begunkov
