* [RFC net-next 00/22] net: per-queue rx-buf-len configuration
@ 2025-04-21 22:28 Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
` (22 more replies)
0 siblings, 23 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Add support for per-queue rx-buf-len configuration.
I'm sending this as RFC because I'd like to ponder the uAPI side
a little longer but it's good enough for people to work on
the memory provider side and support in other drivers.
The direct motivation for the series is that zero-copy Rx queues would
like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
and can coalesce payloads into pages much larger than the MTU.
Enabling larger buffers globally is a bit precarious as it exposes us
to potentially very inefficient memory use. Also allocating large
buffers may not be easy or cheap under load. Zero-copy queues service
only select traffic and have pre-allocated memory so the concerns don't
apply as much.
The per-queue config has to address 3 problems:
- user API
- driver API
- memory provider API
For the user API the main question is whether we expose the config via
ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
than extending the ethtool RINGS_GET API. I worry slightly that queue
GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
setting we have in ethtool which does not go via RINGS_SET is IRQ
coalescing.
My goal for the driver API was to avoid complexity in the drivers.
The queue management API has gained two ops, responsible for preparing
configuration for a given queue, and validating whether the config
is supported. The validating is used both for NIC-wide and per-queue
changes. Queue alloc/start ops have a new "config" argument which
contains the current config for a given queue (we use queue restart
to apply per-queue settings). Outside of queue reset paths drivers
can call netdev_queue_config() which returns the config for an arbitrary
queue. Long story short I anticipate it to be used during ndo_open.
In the core I extended struct netdev_config with per-queue settings.
All in all this isn't too far from what was there in my "queue API
prototype" a few years ago. One thing I was hoping to support but
haven't gotten to is providing the settings at the RSS context level.
Zero-copy users often depend on RSS for load spreading. It'd be more
convenient for them to provide the settings per RSS context.
Perhaps we'd be better off converting the QUEUE_SET netlink op to
CONFIG_SET and accepting multiple "scopes" (queue, RSS context)?
The memory provider API is a bit tricky. Initially I wasn't sure whether
the buffer size should be an MP attribute or a device attribute.
IOW whether it's the device that should be telling the MP what page
size it wants, or the MP telling the device what page size it has.
In some ways the latter is more flexible, but the implementation
gets hairy rather quickly. Drivers expect to know their parameters
early in the init process, page pools are allocated relatively late.
Jakub Kicinski (22):
docs: ethtool: document that rx_buf_len must control payload lengths
net: ethtool: report max value for rx-buf-len
net: use zero value to restore rx_buf_len to default
net: clarify the meaning of netdev_config members
net: add rx_buf_len to netdev config
eth: bnxt: read the page size from the adapter struct
eth: bnxt: set page pool page order based on rx_page_size
eth: bnxt: support setting size of agg buffers via ethtool
net: move netdev_config manipulation to dedicated helpers
net: reduce indent of struct netdev_queue_mgmt_ops members
net: allocate per-queue config structs and pass them thru the queue
API
net: pass extack to netdev_rx_queue_restart()
net: add queue config validation callback
eth: bnxt: always set the queue mgmt ops
eth: bnxt: store the rx buf size per queue
eth: bnxt: adjust the fill level of agg queues with larger buffers
netdev: add support for setting rx-buf-len per queue
net: wipe the setting of deactivated queues
eth: bnxt: use queue op config validate
eth: bnxt: support per queue configuration of rx-buf-len
selftests: drv-net: add helper/wrapper for bpftrace
selftests: drv-net: add test for rx-buf-len
Documentation/netlink/specs/ethtool.yaml | 4 +
Documentation/netlink/specs/netdev.yaml | 15 +
Documentation/networking/ethtool-netlink.rst | 7 +-
net/core/Makefile | 2 +-
.../testing/selftests/drivers/net/hw/Makefile | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 5 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
include/linux/ethtool.h | 3 +
include/net/netdev_queues.h | 83 ++++-
include/net/netdev_rx_queue.h | 3 +-
include/net/netlink.h | 19 ++
.../uapi/linux/ethtool_netlink_generated.h | 1 +
include/uapi/linux/netdev.h | 2 +
net/core/dev.h | 12 +
net/core/netdev-genl-gen.h | 1 +
tools/include/uapi/linux/netdev.h | 2 +
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 135 ++++++--
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 9 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +-
drivers/net/ethernet/google/gve/gve_main.c | 9 +-
.../marvell/octeontx2/nic/otx2_ethtool.c | 6 +-
drivers/net/netdevsim/netdev.c | 8 +-
net/core/dev.c | 12 +-
net/core/netdev-genl-gen.c | 15 +
net/core/netdev-genl.c | 92 ++++++
net/core/netdev_config.c | 150 +++++++++
net/core/netdev_rx_queue.c | 24 +-
net/ethtool/common.c | 4 +-
net/ethtool/netlink.c | 14 +-
net/ethtool/rings.c | 14 +-
.../selftests/drivers/net/hw/rx_buf_len.py | 299 ++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 33 ++
32 files changed, 913 insertions(+), 79 deletions(-)
create mode 100644 net/core/netdev_config.c
create mode 100755 tools/testing/selftests/drivers/net/hw/rx_buf_len.py
--
2.49.0
^ permalink raw reply [flat|nested] 66+ messages in thread
* [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 16:19 ` David Wei
` (2 more replies)
2025-04-21 22:28 ` [RFC net-next 02/22] net: ethtool: report max value for rx-buf-len Jakub Kicinski
` (21 subsequent siblings)
22 siblings, 3 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Document the semantics of the rx_buf_len ethtool ring param.
Clarify its meaning in the case of HDS, where the driver may have
two separate buffer pools.
The various zero-copy TCP Rx schemes we have suffer from memory
management overhead. Specifically applications aren't too impressed
with the number of 4kB buffers they have to juggle. Zero-copy
TCP makes most sense with larger memory transfers so using
16kB or 32kB buffers (with the help of HW-GRO) feels more
natural.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Documentation/networking/ethtool-netlink.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b6e9af4d0f1b..eaa9c17a3cb1 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
driver. Driver may impose additional constraints and may not support all
attributes.
-
``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
Completion queue events (CQE) are the events posted by NIC to indicate the
completion status of a packet when the packet is sent (like send success or
@@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
header / data split feature. If a received packet size is larger than this
threshold value, header and data will be split.
+``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks the
+driver uses to receive packets. If the device uses different memory pools
+for headers and payload this setting may control the size of the header
+buffers but must control the size of the payload buffers.
+
CHANNELS_GET
============
--
2.49.0
* [RFC net-next 02/22] net: ethtool: report max value for rx-buf-len
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 03/22] net: use zero value to restore rx_buf_len to default Jakub Kicinski
` (20 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Unlike most of our APIs the rx-buf-len param does not have an associated
max value. In theory the user could set this value pretty high, but in
practice most NICs have limits due to the width of the length fields
in the descriptors.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Documentation/netlink/specs/ethtool.yaml | 4 ++++
Documentation/networking/ethtool-netlink.rst | 1 +
include/linux/ethtool.h | 2 ++
include/uapi/linux/ethtool_netlink_generated.h | 1 +
drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++-
net/ethtool/rings.c | 5 +++++
6 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 655d8d10fe24..3c09bfc206e1 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -338,6 +338,9 @@ uapi-header: linux/ethtool_netlink_generated.h
-
name: hds-thresh-max
type: u32
+ -
+ name: rx-buf-len-max
+ type: u32
-
name: mm-stat
@@ -1781,6 +1784,7 @@ uapi-header: linux/ethtool_netlink_generated.h
- rx-jumbo
- tx
- rx-buf-len
+ - rx-buf-len-max
- tcp-data-split
- cqe-size
- tx-push
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index eaa9c17a3cb1..b7a99dfdffa9 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -893,6 +893,7 @@ Gets ring sizes like ``ETHTOOL_GRINGPARAM`` ioctl request.
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN_MAX`` u32 max size of rx buffers
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 7edb5f5e7134..1f61f03f354e 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -72,6 +72,7 @@ enum {
/**
* struct kernel_ethtool_ringparam - RX/TX ring configuration
* @rx_buf_len: Current length of buffers on the rx ring.
+ * @rx_buf_len_max: Max length of buffers on the rx ring.
* @tcp_data_split: Scatter packet headers and data to separate buffers
* @tx_push: The flag of tx push mode
* @rx_push: The flag of rx push mode
@@ -84,6 +85,7 @@ enum {
*/
struct kernel_ethtool_ringparam {
u32 rx_buf_len;
+ u32 rx_buf_len_max;
u8 tcp_data_split;
u8 tx_push;
u8 rx_push;
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index fe24c3459ac0..5a6393c0975f 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -157,6 +157,7 @@ enum {
ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
ETHTOOL_A_RINGS_HDS_THRESH,
ETHTOOL_A_RINGS_HDS_THRESH_MAX,
+ ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
__ETHTOOL_A_RINGS_CNT,
ETHTOOL_A_RINGS_MAX = (__ETHTOOL_A_RINGS_CNT - 1)
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 010385b29988..2466fe04b642 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -376,6 +376,7 @@ static void otx2_get_ringparam(struct net_device *netdev,
ring->tx_max_pending = Q_COUNT(Q_SIZE_MAX);
ring->tx_pending = qs->sqe_cnt ? qs->sqe_cnt : Q_COUNT(Q_SIZE_4K);
kernel_ring->rx_buf_len = pfvf->hw.rbuf_len;
+ kernel_ring->rx_buf_len_max = 32768;
kernel_ring->cqe_size = pfvf->hw.xqe_size;
}
@@ -398,7 +399,7 @@ static int otx2_set_ringparam(struct net_device *netdev,
/* Hardware supports max size of 32k for a receive buffer
* and 1536 is typical ethernet frame size.
*/
- if (rx_buf_len && (rx_buf_len < 1536 || rx_buf_len > 32768)) {
+ if (rx_buf_len && (rx_buf_len < 1536)) {
netdev_err(netdev,
"Receive buffer range is 1536 - 32768");
return -EINVAL;
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index aeedd5ec6b8c..5e872ceab5dd 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -105,6 +105,9 @@ static int rings_fill_reply(struct sk_buff *skb,
ringparam->tx_pending))) ||
(kr->rx_buf_len &&
(nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN, kr->rx_buf_len))) ||
+ (kr->rx_buf_len_max &&
+ (nla_put_u32(skb, ETHTOOL_A_RINGS_RX_BUF_LEN_MAX,
+ kr->rx_buf_len_max))) ||
(kr->tcp_data_split &&
(nla_put_u8(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT,
kr->tcp_data_split))) ||
@@ -281,6 +284,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
err_attr = tb[ETHTOOL_A_RINGS_TX];
else if (kernel_ringparam.hds_thresh > kernel_ringparam.hds_thresh_max)
err_attr = tb[ETHTOOL_A_RINGS_HDS_THRESH];
+ else if (kernel_ringparam.rx_buf_len > kernel_ringparam.rx_buf_len_max)
+ err_attr = tb[ETHTOOL_A_RINGS_RX_BUF_LEN];
else
err_attr = NULL;
if (err_attr) {
--
2.49.0
* [RFC net-next 03/22] net: use zero value to restore rx_buf_len to default
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 02/22] net: ethtool: report max value for rx-buf-len Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 04/22] net: clarify the meaning of netdev_config members Jakub Kicinski
` (19 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Distinguish between rx_buf_len being driver default vs user config.
Use 0 as a special value meaning "unset" or "restore driver default".
This will be necessary later on to configure it per-queue, but
the ability to restore defaults may be useful in itself.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Documentation/networking/ethtool-netlink.rst | 2 +-
include/linux/ethtool.h | 1 +
drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 +++
net/ethtool/rings.c | 2 +-
4 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b7a99dfdffa9..723f8e1a33a7 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -974,7 +974,7 @@ threshold value, header and data will be split.
``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks the
driver uses to receive packets. If the device uses different memory pools
for headers and payload this setting may control the size of the header
-buffers but must control the size of the payload buffers.
+buffers but must control the size of the payload buffers. Setting to 0 restores the driver default.
CHANNELS_GET
============
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 1f61f03f354e..84cfa7f4fce0 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -72,6 +72,7 @@ enum {
/**
* struct kernel_ethtool_ringparam - RX/TX ring configuration
* @rx_buf_len: Current length of buffers on the rx ring.
+ * Setting to 0 means reset to driver default.
* @rx_buf_len_max: Max length of buffers on the rx ring.
* @tcp_data_split: Scatter packet headers and data to separate buffers
* @tx_push: The flag of tx push mode
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
index 2466fe04b642..c294ddb2055f 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c
@@ -396,6 +396,9 @@ static int otx2_set_ringparam(struct net_device *netdev,
if (ring->rx_mini_pending || ring->rx_jumbo_pending)
return -EINVAL;
+ if (!rx_buf_len)
+ rx_buf_len = OTX2_DEFAULT_RBUF_LEN;
+
/* Hardware supports max size of 32k for a receive buffer
* and 1536 is typical ethernet frame size.
*/
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 5e872ceab5dd..628546a1827b 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -139,7 +139,7 @@ const struct nla_policy ethnl_rings_set_policy[] = {
[ETHTOOL_A_RINGS_RX_MINI] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_RX_JUMBO] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_TX] = { .type = NLA_U32 },
- [ETHTOOL_A_RINGS_RX_BUF_LEN] = NLA_POLICY_MIN(NLA_U32, 1),
+ [ETHTOOL_A_RINGS_RX_BUF_LEN] = { .type = NLA_U32 },
[ETHTOOL_A_RINGS_TCP_DATA_SPLIT] =
NLA_POLICY_MAX(NLA_U8, ETHTOOL_TCP_DATA_SPLIT_ENABLED),
[ETHTOOL_A_RINGS_CQE_SIZE] = NLA_POLICY_MIN(NLA_U32, 1),
--
2.49.0
* [RFC net-next 04/22] net: clarify the meaning of netdev_config members
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (2 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 03/22] net: use zero value to restore rx_buf_len to default Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 19:57 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 05/22] net: add rx_buf_len to netdev config Jakub Kicinski
` (18 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
hds_thresh and hds_config are both inside struct netdev_config
but have quite different semantics. hds_config is the user config
with ternary semantics (on/off/unset). hds_thresh is a straight
up value, populated by the driver at init and only modified by
user space. We don't expect the drivers to have to pick a special
hds_thresh value based on other configuration.
The two approaches have different advantages and downsides.
hds_thresh ("direct value") gives core easy access to current
device settings, but there's no way to express whether the value
comes from the user. It also requires the initialization by
the driver.
hds_config ("user config values") tells us what user wanted, but
doesn't give us the current value in the core.
Try to explain this a bit in the comments, so that we make a conscious
choice about which semantics we expect for new values.
Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
returns a hds_thresh value always as 0.") added the setting for the
benefit of netdevsim which doesn't touch the value at all on get.
Again, this is just to clarify the intention; it shouldn't cause any
functional change.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_queues.h | 19 +++++++++++++++++--
net/ethtool/common.c | 3 ++-
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index ea709b59d827..f4eab6fc64f4 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -6,11 +6,26 @@
/**
* struct netdev_config - queue-related configuration for a netdev
- * @hds_thresh: HDS Threshold value.
- * @hds_config: HDS value from userspace.
*/
struct netdev_config {
+ /* Direct value
+ *
+ * Driver default is expected to be fixed, and set in this struct
+ * at init. From that point on user may change the value. There is
+ * no explicit way to "unset" / restore driver default.
+ */
+ /** @hds_thresh: HDS Threshold value (ETHTOOL_A_RINGS_HDS_THRESH).
+ */
u32 hds_thresh;
+
+ /* User config values
+ *
+ * Contain user configuration. If "set" driver must obey.
+ * If "unset" driver is free to decide, and may change its choice
+ * as other parameters change.
+ */
+ /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
+ */
u8 hds_config;
};
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 49bea6b45bd5..502392395d45 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -825,12 +825,13 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
memset(param, 0, sizeof(*param));
memset(kparam, 0, sizeof(*kparam));
+ kparam->hds_thresh = dev->cfg->hds_thresh;
+
param->cmd = ETHTOOL_GRINGPARAM;
dev->ethtool_ops->get_ringparam(dev, param, kparam, extack);
/* Driver gives us current state, we want to return current config */
kparam->tcp_data_split = dev->cfg->hds_config;
- kparam->hds_thresh = dev->cfg->hds_thresh;
}
static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
--
2.49.0
* [RFC net-next 05/22] net: add rx_buf_len to netdev config
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (3 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 04/22] net: clarify the meaning of netdev_config members Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct Jakub Kicinski
` (17 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Add rx_buf_len to configuration maintained by the core.
Use "three-state" semantics where 0 means "driver default".
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_queues.h | 4 ++++
net/ethtool/common.c | 1 +
net/ethtool/rings.c | 2 ++
3 files changed, 7 insertions(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index f4eab6fc64f4..a9ee147dd914 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -24,6 +24,10 @@ struct netdev_config {
* If "unset" driver is free to decide, and may change its choice
* as other parameters change.
*/
+ /** @rx_buf_len: Size of buffers on the Rx ring
+ * (ETHTOOL_A_RINGS_RX_BUF_LEN).
+ */
+ u32 rx_buf_len;
/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
*/
u8 hds_config;
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 502392395d45..def6cb4270c2 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -832,6 +832,7 @@ void ethtool_ringparam_get_cfg(struct net_device *dev,
/* Driver gives us current state, we want to return current config */
kparam->tcp_data_split = dev->cfg->hds_config;
+ kparam->rx_buf_len = dev->cfg->rx_buf_len;
}
static void ethtool_init_tsinfo(struct kernel_ethtool_ts_info *info)
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 628546a1827b..6a74e7e4064e 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -41,6 +41,7 @@ static int rings_prepare_data(const struct ethnl_req_info *req_base,
return ret;
data->kernel_ringparam.tcp_data_split = dev->cfg->hds_config;
+ data->kernel_ringparam.rx_buf_len = dev->cfg->rx_buf_len;
data->kernel_ringparam.hds_thresh = dev->cfg->hds_thresh;
dev->ethtool_ops->get_ringparam(dev, &data->ringparam,
@@ -302,6 +303,7 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
return -EINVAL;
}
+ dev->cfg_pending->rx_buf_len = kernel_ringparam.rx_buf_len;
dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
--
2.49.0
* [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (4 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 05/22] net: add rx_buf_len to netdev config Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-23 20:35 ` Mina Almasry
2025-04-21 22:28 ` [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size Jakub Kicinski
` (16 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
inside struct bnxt. This will allow configuring the page size
at runtime in subsequent patches.
The MSS size calculation for older chips continues to use the constant.
I intend to support the configuration only on more recent HW; it looks
like setting this per queue won't work on older chips, and per-queue
support is the ultimate goal.
This patch should not change the current behavior, as the value
read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 27 ++++++++++---------
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 4 +--
3 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 868a2e5a5b02..158b8f96f50c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2358,6 +2358,7 @@ struct bnxt {
u16 max_tpa;
u32 rx_buf_size;
u32 rx_buf_use_size; /* useable size */
+ u16 rx_page_size;
u16 rx_offset;
u16 rx_dma_offset;
enum dma_data_direction rx_dir;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 3d9205b9a8c3..b611a5ff6d3c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -895,7 +895,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
{
- return rxr->need_head_pool || PAGE_SIZE > BNXT_RX_PAGE_SIZE;
+ return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
}
static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -905,9 +905,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
{
struct page *page;
- if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
+ if (PAGE_SIZE > bp->rx_page_size) {
page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
- BNXT_RX_PAGE_SIZE);
+ bp->rx_page_size);
} else {
page = page_pool_dev_alloc_pages(rxr->page_pool);
*offset = 0;
@@ -1139,9 +1139,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
bp->rx_dir);
- skb = napi_build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
+ skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
if (!skb) {
page_pool_recycle_direct(rxr->page_pool, page);
return NULL;
@@ -1173,7 +1173,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
bp->rx_dir);
if (unlikely(!payload))
@@ -1187,7 +1187,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
skb_mark_for_recycle(skb);
off = (void *)data_ptr - page_address(page);
- skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE);
+ skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
payload + NET_IP_ALIGN);
@@ -1272,7 +1272,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
cons_rx_buf->offset,
- frag_len, BNXT_RX_PAGE_SIZE);
+ frag_len, bp->rx_page_size);
} else {
skb_frag_t *frag = &shinfo->frags[i];
@@ -1297,7 +1297,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb->len -= frag_len;
skb->data_len -= frag_len;
- skb->truesize -= BNXT_RX_PAGE_SIZE;
+ skb->truesize -= bp->rx_page_size;
}
--shinfo->nr_frags;
@@ -1312,7 +1312,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
}
page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
- BNXT_RX_PAGE_SIZE);
+ bp->rx_page_size);
total_frag_len += frag_len;
prod = NEXT_RX_AGG(prod);
@@ -4444,7 +4444,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
ring = &rxr->rx_agg_ring_struct;
ring->fw_ring_id = INVALID_HW_RING_ID;
if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
- type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
+ type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
bnxt_init_rxbd_pages(ring, type);
@@ -4706,7 +4706,7 @@ void bnxt_set_ring_params(struct bnxt *bp)
bp->rx_agg_nr_pages = 0;
if (bp->flags & BNXT_FLAG_TPA || bp->flags & BNXT_FLAG_HDS)
- agg_factor = min_t(u32, 4, 65536 / BNXT_RX_PAGE_SIZE);
+ agg_factor = min_t(u32, 4, 65536 / bp->rx_page_size);
bp->flags &= ~BNXT_FLAG_JUMBO;
if (rx_space > PAGE_SIZE && !(bp->flags & BNXT_FLAG_NO_AGG_RINGS)) {
@@ -7018,7 +7018,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
if (ring_type == HWRM_RING_ALLOC_AGG) {
req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
- req->rx_buf_size = cpu_to_le16(BNXT_RX_PAGE_SIZE);
+ req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
} else {
req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -16486,6 +16486,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bp = netdev_priv(dev);
bp->board_idx = ent->driver_data;
bp->msg_enable = BNXT_DEF_MSG_ENABLE;
+ bp->rx_page_size = BNXT_RX_PAGE_SIZE;
bnxt_set_max_func_irqs(bp, max_irqs);
if (bnxt_vf_pciid(bp->board_idx))
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index e675611777b5..e035f50fd6e3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
u16 cons, u8 *data_ptr, unsigned int len,
struct xdp_buff *xdp)
{
- u32 buflen = BNXT_RX_PAGE_SIZE;
+ u32 buflen = bp->rx_page_size;
struct bnxt_sw_rx_bd *rx_buf;
struct pci_dev *pdev;
dma_addr_t mapping;
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
- BNXT_RX_PAGE_SIZE * num_frags,
+ bp->rx_page_size * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp));
return skb;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (5 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 15:32 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool Jakub Kicinski
` (15 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
If the user decides to increase the buffer size for the agg ring
we need to ask the page pool for higher-order pages.
There is no need to use larger pages for header frags;
if the user increases the size of agg ring buffers, switch
to a separate header page automatically.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b611a5ff6d3c..a86bb2ba5adb 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3802,6 +3802,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
pp.pool_size = bp->rx_agg_ring_size;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
+ pp.order = get_order(bp->rx_page_size);
pp.nid = numa_node;
pp.napi = &rxr->bnapi->napi;
pp.netdev = bp->dev;
@@ -3818,7 +3819,9 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
rxr->page_pool = pool;
rxr->need_head_pool = page_pool_is_unreadable(pool);
+ rxr->need_head_pool |= !!pp.order;
if (bnxt_separate_head_pool(rxr)) {
+ pp.order = 0;
pp.pool_size = max(bp->rx_ring_size, 1024);
pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
pool = page_pool_create(&pp);
--
2.49.0
* [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (6 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-23 21:00 ` Mina Almasry
2025-04-21 22:28 ` [RFC net-next 09/22] net: move netdev_config manipulation to dedicated helpers Jakub Kicinski
` (14 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
bnxt seems to be able to aggregate data up to 32kB without any issue.
The driver is already capable of doing this for systems with
higher-order pages. While for systems with 4k pages we historically
preferred to stick to small buffers because they are easier to
allocate, the zero-copy APIs remove the allocation problem. The ZC
mem is pre-allocated and of fixed size.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 3 ++-
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 21 ++++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 158b8f96f50c..1723909bde77 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -758,7 +758,8 @@ struct nqe_cn {
#define BNXT_RX_PAGE_SHIFT PAGE_SHIFT
#endif
-#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
+#define BNXT_MAX_RX_PAGE_SIZE (1 << 15)
+#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
#define BNXT_MAX_MTU 9500
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 48dd5922e4dd..956f51449709 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -835,6 +835,8 @@ static void bnxt_get_ringparam(struct net_device *dev,
ering->rx_jumbo_pending = bp->rx_agg_ring_size;
ering->tx_pending = bp->tx_ring_size;
+ kernel_ering->rx_buf_len_max = BNXT_MAX_RX_PAGE_SIZE;
+ kernel_ering->rx_buf_len = bp->rx_page_size;
kernel_ering->hds_thresh_max = BNXT_HDS_THRESHOLD_MAX;
}
@@ -862,6 +864,21 @@ static int bnxt_set_ringparam(struct net_device *dev,
return -EINVAL;
}
+ if (!kernel_ering->rx_buf_len) /* Zero means restore default */
+ kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
+
+ if (kernel_ering->rx_buf_len != bp->rx_page_size &&
+ !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+ NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+ return -EINVAL;
+ }
+ if (!is_power_of_2(kernel_ering->rx_buf_len) ||
+ kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+ kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
+ return -ERANGE;
+ }
+
if (netif_running(dev))
bnxt_close_nic(bp, false, false);
@@ -874,6 +891,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
bp->rx_ring_size = ering->rx_pending;
bp->tx_ring_size = ering->tx_pending;
+ bp->rx_page_size = kernel_ering->rx_buf_len;
bnxt_set_ring_params(bp);
if (netif_running(dev))
@@ -5463,7 +5481,8 @@ const struct ethtool_ops bnxt_ethtool_ops = {
ETHTOOL_COALESCE_STATS_BLOCK_USECS |
ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
ETHTOOL_COALESCE_USE_CQE,
- .supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT |
+ .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN |
+ ETHTOOL_RING_USE_TCP_DATA_SPLIT |
ETHTOOL_RING_USE_HDS_THRS,
.get_link_ksettings = bnxt_get_link_ksettings,
.set_link_ksettings = bnxt_set_link_ksettings,
--
2.49.0
* [RFC net-next 09/22] net: move netdev_config manipulation to dedicated helpers
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (7 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Jakub Kicinski
` (13 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
netdev_config manipulation will become slightly more complicated
soon and we will need to call it from ethtool as well as the queue
API. Encapsulate the logic into helper functions.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/core/Makefile | 2 +-
net/core/dev.h | 5 +++++
net/core/dev.c | 7 ++-----
net/core/netdev_config.c | 43 ++++++++++++++++++++++++++++++++++++++++
net/ethtool/netlink.c | 14 ++++++-------
5 files changed, 57 insertions(+), 14 deletions(-)
create mode 100644 net/core/netdev_config.c
diff --git a/net/core/Makefile b/net/core/Makefile
index b2a76ce33932..4db487396094 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,7 +19,7 @@ obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
obj-y += net-sysfs.o
obj-y += hotdata.o
-obj-y += netdev_rx_queue.o
+obj-y += netdev_config.o netdev_rx_queue.o
obj-$(CONFIG_PAGE_POOL) += page_pool.o page_pool_user.o
obj-$(CONFIG_PROC_FS) += net-procfs.o
obj-$(CONFIG_NET_PKTGEN) += pktgen.o
diff --git a/net/core/dev.h b/net/core/dev.h
index e93f36b7ddf3..c8971c6f1fcd 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -92,6 +92,11 @@ extern struct rw_semaphore dev_addr_sem;
extern struct list_head net_todo_list;
void netdev_run_todo(void);
+int netdev_alloc_config(struct net_device *dev);
+void __netdev_free_config(struct netdev_config *cfg);
+void netdev_free_config(struct net_device *dev);
+int netdev_reconfig_start(struct net_device *dev);
+
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
struct hlist_node hlist;
diff --git a/net/core/dev.c b/net/core/dev.c
index d1a8cad0c99c..7930b57d1767 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11743,10 +11743,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
if (!dev->ethtool)
goto free_all;
- dev->cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
- if (!dev->cfg)
+ if (netdev_alloc_config(dev))
goto free_all;
- dev->cfg_pending = dev->cfg;
napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
@@ -11816,8 +11814,7 @@ void free_netdev(struct net_device *dev)
return;
}
- WARN_ON(dev->cfg != dev->cfg_pending);
- kfree(dev->cfg);
+ netdev_free_config(dev);
kfree(dev->ethtool);
netif_free_tx_queues(dev);
netif_free_rx_queues(dev);
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
new file mode 100644
index 000000000000..270b7f10a192
--- /dev/null
+++ b/net/core/netdev_config.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/netdevice.h>
+#include <net/netdev_queues.h>
+
+#include "dev.h"
+
+int netdev_alloc_config(struct net_device *dev)
+{
+ struct netdev_config *cfg;
+
+ cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg)
+ return -ENOMEM;
+
+ dev->cfg = cfg;
+ dev->cfg_pending = cfg;
+ return 0;
+}
+
+void __netdev_free_config(struct netdev_config *cfg)
+{
+ kfree(cfg);
+}
+
+void netdev_free_config(struct net_device *dev)
+{
+ WARN_ON(dev->cfg != dev->cfg_pending);
+ __netdev_free_config(dev->cfg);
+}
+
+int netdev_reconfig_start(struct net_device *dev)
+{
+ struct netdev_config *cfg;
+
+ WARN_ON(dev->cfg != dev->cfg_pending);
+ cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg)
+ return -ENOMEM;
+
+ dev->cfg_pending = cfg;
+ return 0;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 977beeaaa2f9..df26071b3842 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -6,6 +6,7 @@
#include <linux/ethtool_netlink.h>
#include <linux/phy_link_topology.h>
#include <linux/pm_runtime.h>
+#include "../core/dev.h"
#include "netlink.h"
#include "module_fw.h"
@@ -704,12 +705,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
rtnl_lock();
netdev_lock_ops(dev);
- dev->cfg_pending = kmemdup(dev->cfg, sizeof(*dev->cfg),
- GFP_KERNEL_ACCOUNT);
- if (!dev->cfg_pending) {
- ret = -ENOMEM;
- goto out_tie_cfg;
- }
+ ret = netdev_reconfig_start(dev);
+ if (ret)
+ goto out_unlock;
ret = ethnl_ops_begin(dev);
if (ret < 0)
@@ -728,9 +726,9 @@ static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
out_ops:
ethnl_ops_complete(dev);
out_free_cfg:
- kfree(dev->cfg_pending);
-out_tie_cfg:
+ __netdev_free_config(dev->cfg_pending);
dev->cfg_pending = dev->cfg;
+out_unlock:
netdev_unlock_ops(dev);
rtnl_unlock();
out_dev:
--
2.49.0
* [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (8 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 09/22] net: move netdev_config manipulation to dedicated helpers Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-23 21:04 ` Mina Almasry
2025-04-21 22:28 ` [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API Jakub Kicinski
` (12 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Trivial change: reduce the indent. I think the original was copied
from real NDOs. It's unnecessarily deep and makes passing struct
args problematic.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_queues.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index a9ee147dd914..c50de8db72ce 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -145,18 +145,18 @@ struct netdev_stat_ops {
* be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
- size_t ndo_queue_mem_size;
- int (*ndo_queue_mem_alloc)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- void (*ndo_queue_mem_free)(struct net_device *dev,
- void *per_queue_mem);
- int (*ndo_queue_start)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- int (*ndo_queue_stop)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
+ size_t ndo_queue_mem_size;
+ int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ void (*ndo_queue_mem_free)(struct net_device *dev,
+ void *per_queue_mem);
+ int (*ndo_queue_start)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ int (*ndo_queue_stop)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
};
/**
--
2.49.0
* [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (9 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-23 21:17 ` Mina Almasry
2025-04-21 22:28 ` [RFC net-next 12/22] net: pass extack to netdev_rx_queue_restart() Jakub Kicinski
` (11 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Create an array of config structs to store per-queue config.
Pass these structs in the queue API. Drivers can also retrieve
the config for a single queue by calling netdev_queue_config()
directly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_queues.h | 19 +++++++
net/core/dev.h | 3 ++
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++-
drivers/net/ethernet/google/gve/gve_main.c | 9 ++--
drivers/net/netdevsim/netdev.c | 6 ++-
net/core/netdev_config.c | 58 ++++++++++++++++++++++
net/core/netdev_rx_queue.c | 11 ++--
7 files changed, 104 insertions(+), 10 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index c50de8db72ce..1fb621a00962 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -31,6 +31,13 @@ struct netdev_config {
/** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
*/
u8 hds_config;
+
+ /** @qcfg: per-queue configuration */
+ struct netdev_queue_config *qcfg;
+};
+
+/* Same semantics as fields in struct netdev_config */
+struct netdev_queue_config {
};
/* See the netdev.yaml spec for definition of each statistic */
@@ -129,6 +136,10 @@ struct netdev_stat_ops {
*
* @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
*
+ * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
+ * defaults. Queue config structs are passed to this
+ * helper before the user-requested settings are applied.
+ *
* @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
* The new memory is written at the specified address.
*
@@ -146,12 +157,17 @@ struct netdev_stat_ops {
*/
struct netdev_queue_mgmt_ops {
size_t ndo_queue_mem_size;
+ void (*ndo_queue_cfg_defaults)(struct net_device *dev,
+ int idx,
+ struct netdev_queue_config *qcfg);
int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
void (*ndo_queue_mem_free)(struct net_device *dev,
void *per_queue_mem);
int (*ndo_queue_start)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
int (*ndo_queue_stop)(struct net_device *dev,
@@ -159,6 +175,9 @@ struct netdev_queue_mgmt_ops {
int idx);
};
+void netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg);
+
/**
* DOC: Lockless queue stopping / waking helpers.
*
diff --git a/net/core/dev.h b/net/core/dev.h
index c8971c6f1fcd..6d7f5e920018 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -9,6 +9,7 @@
#include <net/netdev_lock.h>
struct net;
+struct netdev_queue_config;
struct netlink_ext_ack;
struct cpumask;
@@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
void __netdev_free_config(struct netdev_config *cfg);
void netdev_free_config(struct net_device *dev);
int netdev_reconfig_start(struct net_device *dev);
+void __netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg, bool pending);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a86bb2ba5adb..fbbe02cefdf1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15728,7 +15728,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
.get_base_stats = bnxt_get_base_stats,
};
-static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt_rx_ring_info *rxr, *clone;
struct bnxt *bp = netdev_priv(dev);
@@ -15896,7 +15898,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
dst->rx_agg_bmap = src->rx_agg_bmap;
}
-static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt *bp = netdev_priv(dev);
struct bnxt_rx_ring_info *rxr, *clone;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 8aaac9101377..088ae19ddb24 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2428,8 +2428,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
}
-static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
- int idx)
+static int gve_rx_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_alloc_rings_cfg cfg = {0};
@@ -2450,7 +2451,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
return err;
}
-static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
+static int gve_rx_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_ring *gve_per_q_mem;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0e0321a7ddd7..a2aa85a85e0f 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -661,7 +661,8 @@ struct nsim_queue_mem {
};
static int
-nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
@@ -710,7 +711,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
}
static int
-nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index 270b7f10a192..bad2d53522f0 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -8,18 +8,29 @@
int netdev_alloc_config(struct net_device *dev)
{
struct netdev_config *cfg;
+ unsigned int maxqs;
cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
if (!cfg)
return -ENOMEM;
+ maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+ cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
+ if (!cfg->qcfg)
+ goto err_free_cfg;
+
dev->cfg = cfg;
dev->cfg_pending = cfg;
return 0;
+
+err_free_cfg:
+ kfree(cfg);
+ return -ENOMEM;
}
void __netdev_free_config(struct netdev_config *cfg)
{
+ kfree(cfg->qcfg);
kfree(cfg);
}
@@ -32,12 +43,59 @@ void netdev_free_config(struct net_device *dev)
int netdev_reconfig_start(struct net_device *dev)
{
struct netdev_config *cfg;
+ unsigned int maxqs;
WARN_ON(dev->cfg != dev->cfg_pending);
cfg = kmemdup(dev->cfg, sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
if (!cfg)
return -ENOMEM;
+ maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
+ cfg->qcfg = kmemdup_array(dev->cfg->qcfg, maxqs, sizeof(*cfg->qcfg),
+ GFP_KERNEL_ACCOUNT);
+ if (!cfg->qcfg)
+ goto err_free_cfg;
+
dev->cfg_pending = cfg;
return 0;
+
+err_free_cfg:
+ kfree(cfg);
+ return -ENOMEM;
}
+
+void __netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg, bool pending)
+{
+ memset(qcfg, 0, sizeof(*qcfg));
+
+ /* Get defaults from the driver, in case user config not set */
+ if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
+ dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+}
+
+/**
+ * netdev_queue_config() - get configuration for a given queue
+ * @dev: net_device instance
+ * @rxq: index of the queue of interest
+ * @qcfg: queue configuration struct (output)
+ *
+ * Render the configuration for a given queue. This helper should be used
+ * by drivers which support queue configuration to retrieve config for
+ * a particular queue.
+ *
+ * @qcfg is an output parameter and is always fully initialized by this
+ * function. Some values may not be set by the user, drivers may either
+ * deal with the "unset" values in @qcfg, or provide the callback
+ * to populate defaults in queue_management_ops.
+ *
+ * Note that this helper returns pending config, as it is expected that
+ * "old" queues are retained until config is successful so they can
+ * be restored directly without asking for the config.
+ */
+void netdev_queue_config(struct net_device *dev, int rxq,
+ struct netdev_queue_config *qcfg)
+{
+ __netdev_queue_config(dev, rxq, qcfg, true);
+}
+EXPORT_SYMBOL(netdev_queue_config);
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index d126f10197bf..d8a710db21cd 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -7,12 +7,14 @@
#include <net/netdev_rx_queue.h>
#include <net/page_pool/memory_provider.h>
+#include "dev.h"
#include "page_pool_priv.h"
int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_queue_config qcfg;
void *new_mem, *old_mem;
int err;
@@ -32,7 +34,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
goto err_free_new_mem;
}
- err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
+ netdev_queue_config(dev, rxq_idx, &qcfg);
+
+ err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
@@ -45,7 +49,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
if (err)
goto err_free_new_queue_mem;
- err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
+ err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_start_queue;
} else {
@@ -60,6 +64,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
return 0;
err_start_queue:
+ __netdev_queue_config(dev, rxq_idx, &qcfg, false);
/* Restarting the queue with old_mem should be successful as we haven't
* changed any of the queue configuration, and there is not much we can
* do to recover from a failure here.
@@ -67,7 +72,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
* WARN if we fail to recover the old rx queue, and at least free
* old_mem so we don't also leak that.
*/
- if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
+ if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
WARN(1,
"Failed to restart old queue in error path. RX queue %d may be unhealthy.",
rxq_idx);
--
2.49.0
* [RFC net-next 12/22] net: pass extack to netdev_rx_queue_restart()
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (10 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 13/22] net: add queue config validation callback Jakub Kicinski
` (10 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Pass extack to netdev_rx_queue_restart(). A subsequent change will
need it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_rx_queue.h | 3 ++-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/netdevsim/netdev.c | 2 +-
net/core/netdev_rx_queue.c | 7 ++++---
4 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..a7def1f94823 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -56,6 +56,7 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
return index;
}
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq,
+ struct netlink_ext_ack *extack);
#endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index fbbe02cefdf1..627132ce10df 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11443,7 +11443,7 @@ static void bnxt_irq_affinity_notify(struct irq_affinity_notify *notify,
netdev_lock(irq->bp->dev);
if (netif_running(irq->bp->dev)) {
- err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr);
+ err = netdev_rx_queue_restart(irq->bp->dev, irq->ring_nr, NULL);
if (err)
netdev_err(irq->bp->dev,
"RX queue restart failed: err=%d\n", err);
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index a2aa85a85e0f..a9975985aa24 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -797,7 +797,7 @@ nsim_qreset_write(struct file *file, const char __user *data,
}
ns->rq_reset_mode = mode;
- ret = netdev_rx_queue_restart(ns->netdev, queue);
+ ret = netdev_rx_queue_restart(ns->netdev, queue, NULL);
ns->rq_reset_mode = 0;
if (ret)
goto exit_unlock;
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index d8a710db21cd..b0523eb44e10 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -10,7 +10,8 @@
#include "dev.h"
#include "page_pool_priv.h"
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
+ struct netlink_ext_ack *extack)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
@@ -136,7 +137,7 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
#endif
rxq->mp_params = *p;
- ret = netdev_rx_queue_restart(dev, rxq_idx);
+ ret = netdev_rx_queue_restart(dev, rxq_idx, extack);
if (ret) {
rxq->mp_params.mp_ops = NULL;
rxq->mp_params.mp_priv = NULL;
@@ -179,7 +180,7 @@ void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
rxq->mp_params.mp_ops = NULL;
rxq->mp_params.mp_priv = NULL;
- err = netdev_rx_queue_restart(dev, ifq_idx);
+ err = netdev_rx_queue_restart(dev, ifq_idx, NULL);
WARN_ON(err && err != -ENETDOWN);
}
--
2.49.0
* [RFC net-next 13/22] net: add queue config validation callback
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (11 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 12/22] net: pass extack to netdev_rx_queue_restart() Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 15:49 ` Stanislav Fomichev
2025-04-22 20:16 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops Jakub Kicinski
` (9 subsequent siblings)
22 siblings, 2 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While drivers can obviously do all the validation locally,
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params, and NIC-wide
vs per-queue).
Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before each queue restart
and when ring params are modified.
For NIC-wide changes the callback gets invoked for each active
(or soon-to-be active) queue, and additionally with a negative queue
index for NIC-wide defaults. The NIC-wide check is needed in
case all queues have an override active when the NIC-wide setting
is changed to an unsupported one. Alternatively we could check
the settings when new queues are enabled (in the channel API),
but accepting invalid config is a bad idea. Users may expect
that resetting a queue override will always work.
The "trick" of passing a negative index is a bit ugly; we may
want to revisit it if it causes confusion and bugs. Existing drivers
don't care about the index so it "just works".
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/netdev_queues.h | 12 ++++++++++++
net/core/dev.h | 2 ++
net/core/netdev_config.c | 20 ++++++++++++++++++++
net/core/netdev_rx_queue.c | 6 ++++++
net/ethtool/rings.c | 5 +++++
5 files changed, 45 insertions(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 1fb621a00962..d1bf53aa6f7e 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -140,6 +140,14 @@ struct netdev_stat_ops {
* defaults. Queue config structs are passed to this
* helper before the user-requested settings are applied.
*
+ * @ndo_queue_cfg_validate: (Optional) Check if queue config is supported.
+ * Called when configuration affecting a queue may be
+ * changing, either due to NIC-wide config, or config
+ * scoped to the queue at a specified index.
+ * When NIC-wide config is changed the callback will
+ * be invoked for all queues, and in addition to that
+ * with a negative queue index for the base settings.
+ *
* @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
* The new memory is written at the specified address.
*
@@ -160,6 +168,10 @@ struct netdev_queue_mgmt_ops {
void (*ndo_queue_cfg_defaults)(struct net_device *dev,
int idx,
struct netdev_queue_config *qcfg);
+ int (*ndo_queue_cfg_validate)(struct net_device *dev,
+ int idx,
+ struct netdev_queue_config *qcfg,
+ struct netlink_ext_ack *extack);
int (*ndo_queue_mem_alloc)(struct net_device *dev,
struct netdev_queue_config *qcfg,
void *per_queue_mem,
diff --git a/net/core/dev.h b/net/core/dev.h
index 6d7f5e920018..e0d433fb6325 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -99,6 +99,8 @@ void netdev_free_config(struct net_device *dev);
int netdev_reconfig_start(struct net_device *dev);
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending);
+int netdev_queue_config_revalidate(struct net_device *dev,
+ struct netlink_ext_ack *extack);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index bad2d53522f0..fc700b77e4eb 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
__netdev_queue_config(dev, rxq, qcfg, true);
}
EXPORT_SYMBOL(netdev_queue_config);
+
+int netdev_queue_config_revalidate(struct net_device *dev,
+ struct netlink_ext_ack *extack)
+{
+ const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_queue_config qcfg;
+ int i, err;
+
+ if (!qops || !qops->ndo_queue_cfg_validate)
+ return 0;
+
+ for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
+ netdev_queue_config(dev, i, &qcfg);
+ err = qops->ndo_queue_cfg_validate(dev, i, &qcfg, extack);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index b0523eb44e10..7c691eb1a48b 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -37,6 +37,12 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx,
netdev_queue_config(dev, rxq_idx, &qcfg);
+ if (qops->ndo_queue_cfg_validate) {
+ err = qops->ndo_queue_cfg_validate(dev, rxq_idx, &qcfg, extack);
+ if (err)
+ goto err_free_old_mem;
+ }
+
err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index 6a74e7e4064e..7884d10c090f 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -4,6 +4,7 @@
#include "netlink.h"
#include "common.h"
+#include "../core/dev.h"
struct rings_req_info {
struct ethnl_req_info base;
@@ -307,6 +308,10 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
dev->cfg_pending->hds_config = kernel_ringparam.tcp_data_split;
dev->cfg_pending->hds_thresh = kernel_ringparam.hds_thresh;
+ ret = netdev_queue_config_revalidate(dev, info->extack);
+ if (ret)
+ return ret;
+
ret = dev->ethtool_ops->set_ringparam(dev, &ringparam,
&kernel_ringparam, info->extack);
return ret < 0 ? ret : 1;
--
2.49.0
* [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (12 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 13/22] net: add queue config validation callback Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 15:50 ` Stanislav Fomichev
2025-04-22 20:18 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 15/22] eth: bnxt: store the rx buf size per queue Jakub Kicinski
` (8 subsequent siblings)
22 siblings, 2 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Core provides a centralized callback for validating per-queue
settings, but the callback is part of the queue management ops.
Having the ops conditionally set complicates the parts of the
driver which could otherwise lean on the core to feed it the
correct settings.
Always set the queue ops, but provide no restart-related callbacks
if queue ops are not supported by the device. This should maintain
current behavior; the check in netdev_rx_queue_restart() looks at
both the op struct and the individual ops.
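A minimal user-space model of that guard (the member names mirror the kernel ops, but the types here are illustrative): restart is only attempted when both the ops struct and the individual callbacks are non-NULL, so an ops struct with no callbacks preserves the old "unsupported" behavior:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reduction of struct netdev_queue_mgmt_ops. */
struct queue_mgmt_ops {
	int (*ndo_queue_mem_alloc)(void);
	int (*ndo_queue_start)(void);
};

static int dummy_ok(void) { return 0; }

/* Restart is supported only if the ops struct exists AND the
 * restart-related callbacks are populated -- an empty struct
 * (like bnxt_queue_mgmt_ops_unsupp) behaves like no ops at all.
 */
static int restart_supported(const struct queue_mgmt_ops *qops)
{
	return qops && qops->ndo_queue_mem_alloc && qops->ndo_queue_start;
}
```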
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 627132ce10df..49d66f4a5ad0 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16039,6 +16039,9 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
.ndo_queue_stop = bnxt_queue_stop,
};
+static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+};
+
static void bnxt_remove_one(struct pci_dev *pdev)
{
struct net_device *dev = pci_get_drvdata(pdev);
@@ -16694,7 +16697,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bp->rss_cap |= BNXT_RSS_CAP_MULTI_RSS_CTX;
if (BNXT_SUPPORTS_QUEUE_API(bp))
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
- dev->request_ops_lock = true;
+ else
+ dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
rc = register_netdev(dev);
if (rc)
--
2.49.0
* [RFC net-next 15/22] eth: bnxt: store the rx buf size per queue
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (13 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Jakub Kicinski
` (7 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
In normal operation only a subset of queues is configured for
zero-copy. Since zero-copy is the main use case for larger buffer
sizes, we need to configure the sizes per queue.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 46 ++++++++++---------
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +--
4 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1723909bde77..12728cfa7db8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
unsigned long *rx_agg_bmap;
u16 rx_agg_bmap_size;
+ u16 rx_page_size;
bool need_head_pool;
dma_addr_t rx_desc_mapping[MAX_RX_PAGES];
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
index 220285e190fc..8933a0dec09a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
@@ -32,6 +32,6 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
void bnxt_xdp_buff_frags_free(struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
struct sk_buff *bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb,
- u8 num_frags, struct page_pool *pool,
+ u8 num_frags, struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
#endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 49d66f4a5ad0..28f8a4e0d41b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -895,7 +895,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
{
- return rxr->need_head_pool || PAGE_SIZE > rxr->bnapi->bp->rx_page_size;
+ return rxr->need_head_pool || PAGE_SIZE > rxr->rx_page_size;
}
static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -905,9 +905,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
{
struct page *page;
- if (PAGE_SIZE > bp->rx_page_size) {
+ if (PAGE_SIZE > rxr->rx_page_size) {
page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
- bp->rx_page_size);
+ rxr->rx_page_size);
} else {
page = page_pool_dev_alloc_pages(rxr->page_pool);
*offset = 0;
@@ -1139,9 +1139,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
- skb = napi_build_skb(data_ptr - bp->rx_offset, bp->rx_page_size);
+ skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
if (!skb) {
page_pool_recycle_direct(rxr->page_pool, page);
return NULL;
@@ -1173,7 +1173,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, bp->rx_page_size,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
if (unlikely(!payload))
@@ -1187,7 +1187,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
skb_mark_for_recycle(skb);
off = (void *)data_ptr - page_address(page);
- skb_add_rx_frag(skb, 0, page, off, len, bp->rx_page_size);
+ skb_add_rx_frag(skb, 0, page, off, len, rxr->rx_page_size);
memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
payload + NET_IP_ALIGN);
@@ -1272,7 +1272,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
cons_rx_buf->offset,
- frag_len, bp->rx_page_size);
+ frag_len, rxr->rx_page_size);
} else {
skb_frag_t *frag = &shinfo->frags[i];
@@ -1297,7 +1297,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb->len -= frag_len;
skb->data_len -= frag_len;
- skb->truesize -= bp->rx_page_size;
+ skb->truesize -= rxr->rx_page_size;
}
--shinfo->nr_frags;
@@ -1312,7 +1312,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
}
page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
- bp->rx_page_size);
+ rxr->rx_page_size);
total_frag_len += frag_len;
prod = NEXT_RX_AGG(prod);
@@ -2264,8 +2264,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
if (!skb)
goto oom_next_rx;
} else {
- skb = bnxt_xdp_build_skb(bp, skb, agg_bufs,
- rxr->page_pool, &xdp);
+ skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr, &xdp);
if (!skb) {
/* we should be able to free the old skb here */
bnxt_xdp_buff_frags_free(rxr, &xdp);
@@ -3802,7 +3801,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
pp.pool_size = bp->rx_agg_ring_size;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
- pp.order = get_order(bp->rx_page_size);
+ pp.order = get_order(rxr->rx_page_size);
pp.nid = numa_node;
pp.napi = &rxr->bnapi->napi;
pp.netdev = bp->dev;
@@ -4288,6 +4287,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
+ rxr->rx_page_size = bp->rx_page_size;
+
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
rmem->nr_pages = bp->rx_nr_pages;
@@ -4447,7 +4448,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
ring = &rxr->rx_agg_ring_struct;
ring->fw_ring_id = INVALID_HW_RING_ID;
if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
- type = ((u32)bp->rx_page_size << RX_BD_LEN_SHIFT) |
+ type = ((u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
RX_BD_TYPE_RX_AGG_BD | RX_BD_FLAGS_SOP;
bnxt_init_rxbd_pages(ring, type);
@@ -7012,6 +7013,7 @@ static void bnxt_hwrm_ring_grp_free(struct bnxt *bp)
static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
struct hwrm_ring_alloc_input *req,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring)
{
struct bnxt_ring_grp_info *grp_info = &bp->grp_info[ring->grp_idx];
@@ -7021,7 +7023,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
if (ring_type == HWRM_RING_ALLOC_AGG) {
req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
- req->rx_buf_size = cpu_to_le16(bp->rx_page_size);
+ req->rx_buf_size = cpu_to_le16(rxr->rx_page_size);
enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
} else {
req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -7035,6 +7037,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
}
static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring,
u32 ring_type, u32 map_index)
{
@@ -7091,7 +7094,8 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
cpu_to_le32(bp->rx_ring_mask + 1) :
cpu_to_le32(bp->rx_agg_ring_mask + 1);
if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
- bnxt_set_rx_ring_params_p5(bp, ring_type, req, ring);
+ bnxt_set_rx_ring_params_p5(bp, ring_type, req,
+ rxr, ring);
break;
case HWRM_RING_ALLOC_CMPL:
req->ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
@@ -7239,7 +7243,7 @@ static int bnxt_hwrm_rx_ring_alloc(struct bnxt *bp,
u32 map_idx = bnapi->index;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7259,7 +7263,7 @@ static int bnxt_hwrm_rx_agg_ring_alloc(struct bnxt *bp,
int rc;
map_idx = grp_idx + bp->rx_nr_rings;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7283,7 +7287,7 @@ static int bnxt_hwrm_cp_ring_alloc_p5(struct bnxt *bp,
ring = &cpr->cp_ring_struct;
ring->handle = BNXT_SET_NQ_HDL(cpr);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc)
return rc;
bnxt_set_db(bp, &cpr->cp_db, type, map_idx, ring->fw_ring_id);
@@ -7298,7 +7302,7 @@ static int bnxt_hwrm_tx_ring_alloc(struct bnxt *bp,
const u32 type = HWRM_RING_ALLOC_TX;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, tx_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, tx_idx);
if (rc)
return rc;
bnxt_set_db(bp, &txr->tx_db, type, tx_idx, ring->fw_ring_id);
@@ -7324,7 +7328,7 @@ static int bnxt_hwrm_ring_alloc(struct bnxt *bp)
vector = bp->irq_tbl[map_idx].vector;
disable_irq_nosync(vector);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc) {
enable_irq(vector);
goto err_out;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index e035f50fd6e3..a970eb0b058c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
u16 cons, u8 *data_ptr, unsigned int len,
struct xdp_buff *xdp)
{
- u32 buflen = bp->rx_page_size;
+ u32 buflen = rxr->rx_page_size;
struct bnxt_sw_rx_bd *rx_buf;
struct pci_dev *pdev;
dma_addr_t mapping;
@@ -461,7 +461,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
struct sk_buff *
bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
- struct page_pool *pool, struct xdp_buff *xdp)
+ struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp)
{
struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
@@ -470,7 +470,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
- bp->rx_page_size * num_frags,
+ rxr->rx_page_size * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp));
return skb;
}
--
2.49.0
* [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (14 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 15/22] eth: bnxt: store the rx buf size per queue Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 16:13 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue Jakub Kicinski
` (6 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
The driver tries to provision more agg buffers than header buffers
since multiple agg segments can reuse the same header. The calculation
/ heuristic tries to provide enough pages for 65k of data for each header
(or 4 frags per header if the result is too big). This calculation is
currently global to the adapter. If we increase the buffer sizes 8x,
we don't want 8x the amount of memory sitting on the rings.
Luckily, we don't have to fill the rings completely; adjust
the fill level dynamically in case a particular queue has buffers
larger than the global size.
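The arithmetic is simple enough to sketch outside the driver — this mirrors the shape of bnxt_rx_agg_ring_fill_level() from the patch below (values are illustrative):

```c
#include <assert.h>

/* When a queue uses pages larger than the adapter-wide size, fill
 * proportionally fewer ring slots so the amount of memory consumed
 * per ring stays roughly uniform across queues.
 */
static int agg_ring_fill_level(int agg_ring_size,
			       int queue_page_size, int global_page_size)
{
	int fill_level = agg_ring_size;

	if (queue_page_size > global_page_size)
		fill_level /= queue_page_size / global_page_size;

	return fill_level;
}
```

For example, a queue with 32k pages on an adapter whose global page size is 4k (an 8x increase) fills only 1/8 of the ring slots.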
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 28f8a4e0d41b..43497b335329 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3791,6 +3791,21 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
}
}
+static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr)
+{
+ /* User may have chosen larger than default rx_page_size,
+ * we keep the ring sizes uniform and also want uniform amount
+ * of bytes consumed per ring, so cap how much of the rings we fill.
+ */
+ int fill_level = bp->rx_agg_ring_size;
+
+ if (rxr->rx_page_size > bp->rx_page_size)
+ fill_level /= rxr->rx_page_size / bp->rx_page_size;
+
+ return fill_level;
+}
+
static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int numa_node)
@@ -3798,7 +3813,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
struct page_pool_params pp = { 0 };
struct page_pool *pool;
- pp.pool_size = bp->rx_agg_ring_size;
+ pp.pool_size = bnxt_rx_agg_ring_fill_level(bp, rxr);
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size;
pp.order = get_order(rxr->rx_page_size);
@@ -4366,11 +4381,13 @@ static void bnxt_alloc_one_rx_ring_netmem(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int ring_nr)
{
+ int fill_level, i;
u32 prod;
- int i;
+
+ fill_level = bnxt_rx_agg_ring_fill_level(bp, rxr);
prod = rxr->rx_agg_prod;
- for (i = 0; i < bp->rx_agg_ring_size; i++) {
+ for (i = 0; i < fill_level; i++) {
if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
ring_nr, i, bp->rx_ring_size);
--
2.49.0
* [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (15 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 16:15 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 18/22] net: wipe the setting of deactivated queues Jakub Kicinski
` (5 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Zero-copy APIs increase the cost of buffer management. They also extend
this cost to user space applications which may be used to dealing with
much larger buffers. Allow setting rx-buf-len per queue; devices with
HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1,
but that's not a power of 2).
The implementation adds a new option to the netdev netlink, rather
than ethtool. The NIC-wide setting lives in ethtool ringparams so
one could argue that we should be extending the ethtool API.
OTOH netdev API is where we already have queue-get, and it's how
zero-copy applications bind memory providers.
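The precedence between the driver default, the NIC-wide setting, and the per-queue override (as applied by __netdev_queue_config() in this patch) can be modeled as a short sketch, where zero means "not set":

```c
#include <assert.h>

/* Illustrative model of how the effective rx_buf_len for a queue
 * is resolved: driver default, overridden by the device-wide user
 * setting, overridden in turn by the per-queue user setting.
 */
static unsigned int effective_rx_buf_len(unsigned int drv_default,
					 unsigned int dev_cfg,
					 unsigned int queue_cfg)
{
	unsigned int len = drv_default;

	if (dev_cfg)		/* ethtool -G style NIC-wide setting */
		len = dev_cfg;
	if (queue_cfg)		/* netdev-nl queue-set override */
		len = queue_cfg;
	return len;
}
```

Resetting a queue override (setting it back to zero) falls back to the NIC-wide value, which is why the revalidation of the base config earlier in the series matters.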
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Documentation/netlink/specs/netdev.yaml | 15 ++++
include/net/netdev_queues.h | 5 ++
include/net/netlink.h | 19 +++++
include/uapi/linux/netdev.h | 2 +
net/core/netdev-genl-gen.h | 1 +
tools/include/uapi/linux/netdev.h | 2 +
net/core/netdev-genl-gen.c | 15 ++++
net/core/netdev-genl.c | 92 +++++++++++++++++++++++++
net/core/netdev_config.c | 16 +++++
9 files changed, 167 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index f5e0750ab71d..b0dfa970ee83 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -324,6 +324,10 @@ name: netdev
doc: XSK information for this queue, if any.
type: nest
nested-attributes: xsk-info
+ -
+ name: rx-buf-len
+ doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
+ type: u32
-
name: qstats
doc: |
@@ -743,6 +747,17 @@ name: netdev
- defer-hard-irqs
- gro-flush-timeout
- irq-suspend-timeout
+ -
+ name: queue-set
+ doc: Set per-queue configurable options.
+ attribute-set: queue
+ do:
+ request:
+ attributes:
+ - ifindex
+ - type
+ - id
+ - rx-buf-len
kernel-family:
headers: [ "net/netdev_netlink.h"]
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index d1bf53aa6f7e..3b09ffee7d4e 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -38,6 +38,7 @@ struct netdev_config {
/* Same semantics as fields in struct netdev_config */
struct netdev_queue_config {
+ u32 rx_buf_len;
};
/* See the netdev.yaml spec for definition of each statistic */
@@ -134,6 +135,8 @@ struct netdev_stat_ops {
/**
* struct netdev_queue_mgmt_ops - netdev ops for queue management
*
+ * @supported_ring_params: ring params supported per queue (ETHTOOL_RING_USE_*).
+ *
* @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
*
* @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
@@ -164,6 +167,8 @@ struct netdev_stat_ops {
* be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
+ u32 supported_ring_params;
+
size_t ndo_queue_mem_size;
void (*ndo_queue_cfg_defaults)(struct net_device *dev,
int idx,
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 82e07e272290..d27d50543166 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -2180,6 +2180,25 @@ static inline struct nla_bitfield32 nla_get_bitfield32(const struct nlattr *nla)
return tmp;
}
+/**
+ * nla_update_u32() - update u32 value from NLA_U32 attribute
+ * @dst: value to update
+ * @attr: netlink attribute with new value or null
+ *
+ * Copy the u32 value from NLA_U32 netlink attribute @attr into variable
+ * pointed to by @dst; do nothing if @attr is null.
+ *
+ * Return: true if this function changed the value of @dst, otherwise false.
+ */
+static inline bool nla_update_u32(u32 *dst, const struct nlattr *attr)
+{
+ u32 old_val = *dst;
+
+ if (attr)
+ *dst = nla_get_u32(attr);
+ return *dst != old_val;
+}
+
/**
* nla_memdup - duplicate attribute memory (kmemdup)
* @src: netlink attribute to duplicate from
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7600bf62dbdf..b6acd5be3d0b 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -152,6 +152,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_RX_BUF_LEN,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -219,6 +220,7 @@ enum {
NETDEV_CMD_QSTATS_GET,
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
+ NETDEV_CMD_QUEUE_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 17d39fd64c94..b0ce910a8846 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -34,6 +34,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info);
enum {
NETDEV_NLGRP_MGMT,
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7600bf62dbdf..b6acd5be3d0b 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -152,6 +152,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_RX_BUF_LEN,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
@@ -219,6 +220,7 @@ enum {
NETDEV_CMD_QSTATS_GET,
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
+ NETDEV_CMD_QUEUE_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 739f7b6506a6..85f190e59ea5 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -99,6 +99,14 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
};
+/* NETDEV_CMD_QUEUE_SET - do */
+static const struct nla_policy netdev_queue_set_nl_policy[NETDEV_A_QUEUE_RX_BUF_LEN + 1] = {
+ [NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+ [NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
+ [NETDEV_A_QUEUE_ID] = { .type = NLA_U32, },
+ [NETDEV_A_QUEUE_RX_BUF_LEN] = { .type = NLA_U32, },
+};
+
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@@ -190,6 +198,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
+ {
+ .cmd = NETDEV_CMD_QUEUE_SET,
+ .doit = netdev_nl_queue_set_doit,
+ .policy = netdev_queue_set_nl_policy,
+ .maxattr = NETDEV_A_QUEUE_RX_BUF_LEN,
+ .flags = GENL_CMD_CAP_DO,
+ },
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2c104947d224..1efaaeb30c57 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -372,6 +372,30 @@ static int nla_put_napi_id(struct sk_buff *skb, const struct napi_struct *napi)
return 0;
}
+static int
+netdev_nl_queue_fill_cfg(struct sk_buff *rsp, struct net_device *netdev,
+ u32 q_idx, u32 q_type)
+{
+ struct netdev_queue_config *qcfg;
+
+ if (!netdev_need_ops_lock(netdev))
+ return 0;
+
+ qcfg = &netdev->cfg->qcfg[q_idx];
+ switch (q_type) {
+ case NETDEV_QUEUE_TYPE_RX:
+ if (qcfg->rx_buf_len &&
+ nla_put_u32(rsp, NETDEV_A_QUEUE_RX_BUF_LEN,
+ qcfg->rx_buf_len))
+ return -EMSGSIZE;
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)
@@ -419,6 +443,9 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
break;
}
+ if (netdev_nl_queue_fill_cfg(rsp, netdev, q_idx, q_type))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
@@ -558,6 +585,71 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
return err;
}
+int netdev_nl_queue_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr * const *tb = info->attrs;
+ struct netdev_queue_config *qcfg;
+ u32 q_id, q_type, ifindex;
+ struct net_device *netdev;
+ bool mod;
+ int ret;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_TYPE) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_IFINDEX))
+ return -EINVAL;
+
+ q_id = nla_get_u32(tb[NETDEV_A_QUEUE_ID]);
+ q_type = nla_get_u32(tb[NETDEV_A_QUEUE_TYPE]);
+ ifindex = nla_get_u32(tb[NETDEV_A_QUEUE_IFINDEX]);
+
+ if (q_type != NETDEV_QUEUE_TYPE_RX) {
+ /* Only Rx params exist right now */
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_TYPE]);
+ return -EINVAL;
+ }
+
+ ret = 0;
+ netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex);
+ if (!netdev || !netif_device_present(netdev))
+ ret = -ENODEV;
+ else if (!netdev->queue_mgmt_ops)
+ ret = -EOPNOTSUPP;
+ if (ret) {
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_IFINDEX]);
+ goto exit_unlock;
+ }
+
+ ret = netdev_nl_queue_validate(netdev, q_id, q_type);
+ if (ret) {
+ NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_ID]);
+ goto exit_unlock;
+ }
+
+ ret = netdev_reconfig_start(netdev);
+ if (ret)
+ goto exit_unlock;
+
+ qcfg = &netdev->cfg_pending->qcfg[q_id];
+ mod = nla_update_u32(&qcfg->rx_buf_len, tb[NETDEV_A_QUEUE_RX_BUF_LEN]);
+ if (!mod)
+ goto exit_free_cfg;
+
+ ret = netdev_rx_queue_restart(netdev, q_id, info->extack);
+ if (ret)
+ goto exit_free_cfg;
+
+ swap(netdev->cfg, netdev->cfg_pending);
+
+exit_free_cfg:
+ __netdev_free_config(netdev->cfg_pending);
+ netdev->cfg_pending = netdev->cfg;
+exit_unlock:
+ if (netdev)
+ netdev_unlock(netdev);
+ return ret;
+}
+
#define NETDEV_STAT_NOT_SET (~0ULL)
static void netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index fc700b77e4eb..ede02b77470e 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -67,11 +67,27 @@ int netdev_reconfig_start(struct net_device *dev)
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending)
{
+ const struct netdev_config *cfg;
+
+ cfg = pending ? dev->cfg_pending : dev->cfg;
+
memset(qcfg, 0, sizeof(*qcfg));
/* Get defaults from the driver, in case user config not set */
if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
+
+ /* Set config based on device-level settings */
+ if (cfg->rx_buf_len)
+ qcfg->rx_buf_len = cfg->rx_buf_len;
+
+ /* Set config dedicated to this queue */
+ if (rxq >= 0) {
+ const struct netdev_queue_config *user_cfg = &cfg->qcfg[rxq];
+
+ if (user_cfg->rx_buf_len)
+ qcfg->rx_buf_len = user_cfg->rx_buf_len;
+ }
}
/**
--
2.49.0
* [RFC net-next 18/22] net: wipe the setting of deactivated queues
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (16 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 16:21 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 19/22] eth: bnxt: use queue op config validate Jakub Kicinski
` (4 subsequent siblings)
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Clear out all settings of deactivated queues when the user changes
the number of channels. We already perform similar cleanup
for shapers.
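A hedged user-space sketch of the clearing logic, with an illustrative qcfg struct standing in for the kernel's per-queue config array — it mirrors the shape of netdev_queue_config_update_cnt() below:

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for the per-queue config entries. */
struct qcfg { unsigned int rx_buf_len; };

/* When the real rx queue count shrinks, wipe the config of the
 * queues past the new count so stale overrides don't resurface
 * if those queues are re-activated later.
 */
static void wipe_deactivated(struct qcfg *qcfg, unsigned int old_cnt,
			     unsigned int new_cnt)
{
	if (new_cnt < old_cnt)
		memset(&qcfg[new_cnt], 0,
		       (old_cnt - new_cnt) * sizeof(*qcfg));
}
```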
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/core/dev.h | 2 ++
net/core/dev.c | 5 +++++
net/core/netdev_config.c | 13 +++++++++++++
3 files changed, 20 insertions(+)
diff --git a/net/core/dev.h b/net/core/dev.h
index e0d433fb6325..4cdd8ac7df4f 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -101,6 +101,8 @@ void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending);
int netdev_queue_config_revalidate(struct net_device *dev,
struct netlink_ext_ack *extack);
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+ unsigned int rxq);
/* netdev management, shared between various uAPI entry points */
struct netdev_name_node {
diff --git a/net/core/dev.c b/net/core/dev.c
index 7930b57d1767..c1f9b6ce6500 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3188,6 +3188,8 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
if (dev->num_tc)
netif_setup_tc(dev, txq);
+ netdev_queue_config_update_cnt(dev, txq,
+ dev->real_num_rx_queues);
net_shaper_set_real_num_tx_queues(dev, txq);
dev_qdisc_change_real_num_tx(dev, txq);
@@ -3234,6 +3236,9 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
rxq);
if (rc)
return rc;
+
+ netdev_queue_config_update_cnt(dev, dev->real_num_tx_queues,
+ rxq);
}
dev->real_num_rx_queues = rxq;
diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
index ede02b77470e..c5ae39e76f40 100644
--- a/net/core/netdev_config.c
+++ b/net/core/netdev_config.c
@@ -64,6 +64,19 @@ int netdev_reconfig_start(struct net_device *dev)
return -ENOMEM;
}
+void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
+ unsigned int rxq)
+{
+ size_t len;
+
+ if (rxq < dev->real_num_rx_queues) {
+ len = (dev->real_num_rx_queues - rxq) * sizeof(*dev->cfg->qcfg);
+
+ memset(&dev->cfg->qcfg[rxq], 0, len);
+ memset(&dev->cfg_pending->qcfg[rxq], 0, len);
+ }
+}
+
void __netdev_queue_config(struct net_device *dev, int rxq,
struct netdev_queue_config *qcfg, bool pending)
{
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (17 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 18/22] net: wipe the setting of deactivated queues Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-23 10:00 ` Dragos Tatulea
2025-06-12 11:56 ` Dragos Tatulea
2025-04-21 22:28 ` [RFC net-next 20/22] eth: bnxt: support per queue configuration of rx-buf-len Jakub Kicinski
` (3 subsequent siblings)
22 siblings, 2 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Move the rx-buf-len config validation to the queue ops.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 40 +++++++++++++++++++
.../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ------
2 files changed, 40 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 43497b335329..a772ffaf3e5b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -16052,8 +16052,46 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
return 0;
}
+static int
+bnxt_queue_cfg_validate(struct net_device *dev, int idx,
+ struct netdev_queue_config *qcfg,
+ struct netlink_ext_ack *extack)
+{
+ struct bnxt *bp = netdev_priv(dev);
+
+ /* Older chips need MSS calc so rx_buf_len is not supported,
+ * but we don't set queue ops for them so we should never get here.
+ */
+ if (qcfg->rx_buf_len != bp->rx_page_size &&
+ !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
+ NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
+ return -EINVAL;
+ }
+
+ if (!is_power_of_2(qcfg->rx_buf_len)) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
+ return -ERANGE;
+ }
+ if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
+ qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
+ NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
+ return -ERANGE;
+ }
+ return 0;
+}
+
+static void
+bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
+ struct netdev_queue_config *qcfg)
+{
+ qcfg->rx_buf_len = BNXT_RX_PAGE_SIZE;
+}
+
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
.ndo_queue_mem_size = sizeof(struct bnxt_rx_ring_info),
+
+ .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
+ .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
.ndo_queue_mem_alloc = bnxt_queue_mem_alloc,
.ndo_queue_mem_free = bnxt_queue_mem_free,
.ndo_queue_start = bnxt_queue_start,
@@ -16061,6 +16099,8 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
};
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
+ .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
+ .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
};
static void bnxt_remove_one(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 956f51449709..8842390f687f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -867,18 +867,6 @@ static int bnxt_set_ringparam(struct net_device *dev,
if (!kernel_ering->rx_buf_len) /* Zero means restore default */
kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
- if (kernel_ering->rx_buf_len != bp->rx_page_size &&
- !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
- NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
- return -EINVAL;
- }
- if (!is_power_of_2(kernel_ering->rx_buf_len) ||
- kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
- kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
- NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
- return -ERANGE;
- }
-
if (netif_running(dev))
bnxt_close_nic(bp, false, false);
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC net-next 20/22] eth: bnxt: support per queue configuration of rx-buf-len
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (18 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 19/22] eth: bnxt: use queue op config validate Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace Jakub Kicinski
` (2 subsequent siblings)
22 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Now that rx_buf_len is stored and validated per queue, allow it
to be set differently for different queues. Instead of copying
the device setting for each queue, ask the core for the config
via netdev_queue_config().
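The layering the core applies when a driver asks for a queue's config can be sketched as a standalone model (hypothetical names, not the in-kernel API): a per-queue user setting wins over the device-wide user setting, which wins over the driver default, with 0/absent meaning "not set".

```python
def resolve_rx_buf_len(driver_default, dev_cfg, queue_cfg, rxq):
    """Model of resolving a queue's effective rx-buf-len.

    Precedence: per-queue override > device-wide user setting >
    driver default. Zero or a missing key means "not set".
    """
    val = driver_default
    if dev_cfg.get('rx_buf_len'):
        val = dev_cfg['rx_buf_len']
    # A negative index means "NIC-wide defaults", no per-queue layer.
    if 0 <= rxq < len(queue_cfg) and queue_cfg[rxq].get('rx_buf_len'):
        val = queue_cfg[rxq]['rx_buf_len']
    return val
```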
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a772ffaf3e5b..141acfed0f91 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4280,6 +4280,7 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
for (i = 0; i < bp->cp_nr_rings; i++) {
struct bnxt_napi *bnapi = bp->bnapi[i];
+ struct netdev_queue_config qcfg;
struct bnxt_ring_mem_info *rmem;
struct bnxt_cp_ring_info *cpr;
struct bnxt_rx_ring_info *rxr;
@@ -4302,7 +4303,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
- rxr->rx_page_size = bp->rx_page_size;
+ netdev_queue_config(bp->dev, i, &qcfg);
+ rxr->rx_page_size = qcfg.rx_buf_len;
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
@@ -15771,6 +15773,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev,
clone->rx_agg_prod = 0;
clone->rx_sw_agg_prod = 0;
clone->rx_next_cons = 0;
+ clone->rx_page_size = qcfg->rx_buf_len;
clone->need_head_pool = false;
rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
@@ -15877,6 +15880,8 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
src_ring = &src->rx_ring_struct;
src_rmem = &src_ring->ring_mem;
+ dst->rx_page_size = src->rx_page_size;
+
WARN_ON(dst_rmem->nr_pages != src_rmem->nr_pages);
WARN_ON(dst_rmem->page_size != src_rmem->page_size);
WARN_ON(dst_rmem->flags != src_rmem->flags);
@@ -16088,6 +16093,7 @@ bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
}
static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
+ .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN,
.ndo_queue_mem_size = sizeof(struct bnxt_rx_ring_info),
.ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (19 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 20/22] eth: bnxt: support per queue configuration of rx-buf-len Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 16:36 ` Stanislav Fomichev
2025-06-25 12:23 ` Breno Leitao
2025-04-21 22:28 ` [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len Jakub Kicinski
2025-04-23 20:02 ` [RFC net-next 00/22] net: per-queue rx-buf-len configuration Mina Almasry
22 siblings, 2 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
bpftrace is very useful for low-level driver testing. perf or trace-cmd
would also do for collecting data from tracepoints, but they require
much more post-processing.
Add a wrapper for running bpftrace and sanitizing its output.
bpftrace has JSON output, which is great, but it prints loose objects
in a slightly inconvenient format. We have to read the objects
line by line, and while at it we return them indexed by the map name.
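For illustration, the transformation the helper performs can be sketched like this (the sample lines are made up; bpftrace with `-f json` emits one JSON object per line, and only the `"map"` records carry the data we want):

```python
import json

# Hypothetical bpftrace -f json -q output: one object per line.
sample = '\n'.join([
    '{"type": "attached_probes", "data": {"probes": 1}}',
    '{"type": "map", "data": {"@": {"1500": 12, "719": 3}}}',
    '{"type": "map", "data": {"@h": {"64": 15}}}',
])

maps = {}
for line in sample.split('\n'):
    if not line.strip():
        continue
    obj = json.loads(line)
    if obj.get('type') != 'map':
        continue          # skip non-map records, e.g. attached_probes
    maps.update(obj['data'])  # index the result by map name
# maps is now {'@': {...}, '@h': {...}}
```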
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
tools/testing/selftests/net/lib/py/utils.py | 33 +++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/tools/testing/selftests/net/lib/py/utils.py b/tools/testing/selftests/net/lib/py/utils.py
index 34470d65d871..760ccf6fcccc 100644
--- a/tools/testing/selftests/net/lib/py/utils.py
+++ b/tools/testing/selftests/net/lib/py/utils.py
@@ -185,6 +185,39 @@ global_defer_queue = []
return tool('ethtool', args, json=json, ns=ns, host=host)
+def bpftrace(expr, json=None, ns=None, host=None, timeout=None):
+ """
+ Run bpftrace and return map data (if json=True).
+ The output of bpftrace is inconvenient, so the helper converts
+ to a dict indexed by map name, e.g.:
+ {
+ "@": { ... },
+ "@map2": { ... },
+ }
+ """
+ cmd_arr = ['bpftrace']
+ # Throw in --quiet if json, otherwise the output has two objects
+ if json:
+ cmd_arr += ['-f', 'json', '-q']
+ if timeout:
+ expr += ' interval:s:' + str(timeout) + ' { exit(); }'
+ cmd_arr += ['-e', expr]
+ cmd_obj = cmd(cmd_arr, ns=ns, host=host, shell=False)
+ if json:
+ # bpftrace prints objects as lines
+ ret = {}
+ for l in cmd_obj.stdout.split('\n'):
+ if not l.strip():
+ continue
+ one = _json.loads(l)
+ if one.get('type') != 'map':
+ continue
+ for k, v in one["data"].items():
+ ret[k] = v
+ return ret
+ return cmd_obj
+
+
def rand_port(type=socket.SOCK_STREAM):
"""
Get a random unprivileged port.
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (20 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace Jakub Kicinski
@ 2025-04-21 22:28 ` Jakub Kicinski
2025-04-22 17:06 ` Stanislav Fomichev
2025-04-23 20:02 ` [RFC net-next 00/22] net: per-queue rx-buf-len configuration Mina Almasry
22 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-21 22:28 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan, Jakub Kicinski
Add a test for rx-buf-len. Check both nic-wide and per-queue settings.
# ./drivers/net/hw/rx_buf_len.py
TAP version 13
1..6
ok 1 rx_buf_len.nic_wide
ok 2 rx_buf_len.nic_wide_check_packets
ok 3 rx_buf_len.per_queue
ok 4 rx_buf_len.per_queue_check_packets
ok 5 rx_buf_len.queue_check_ring_count
ok 6 rx_buf_len.queue_check_ring_count_check_packets
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
.../testing/selftests/drivers/net/hw/Makefile | 1 +
.../selftests/drivers/net/hw/rx_buf_len.py | 299 ++++++++++++++++++
2 files changed, 300 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/hw/rx_buf_len.py
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 07cddb19ba35..88625c0e86c8 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -20,6 +20,7 @@ TEST_PROGS = \
pp_alloc_fail.py \
rss_ctx.py \
rss_input_xfrm.py \
+ rx_buf_len.py \
tso.py \
#
diff --git a/tools/testing/selftests/drivers/net/hw/rx_buf_len.py b/tools/testing/selftests/drivers/net/hw/rx_buf_len.py
new file mode 100755
index 000000000000..d8a6d07fac5e
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/rx_buf_len.py
@@ -0,0 +1,299 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import errno, time
+from typing import Tuple
+from lib.py import ksft_run, ksft_exit, KsftSkipEx, KsftFailEx
+from lib.py import ksft_eq, ksft_ge, ksft_in, ksft_not_in
+from lib.py import EthtoolFamily, NetdevFamily, NlError
+from lib.py import NetDrvEpEnv, GenerateTraffic
+from lib.py import cmd, defer, bpftrace, ethtool, rand_port
+
+
+def _do_bpftrace(cfg, mul, base_size, tgt_queue=None):
+ queue_filter = ''
+ if tgt_queue is not None:
+ queue_filter = 'if ($skb->queue_mapping != %d) {return;}' % (tgt_queue + 1, )
+
+ t = ('tracepoint:net:netif_receive_skb { ' +
+ '$skb = (struct sk_buff *)args->skbaddr; '+
+ '$sh = (struct skb_shared_info *)($skb->head + $skb->end); ' +
+ 'if ($skb->dev->ifindex != ' + str(cfg.ifindex) + ') {return;} ' +
+ queue_filter +
+ '@[$skb->len - $skb->data_len] = count(); ' +
+ '@h[$skb->len - $skb->data_len] = count(); ' +
+ 'if ($sh->nr_frags > 0) { @[$sh->frags[0].len] = count(); @d[$sh->frags[0].len] = count();} }'
+ )
+ maps = bpftrace(t, json=True, timeout=2)
+ # We expect one-dim array with something like:
+ # {"type": "map", "data": {"@": {"1500": 1, "719": 1,
+ sizes = maps["@"]
+ h = maps["@h"]
+ d = maps["@d"]
+ good = 0
+ bad = 0
+ for k, v in sizes.items():
+ k = int(k)
+ if mul == 1 and k > base_size:
+ bad += v
+ elif mul > 1 and k > base_size:
+ good += v
+ elif mul < 1 and k >= base_size:
+ bad += v
+ ksft_eq(bad, 0, "buffer was decreased but large buffers seen")
+ if mul > 1:
+ ksft_ge(good, 100, "buffer was increased but no large buffers seen")
+
+
+def _ethtool_create(cfg, act, opts):
+ output = ethtool(f"{act} {cfg.ifname} {opts}").stdout
+ # Output will be something like: "New RSS context is 1" or
+ # "Added rule with ID 7", we want the integer from the end
+ return int(output.split()[-1])
+
+
+def nic_wide(cfg, check_geometry=False) -> None:
+ """
+ Apply NIC wide rx-buf-len change. Run some traffic to make sure there
+ are no crashes. Test that setting 0 restores driver default.
+ Assume we start with the default.
+ """
+ try:
+ rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ except NlError as e:
+ rings = {}
+ if "rx-buf-len" not in rings:
+ raise KsftSkipEx('rx-buf-len configuration not supported by device')
+
+ if rings['rx-buf-len'] * 2 <= rings['rx-buf-len-max']:
+ mul = 2
+ else:
+ mul = 1/2
+ cfg.ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
+ 'rx-buf-len': rings['rx-buf-len'] * mul})
+
+ # Use zero to restore default, per uAPI, we assume we started with default
+ reset = defer(cfg.ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
+ 'rx-buf-len': 0})
+
+ new = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ ksft_eq(new['rx-buf-len'], rings['rx-buf-len'] * mul, "config after change")
+
+ # Runs some traffic thru them buffers, to make things implode if they do
+ traf = GenerateTraffic(cfg)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, mul, rings['rx-buf-len'])
+ finally:
+ traf.wait_pkts_and_stop(20000)
+
+ reset.exec()
+ new = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ ksft_eq(new['rx-buf-len'], rings['rx-buf-len'], "config reset to default")
+
+
+def nic_wide_check_packets(cfg) -> None:
+ nic_wide(cfg, check_geometry=True)
+
+
+def _check_queues_with_config(cfg, buf_len, qset):
+ cnt = 0
+ queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
+ for q in queues:
+ if 'rx-buf-len' in q:
+ cnt += 1
+ ksft_eq(q['type'], "rx")
+ ksft_in(q['id'], qset)
+ ksft_eq(q['rx-buf-len'], buf_len, "buf size")
+ if cnt != len(qset):
+ raise KsftFailEx('queue rx-buf-len config invalid')
+
+
+def _per_queue_configure(cfg) -> Tuple[dict, int, defer]:
+ """
+ Prep for per queue test. Set the config on one queue and return
+ the original ring settings, the multiplier and reset defer.
+ """
+ # Validate / get initial settings
+ try:
+ rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ except NlError as e:
+ rings = {}
+ if "rx-buf-len" not in rings:
+ raise KsftSkipEx('rx-buf-len configuration not supported by device')
+
+ try:
+ queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
+ except NlError as e:
+ raise KsftSkipEx('queue configuration not supported by device')
+
+ if len(queues) < 2:
+ raise KsftSkipEx('not enough queues: ' + str(len(queues)))
+ for q in queues:
+ if 'rx-buf-len' in q:
+ raise KsftFailEx('queue rx-buf-len already configured')
+
+ # Apply a change, we'll target queue 1
+ if rings['rx-buf-len'] * 2 <= rings['rx-buf-len-max']:
+ mul = 2
+ else:
+ mul = 1/2
+ try:
+ cfg.netnl.queue_set({'ifindex': cfg.ifindex, "type": "rx", "id": 1,
+ 'rx-buf-len': rings['rx-buf-len'] * mul })
+ except NlError as e:
+ if e.error == errno.EOPNOTSUPP:
+ raise KsftSkipEx('per-queue rx-buf-len configuration not supported')
+ raise
+
+ reset = defer(cfg.netnl.queue_set, {'ifindex': cfg.ifindex,
+ "type": "rx", "id": 1,
+ 'rx-buf-len': 0})
+ # Make sure config stuck
+ _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
+
+ return rings, mul, reset
+
+
+def per_queue(cfg, check_geometry=False) -> None:
+ """
+ Similar test to nic_wide, but done on a single queue (queue 1).
+ Flow filter is used to direct traffic to that queue.
+ """
+
+ rings, mul, reset = _per_queue_configure(cfg)
+ _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
+
+ # Check with traffic, we need to direct the traffic to the expected queue
+ port = rand_port()
+ flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port} action 1"
+ nid = _ethtool_create(cfg, "-N", flow)
+ ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
+
+ traf = GenerateTraffic(cfg, port=port)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, mul, rings['rx-buf-len'], tgt_queue=1)
+ finally:
+ traf.wait_pkts_and_stop(20000)
+ ntuple.exec()
+
+ # And now direct to another queue
+ flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port} action 0"
+ nid = _ethtool_create(cfg, "-N", flow)
+ ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
+
+ traf = GenerateTraffic(cfg, port=port)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=0)
+ finally:
+ traf.wait_pkts_and_stop(20000)
+
+ # Back to default
+ reset.exec()
+ queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
+ for q in queues:
+ ksft_not_in('rx-buf-len', q)
+
+
+def per_queue_check_packets(cfg) -> None:
+ per_queue(cfg, check_geometry=True)
+
+
+def queue_check_ring_count(cfg, check_geometry=False) -> None:
+ """
+ Make sure the change of ring count is handled correctly.
+ """
+ rings, mul, reset = _per_queue_configure(cfg)
+
+ channels = cfg.ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
+ if channels.get('combined-count', 0) < 4:
+ raise KsftSkipEx('need at least 4 rings, have',
+ channels.get('combined-count'))
+
+ # Move the channel count up and down, should make no difference
+ moves = [1, 0]
+ if channels['combined-count'] == channels['combined-max']:
+ moves = [-1, 0]
+ for move in moves:
+ target = channels['combined-count'] + move
+ cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
+ 'combined-count': target})
+
+ _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
+
+ # Check with traffic, we need to direct the traffic to the expected queue
+ port1 = rand_port()
+ flow1 = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port1} action 1"
+ nid = _ethtool_create(cfg, "-N", flow1)
+ ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
+
+ traf = GenerateTraffic(cfg, port=port1)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, mul, rings['rx-buf-len'], tgt_queue=1)
+ finally:
+ traf.wait_pkts_and_stop(20000)
+
+ # And now direct to another queue
+ port0 = rand_port()
+ flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port0} action 0"
+ nid = _ethtool_create(cfg, "-N", flow)
+ defer(ethtool, f"-N {cfg.ifname} delete {nid}")
+
+ traf = GenerateTraffic(cfg, port=port0)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=0)
+ finally:
+ traf.wait_pkts_and_stop(20000)
+
+ # Go to a single queue, should reset
+ ntuple.exec()
+ cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
+ 'combined-count': 1})
+ cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
+ 'combined-count': channels['combined-count']})
+
+ nid = _ethtool_create(cfg, "-N", flow1)
+ defer(ethtool, f"-N {cfg.ifname} delete {nid}")
+
+ queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
+ for q in queues:
+ ksft_not_in('rx-buf-len', q)
+
+ # Check with traffic that queue is now getting normal buffers
+ traf = GenerateTraffic(cfg, port=port1)
+ try:
+ if check_geometry:
+ _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=1)
+ finally:
+ traf.wait_pkts_and_stop(20000)
+
+
+def queue_check_ring_count_check_packets(cfg):
+ queue_check_ring_count(cfg, True)
+
+
+def main() -> None:
+ with NetDrvEpEnv(__file__) as cfg:
+ cfg.netnl = NetdevFamily()
+ cfg.ethnl = EthtoolFamily()
+
+
+ ksft_run([nic_wide,
+ nic_wide_check_packets,
+ per_queue,
+ per_queue_check_packets,
+ queue_check_ring_count,
+ queue_check_ring_count_check_packets],
+ args=(cfg, ))
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
--
2.49.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size
2025-04-21 22:28 ` [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size Jakub Kicinski
@ 2025-04-22 15:32 ` Stanislav Fomichev
2025-04-22 15:52 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 15:32 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> If the user decides to increase the buffer size for the agg ring
> we need to ask the page pool for higher-order pages.
> There is no need to use larger pages for header frags;
> if the user increases the size of agg ring buffers, switch
> to a separate header page automatically.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index b611a5ff6d3c..a86bb2ba5adb 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -3802,6 +3802,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> pp.pool_size = bp->rx_agg_ring_size;
> if (BNXT_RX_PAGE_MODE(bp))
> pp.pool_size += bp->rx_ring_size;
> + pp.order = get_order(bp->rx_page_size);
Since it's gonna be configured by the users going forward, for the
pps that don't have mp, we might want to check pp.order against
MAX_PAGE_ORDER (and/or PAGE_ALLOC_COSTLY_ORDER?) during
page_pool_create?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 13/22] net: add queue config validation callback
2025-04-21 22:28 ` [RFC net-next 13/22] net: add queue config validation callback Jakub Kicinski
@ 2025-04-22 15:49 ` Stanislav Fomichev
2025-04-22 20:16 ` Joe Damato
1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 15:49 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> I imagine (tm) that as the number of per-queue configuration
> options grows some of them may conflict for certain drivers.
> While the drivers can obviously do all the validation locally
> doing so is fairly inconvenient as the config is fed to drivers
> piecemeal via different ops (for different params and NIC-wide
> vs per-queue).
>
> Add a centralized callback for validating the queue config
> in queue ops. The callback gets invoked before each queue restart
> and when ring params are modified.
>
> For NIC-wide changes the callback gets invoked for each active
> (or active to-be) queue, and additionally with a negative queue
> index for NIC-wide defaults. The NIC-wide check is needed in
> case all queues have an override active when NIC-wide setting
> is changed to an unsupported one. Alternatively we could check
> the settings when new queues are enabled (in the channel API),
> but accepting invalid config is a bad idea. Users may expect
> that resetting a queue override will always work.
[..]
> The "trick" of passing a negative index is a bit ugly, we may
> want to revisit if it causes confusion and bugs. Existing drivers
> don't care about the index so it "just works".
+1, someone is gonna be bitten by it :-) Separate bool might be as ugly,
but more explicit, idk.
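The NIC-wide walk described in the quoted commit message can be sketched as a standalone model (hypothetical names, not kernel code): the validate callback runs once per active queue, then once with a negative index standing in for the NIC-wide defaults.

```python
def validate_nic_wide(num_queues, validate):
    """Model of NIC-wide config validation: call the driver's
    validate callback for every active queue, plus once with a
    negative index for the NIC-wide default config."""
    for idx in list(range(num_queues)) + [-1]:
        err = validate(idx)
        if err:
            return err
    return 0
```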
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops
2025-04-21 22:28 ` [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops Jakub Kicinski
@ 2025-04-22 15:50 ` Stanislav Fomichev
2025-04-22 20:18 ` Joe Damato
1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 15:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> Core provides a centralized callback for validating per-queue settings
> but the callback is part of the queue management ops. Having the ops
> conditionally set complicates the parts of the driver which could
> otherwise lean on the core to feed it the correct settings.
>
> Always set the queue ops, but provide no restart-related callbacks if
> queue ops are not supported by the device. This should maintain current
> behavior, the check in netdev_rx_queue_restart() looks both at op struct
> and individual ops.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 627132ce10df..49d66f4a5ad0 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -16039,6 +16039,9 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
> .ndo_queue_stop = bnxt_queue_stop,
> };
>
> +static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
> +};
> +
> static void bnxt_remove_one(struct pci_dev *pdev)
> {
> struct net_device *dev = pci_get_drvdata(pdev);
> @@ -16694,7 +16697,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> bp->rss_cap |= BNXT_RSS_CAP_MULTI_RSS_CTX;
> if (BNXT_SUPPORTS_QUEUE_API(bp))
> dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
> - dev->request_ops_lock = true;
> + else
> + dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
>
> rc = register_netdev(dev);
> if (rc)
> --
> 2.49.0
>
nit: maybe reflow as follows?
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
if (BNXT_SUPPORTS_QUEUE_API(bp))
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size
2025-04-22 15:32 ` Stanislav Fomichev
@ 2025-04-22 15:52 ` Jakub Kicinski
2025-04-22 17:27 ` Stanislav Fomichev
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-22 15:52 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On Tue, 22 Apr 2025 08:32:29 -0700 Stanislav Fomichev wrote:
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index b611a5ff6d3c..a86bb2ba5adb 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -3802,6 +3802,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > pp.pool_size = bp->rx_agg_ring_size;
> > if (BNXT_RX_PAGE_MODE(bp))
> > pp.pool_size += bp->rx_ring_size;
> > + pp.order = get_order(bp->rx_page_size);
>
> Since it's gonna be configured by the users going forward, for the
> pps that don't have mp, we might want to check pp.order against
> MAX_PAGE_ORDER (and/or PAGE_ALLOC_COSTLY_ORDER?) during
> page_pool_create?
Hm, interesting question. Major concern being that users will shoot
themselves in the foot? Or that syzbot will trigger a warning?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers
2025-04-21 22:28 ` [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Jakub Kicinski
@ 2025-04-22 16:13 ` Stanislav Fomichev
0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 16:13 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> The driver tries to provision more agg buffers than header buffers
> since multiple agg segments can reuse the same header. The calculation
> / heuristic tries to provide enough pages for 65k of data for each header
> (or 4 frags per header if the result is too big). This calculation is
> currently global to the adapter. If we increase the buffer sizes 8x
> we don't want 8x the amount of memory sitting on the rings.
> Luckily we don't have to fill the rings completely; adjust
> the fill level dynamically in case a particular queue has buffers
> larger than the global size.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 28f8a4e0d41b..43497b335329 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -3791,6 +3791,21 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> }
> }
>
> +static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
> + struct bnxt_rx_ring_info *rxr)
> +{
> + /* User may have chosen larger than default rx_page_size,
> + * we keep the ring sizes uniform and also want uniform amount
> + * of bytes consumed per ring, so cap how much of the rings we fill.
> + */
> + int fill_level = bp->rx_agg_ring_size;
> +
> + if (rxr->rx_page_size > bp->rx_page_size)
> + fill_level /= rxr->rx_page_size / bp->rx_page_size;
> +
> + return fill_level;
> +}
I don't know the driver well enough (so take it as an optional nit),
but seems like there should be a per-rxr rx_agg_ring_size pre-configured
somewhere. Deriving the fill_level from the per-queue rx_page_size feels
a bit hacky.
(or, rather, reading this snippet makes it hard for me to understand
what the size of the agg ring would be, given all the code in bnxt_set_ring_params)
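As an aside, the heuristic in the quoted hunk reduces to the following (a Python paraphrase with hypothetical sizes, not driver code; the helper name is made up):

```python
def agg_fill_level(rx_agg_ring_size, global_page_size, queue_page_size):
    """Mirror of bnxt_rx_agg_ring_fill_level(): keep the bytes sitting on
    each ring roughly uniform by filling fewer of the larger buffers on
    queues whose pages exceed the global size."""
    fill = rx_agg_ring_size
    if queue_page_size > global_page_size:
        fill //= queue_page_size // global_page_size
    return fill

assert agg_fill_level(2048, 4096, 4096) == 2048   # default size: fill everything
assert agg_fill_level(2048, 4096, 32768) == 256   # 8x larger buffers -> 1/8 fill
```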
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue
2025-04-21 22:28 ` [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue Jakub Kicinski
@ 2025-04-22 16:15 ` Stanislav Fomichev
2025-04-25 23:41 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 16:15 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> Zero-copy APIs increase the cost of buffer management. They also extend
> this cost to user space applications which may be used to dealing with
> much larger buffers. Allow setting rx-buf-len per queue, devices with
> HW-GRO support can commonly fill buffers up to 32k (or rather 64k - 1
> but that's not a power of 2..)
>
> The implementation adds a new option to the netdev netlink, rather
> than ethtool. The NIC-wide setting lives in ethtool ringparams so
> one could argue that we should be extending the ethtool API.
> OTOH netdev API is where we already have queue-get, and it's how
> zero-copy applications bind memory providers.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Documentation/netlink/specs/netdev.yaml | 15 ++++
> include/net/netdev_queues.h | 5 ++
> include/net/netlink.h | 19 +++++
> include/uapi/linux/netdev.h | 2 +
> net/core/netdev-genl-gen.h | 1 +
> tools/include/uapi/linux/netdev.h | 2 +
> net/core/netdev-genl-gen.c | 15 ++++
> net/core/netdev-genl.c | 92 +++++++++++++++++++++++++
> net/core/netdev_config.c | 16 +++++
> 9 files changed, 167 insertions(+)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index f5e0750ab71d..b0dfa970ee83 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -324,6 +324,10 @@ name: netdev
> doc: XSK information for this queue, if any.
> type: nest
> nested-attributes: xsk-info
> + -
> + name: rx-buf-len
> + doc: Per-queue configuration of ETHTOOL_A_RINGS_RX_BUF_LEN.
> + type: u32
> -
> name: qstats
> doc: |
> @@ -743,6 +747,17 @@ name: netdev
> - defer-hard-irqs
> - gro-flush-timeout
> - irq-suspend-timeout
> + -
> + name: queue-set
> + doc: Set per-queue configurable options.
> + attribute-set: queue
> + do:
> + request:
> + attributes:
> + - ifindex
> + - type
> + - id
> + - rx-buf-len
Do we want some guidance going forward on what belongs to queue-set
vs napi-set? (Mostly) HW settings for the queue in queue-set and
(mostly) SW settings for serving the queues in napi-set?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
@ 2025-04-22 16:19 ` David Wei
2025-04-22 19:48 ` Joe Damato
2025-04-23 20:08 ` Mina Almasry
2 siblings, 0 replies; 66+ messages in thread
From: David Wei @ 2025-04-22 16:19 UTC (permalink / raw)
To: Jakub Kicinski, davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On 2025-04-21 15:28, Jakub Kicinski wrote:
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
>
> The various zero-copy TCP Rx schemes we have suffer from memory
> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Documentation/networking/ethtool-netlink.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b6e9af4d0f1b..eaa9c17a3cb1 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
> driver. Driver may impose additional constraints and may not support all
> attributes.
>
> -
> ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
> Completion queue events (CQE) are the events posted by NIC to indicate the
> completion status of a packet when the packet is sent (like send success or
> @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> header / data split feature. If a received packet size is larger than this
> threshold value, header and data will be split.
>
> +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> +uses to receive packets. If the device uses different memory polls for headers
pools
> +and payload this setting may control the size of the header buffers but must
> +control the size of the payload buffers.
> +
> CHANNELS_GET
> ============
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 18/22] net: wipe the setting of deactived queues
2025-04-21 22:28 ` [RFC net-next 18/22] net: wipe the setting of deactived queues Jakub Kicinski
@ 2025-04-22 16:21 ` Stanislav Fomichev
2025-04-25 23:42 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 16:21 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> Clear out all settings of deactived queues when user changes
> the number of channels. We already perform similar cleanup
> for shapers.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> net/core/dev.h | 2 ++
> net/core/dev.c | 5 +++++
> net/core/netdev_config.c | 13 +++++++++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/net/core/dev.h b/net/core/dev.h
> index e0d433fb6325..4cdd8ac7df4f 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -101,6 +101,8 @@ void __netdev_queue_config(struct net_device *dev, int rxq,
> struct netdev_queue_config *qcfg, bool pending);
> int netdev_queue_config_revalidate(struct net_device *dev,
> struct netlink_ext_ack *extack);
> +void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
> + unsigned int rxq);
>
> /* netdev management, shared between various uAPI entry points */
> struct netdev_name_node {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7930b57d1767..c1f9b6ce6500 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3188,6 +3188,8 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
> if (dev->num_tc)
> netif_setup_tc(dev, txq);
>
> + netdev_queue_config_update_cnt(dev, txq,
> + dev->real_num_rx_queues);
> net_shaper_set_real_num_tx_queues(dev, txq);
>
> dev_qdisc_change_real_num_tx(dev, txq);
> @@ -3234,6 +3236,9 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
> rxq);
> if (rc)
> return rc;
> +
> + netdev_queue_config_update_cnt(dev, dev->real_num_tx_queues,
> + rxq);
> }
>
> dev->real_num_rx_queues = rxq;
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index ede02b77470e..c5ae39e76f40 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -64,6 +64,19 @@ int netdev_reconfig_start(struct net_device *dev)
> return -ENOMEM;
> }
>
> +void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
> + unsigned int rxq)
> +{
Presumably txq argument is here for some potential future use cases?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace
2025-04-21 22:28 ` [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace Jakub Kicinski
@ 2025-04-22 16:36 ` Stanislav Fomichev
2025-04-22 16:39 ` Stanislav Fomichev
2025-06-25 12:23 ` Breno Leitao
1 sibling, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 16:36 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> bpftrace is very useful for low level driver testing. perf or trace-cmd
> would also do for collecting data from tracepoints, but they require
> much more post-processing.
>
> Add a wrapper for running bpftrace and sanitizing its output.
> bpftrace has JSON output, which is great, but it prints loose objects
> and in a slightly inconvenient format. We have to read the objects
> line by line, and while at it return them indexed by the map name.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> tools/testing/selftests/net/lib/py/utils.py | 33 +++++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/tools/testing/selftests/net/lib/py/utils.py b/tools/testing/selftests/net/lib/py/utils.py
> index 34470d65d871..760ccf6fcccc 100644
> --- a/tools/testing/selftests/net/lib/py/utils.py
> +++ b/tools/testing/selftests/net/lib/py/utils.py
> @@ -185,6 +185,39 @@ global_defer_queue = []
> return tool('ethtool', args, json=json, ns=ns, host=host)
>
>
> +def bpftrace(expr, json=None, ns=None, host=None, timeout=None):
> + """
> + Run bpftrace and return map data (if json=True).
> + The output of bpftrace is inconvenient, so the helper converts
> + to a dict indexed by map name, e.g.:
> + {
> + "@": { ... },
> + "@map2": { ... },
> + }
> + """
> + cmd_arr = ['bpftrace']
> + # Throw in --quiet if json, otherwise the output has two objects
> + if json:
> + cmd_arr += ['-f', 'json', '-q']
> + if timeout:
> + expr += ' interval:s:' + str(timeout) + ' { exit(); }'
nit: any reason not to use format string here ^^
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace
2025-04-22 16:36 ` Stanislav Fomichev
@ 2025-04-22 16:39 ` Stanislav Fomichev
0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 16:39 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/22, Stanislav Fomichev wrote:
> On 04/21, Jakub Kicinski wrote:
> > bpftrace is very useful for low level driver testing. perf or trace-cmd
> > would also do for collecting data from tracepoints, but they require
> > much more post-processing.
> >
> > Add a wrapper for running bpftrace and sanitizing its output.
> > bpftrace has JSON output, which is great, but it prints loose objects
> > and in a slightly inconvenient format. We have to read the objects
> > line by line, and while at it return them indexed by the map name.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > ---
> > tools/testing/selftests/net/lib/py/utils.py | 33 +++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
> >
> > diff --git a/tools/testing/selftests/net/lib/py/utils.py b/tools/testing/selftests/net/lib/py/utils.py
> > index 34470d65d871..760ccf6fcccc 100644
> > --- a/tools/testing/selftests/net/lib/py/utils.py
> > +++ b/tools/testing/selftests/net/lib/py/utils.py
> > @@ -185,6 +185,39 @@ global_defer_queue = []
> > return tool('ethtool', args, json=json, ns=ns, host=host)
> >
> >
> > +def bpftrace(expr, json=None, ns=None, host=None, timeout=None):
> > + """
> > + Run bpftrace and return map data (if json=True).
> > + The output of bpftrace is inconvenient, so the helper converts
> > + to a dict indexed by map name, e.g.:
> > + {
> > + "@": { ... },
> > + "@map2": { ... },
> > + }
> > + """
> > + cmd_arr = ['bpftrace']
> > + # Throw in --quiet if json, otherwise the output has two objects
> > + if json:
> > + cmd_arr += ['-f', 'json', '-q']
> > + if timeout:
> > + expr += ' interval:s:' + str(timeout) + ' { exit(); }'
>
> nit: any reason not to use format string here ^^
Ah, probably because {} in the string...
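Indeed, bpftrace programs are full of literal braces, which an f-string would require doubling; a quick check that the two spellings produce the same program text:

```python
timeout = 2
# Concatenation, as in the patch:
suffix = ' interval:s:' + str(timeout) + ' { exit(); }'
# The f-string equivalent must escape every literal brace as {{ / }}:
suffix_f = f' interval:s:{timeout} {{ exit(); }}'
assert suffix == suffix_f == ' interval:s:2 { exit(); }'
```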
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len
2025-04-21 22:28 ` [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len Jakub Kicinski
@ 2025-04-22 17:06 ` Stanislav Fomichev
2025-04-25 23:52 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 17:06 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/21, Jakub Kicinski wrote:
> Add a test for rx-buf-len. Check both nic-wide and per-queue settings.
>
> # ./drivers/net/hw/rx_buf_len.py
> TAP version 13
> 1..6
> ok 1 rx_buf_len.nic_wide
> ok 2 rx_buf_len.nic_wide_check_packets
> ok 3 rx_buf_len.per_queue
> ok 4 rx_buf_len.per_queue_check_packets
> ok 5 rx_buf_len.queue_check_ring_count
> ok 6 rx_buf_len.queue_check_ring_count_check_packets
> # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> .../testing/selftests/drivers/net/hw/Makefile | 1 +
> .../selftests/drivers/net/hw/rx_buf_len.py | 299 ++++++++++++++++++
> 2 files changed, 300 insertions(+)
> create mode 100755 tools/testing/selftests/drivers/net/hw/rx_buf_len.py
>
> diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
> index 07cddb19ba35..88625c0e86c8 100644
> --- a/tools/testing/selftests/drivers/net/hw/Makefile
> +++ b/tools/testing/selftests/drivers/net/hw/Makefile
> @@ -20,6 +20,7 @@ TEST_PROGS = \
> pp_alloc_fail.py \
> rss_ctx.py \
> rss_input_xfrm.py \
> + rx_buf_len.py \
> tso.py \
> #
>
> diff --git a/tools/testing/selftests/drivers/net/hw/rx_buf_len.py b/tools/testing/selftests/drivers/net/hw/rx_buf_len.py
> new file mode 100755
> index 000000000000..d8a6d07fac5e
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/hw/rx_buf_len.py
> @@ -0,0 +1,299 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +
> +import errno, time
> +from typing import Tuple
> +from lib.py import ksft_run, ksft_exit, KsftSkipEx, KsftFailEx
> +from lib.py import ksft_eq, ksft_ge, ksft_in, ksft_not_in
> +from lib.py import EthtoolFamily, NetdevFamily, NlError
> +from lib.py import NetDrvEpEnv, GenerateTraffic
> +from lib.py import cmd, defer, bpftrace, ethtool, rand_port
> +
> +
> +def _do_bpftrace(cfg, mul, base_size, tgt_queue=None):
> + queue_filter = ''
> + if tgt_queue is not None:
> + queue_filter = 'if ($skb->queue_mapping != %d) {return;}' % (tgt_queue + 1, )
if tgt_queue: should work as well?
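One subtlety with the plain truthiness test: later in the file _do_bpftrace() is called with tgt_queue=0, and 0 is falsy, so "if tgt_queue:" is not equivalent to "is not None" (illustrative sketch, not the patch code):

```python
def queue_filter_truthy(tgt_queue=None):
    # Truthiness check: queue 0 is falsy, so its filter is silently dropped
    if tgt_queue:
        return 'if ($skb->queue_mapping != %d) {return;}' % (tgt_queue + 1,)
    return ''

def queue_filter_is_none(tgt_queue=None):
    # Explicit None check: queue 0 still gets a filter
    if tgt_queue is not None:
        return 'if ($skb->queue_mapping != %d) {return;}' % (tgt_queue + 1,)
    return ''

assert queue_filter_truthy(0) == ''        # filter lost for queue 0
assert queue_filter_is_none(0) != ''       # queue 0 correctly filtered
assert queue_filter_truthy(None) == queue_filter_is_none(None) == ''
```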
> +
> + t = ('tracepoint:net:netif_receive_skb { ' +
> + '$skb = (struct sk_buff *)args->skbaddr; '+
> + '$sh = (struct skb_shared_info *)($skb->head + $skb->end); ' +
> + 'if ($skb->dev->ifindex != ' + str(cfg.ifindex) + ') {return;} ' +
> + queue_filter +
> + '@[$skb->len - $skb->data_len] = count(); ' +
> + '@h[$skb->len - $skb->data_len] = count(); ' +
> + 'if ($sh->nr_frags > 0) { @[$sh->frags[0].len] = count(); @d[$sh->frags[0].len] = count();} }'
> + )
Why do we have @h and @d? We seem to check only the 'sizes'/@?
> + maps = bpftrace(t, json=True, timeout=2)
> + # We expect one-dim array with something like:
> + # {"type": "map", "data": {"@": {"1500": 1, "719": 1,
> + sizes = maps["@"]
> + h = maps["@h"]
> + d = maps["@d"]
> + good = 0
> + bad = 0
[..]
> + for k, v in sizes.items():
> + k = int(k)
> + if mul == 1 and k > base_size:
> + bad += v
> + elif mul > 1 and k > base_size:
> + good += v
> + elif mul < 1 and k >= base_size:
> + bad += v
I haven't fully processed what's going on here, but will it be
easier if we go from mul*base_size to old_size and new_size? Or maybe
the comments can help?
if old_size == new_size and frag > old_size:
# unchanged buf len, unexpected large frag
elif new_size < old_size and frag >= old_size:
# shrank buf len, but got old (big) frag
elif new_size > old_size and frag > old_size:
# good
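A runnable version of that classification, for concreteness (helper name hypothetical):

```python
def classify(old_size, new_size, frag):
    """Classify an observed linear/frag length given the buffer-size change."""
    if old_size == new_size and frag > old_size:
        return 'bad'      # unchanged buf len, unexpected large frag
    if new_size < old_size and frag >= old_size:
        return 'bad'      # shrank buf len, but still seeing old (big) frags
    if new_size > old_size and frag > old_size:
        return 'good'     # grew buf len and large frags showed up
    return 'neutral'      # small frags are always possible (short packets)

assert classify(4096, 8192, 8192) == 'good'
assert classify(4096, 2048, 4096) == 'bad'
assert classify(4096, 4096, 8192) == 'bad'
assert classify(4096, 8192, 1500) == 'neutral'
```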
> + ksft_eq(bad, 0, "buffer was decreased but large buffers seen")
> + if mul > 1:
> + ksft_ge(good, 100, "buffer was increased but no large buffers seen")
> +
> +
> +def _ethtool_create(cfg, act, opts):
> + output = ethtool(f"{act} {cfg.ifname} {opts}").stdout
> + # Output will be something like: "New RSS context is 1" or
> + # "Added rule with ID 7", we want the integer from the end
> + return int(output.split()[-1])
> +
> +
> +def nic_wide(cfg, check_geometry=False) -> None:
> + """
> + Apply NIC wide rx-buf-len change. Run some traffic to make sure there
> + are no crashes. Test that setting 0 restores driver default.
> + Assume we start with the default.
> + """
> + try:
> + rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
> + except NlError as e:
> + rings = {}
> + if "rx-buf-len" not in rings:
> + raise KsftSkipEx('rx-buf-len configuration not supported by device')
> +
> + if rings['rx-buf-len'] * 2 <= rings['rx-buf-len-max']:
> + mul = 2
> + else:
> + mul = 1/2
And similarly here? (and elsewhere)
def pick_buf_size(rings):
""" pick new rx-buf-len depending on current and max settings """
buf_len = rings['rx-buf-len']
if buf_len * 2 <= rings['rx-buf-len-max']:
# if can grow, try to grow
return buf_len, buf_len * 2
else:
# otherwise shrink
return buf_len, buf_len / 2
old_buf_len, new_buf_len = pick_buf_size(ring)
...
(or maybe it's just me, idk, easier to think in old>new comparisons vs
doing mul*base_size math)
> + cfg.ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
> + 'rx-buf-len': rings['rx-buf-len'] * mul})
> +
> + # Use zero to restore default, per uAPI, we assume we started with default
> + reset = defer(cfg.ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
> + 'rx-buf-len': 0})
> +
> + new = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
> + ksft_eq(new['rx-buf-len'], rings['rx-buf-len'] * mul, "config after change")
> +
> + # Runs some traffic thru them buffers, to make things implode if they do
> + traf = GenerateTraffic(cfg)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, mul, rings['rx-buf-len'])
> + finally:
> + traf.wait_pkts_and_stop(20000)
> +
> + reset.exec()
> + new = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
> + ksft_eq(new['rx-buf-len'], rings['rx-buf-len'], "config reset to default")
> +
> +
> +def nic_wide_check_packets(cfg) -> None:
> + nic_wide(cfg, check_geometry=True)
> +
> +
> +def _check_queues_with_config(cfg, buf_len, qset):
> + cnt = 0
> + queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
> + for q in queues:
> + if 'rx-buf-len' in q:
> + cnt += 1
> + ksft_eq(q['type'], "rx")
> + ksft_in(q['id'], qset)
> + ksft_eq(q['rx-buf-len'], buf_len, "buf size")
> + if cnt != len(qset):
> + raise KsftFailEx('queue rx-buf-len config invalid')
> +
> +
> +def _per_queue_configure(cfg) -> Tuple[dict, int, defer]:
> + """
> + Prep for per queue test. Set the config on one queue and return
> + the original ring settings, the multiplier and reset defer.
> + """
> + # Validate / get initial settings
> + try:
> + rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
> + except NlError as e:
> + rings = {}
> + if "rx-buf-len" not in rings:
> + raise KsftSkipEx('rx-buf-len configuration not supported by device')
> +
> + try:
> + queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
> + except NlError as e:
> + raise KsftSkipEx('queue configuration not supported by device')
> +
> + if len(queues) < 2:
> + raise KsftSkipEx('not enough queues: ' + str(len(queues)))
> + for q in queues:
> + if 'rx-buf-len' in q:
> + raise KsftFailEx('queue rx-buf-len already configured')
> +
> + # Apply a change, we'll target queue 1
> + if rings['rx-buf-len'] * 2 <= rings['rx-buf-len-max']:
> + mul = 2
> + else:
> + mul = 1/2
> + try:
> + cfg.netnl.queue_set({'ifindex': cfg.ifindex, "type": "rx", "id": 1,
> + 'rx-buf-len': rings['rx-buf-len'] * mul })
> + except NlError as e:
> + if e.error == errno.EOPNOTSUPP:
> + raise KsftSkipEx('per-queue rx-buf-len configuration not supported')
> + raise
> +
> + reset = defer(cfg.netnl.queue_set, {'ifindex': cfg.ifindex,
> + "type": "rx", "id": 1,
> + 'rx-buf-len': 0})
> + # Make sure config stuck
> + _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
> +
> + return rings, mul, reset
> +
> +
> +def per_queue(cfg, check_geometry=False) -> None:
> + """
> + Similar test to nic_wide, but done a single queue (queue 1).
> + Flow filter is used to direct traffic to that queue.
> + """
> +
> + rings, mul, reset = _per_queue_configure(cfg)
> + _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
> +
> + # Check with traffic, we need to direct the traffic to the expected queue
> + port = rand_port()
> + flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port} action 1"
> + nid = _ethtool_create(cfg, "-N", flow)
> + ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
> +
> + traf = GenerateTraffic(cfg, port=port)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, mul, rings['rx-buf-len'], tgt_queue=1)
> + finally:
> + traf.wait_pkts_and_stop(20000)
> + ntuple.exec()
> +
> + # And now direct to another queue
> + flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port} action 0"
> + nid = _ethtool_create(cfg, "-N", flow)
> + ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
> +
> + traf = GenerateTraffic(cfg, port=port)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=0)
> + finally:
> + traf.wait_pkts_and_stop(20000)
> +
> + # Back to default
> + reset.exec()
> + queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
> + for q in queues:
> + ksft_not_in('rx-buf-len', q)
> +
> +
> +def per_queue_check_packets(cfg) -> None:
> + per_queue(cfg, check_geometry=True)
> +
> +
> +def queue_check_ring_count(cfg, check_geometry=False) -> None:
> + """
> + Make sure the change of ring count is handled correctly.
> + """
> + rings, mul, reset = _per_queue_configure(cfg)
> +
> + channels = cfg.ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
> + if channels.get('combined-count', 0) < 4:
> + raise KsftSkipEx('need at least 4 rings, have',
> + channels.get('combined-count'))
> +
> + # Move the channel count up and down, should make no difference
> + moves = [1, 0]
> + if channels['combined-count'] == channels['combined-max']:
> + moves = [-1, 0]
> + for move in moves:
> + target = channels['combined-count'] + move
> + cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
> + 'combined-count': target})
> +
> + _check_queues_with_config(cfg, rings['rx-buf-len'] * mul, {1})
> +
> + # Check with traffic, we need to direct the traffic to the expected queue
> + port1 = rand_port()
> + flow1 = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port1} action 1"
> + nid = _ethtool_create(cfg, "-N", flow1)
> + ntuple = defer(ethtool, f"-N {cfg.ifname} delete {nid}")
> +
> + traf = GenerateTraffic(cfg, port=port1)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, mul, rings['rx-buf-len'], tgt_queue=1)
> + finally:
> + traf.wait_pkts_and_stop(20000)
> +
> + # And now direct to another queue
> + port0 = rand_port()
> + flow = f"flow-type tcp{cfg.addr_ipver} dst-ip {cfg.addr} dst-port {port0} action 0"
> + nid = _ethtool_create(cfg, "-N", flow)
> + defer(ethtool, f"-N {cfg.ifname} delete {nid}")
> +
> + traf = GenerateTraffic(cfg, port=port0)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=0)
> + finally:
> + traf.wait_pkts_and_stop(20000)
> +
> + # Go to a single queue, should reset
> + ntuple.exec()
> + cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
> + 'combined-count': 1})
> + cfg.ethnl.channels_set({'header': {'dev-index': cfg.ifindex},
> + 'combined-count': channels['combined-count']})
> +
> + nid = _ethtool_create(cfg, "-N", flow1)
> + defer(ethtool, f"-N {cfg.ifname} delete {nid}")
> +
> + queues = cfg.netnl.queue_get({'ifindex': cfg.ifindex}, dump=True)
> + for q in queues:
> + ksft_not_in('rx-buf-len', q)
> +
> + # Check with traffic that queue is now getting normal buffers
> + traf = GenerateTraffic(cfg, port=port1)
> + try:
> + if check_geometry:
> + _do_bpftrace(cfg, 1, rings['rx-buf-len'], tgt_queue=1)
> + finally:
> + traf.wait_pkts_and_stop(20000)
> +
> +
> +def queue_check_ring_count_check_packets(cfg):
> + queue_check_ring_count(cfg, True)
> +
> +
> +def main() -> None:
> + with NetDrvEpEnv(__file__) as cfg:
> + cfg.netnl = NetdevFamily()
> + cfg.ethnl = EthtoolFamily()
> +
> + o = [nic_wide,
> + per_queue,
> + nic_wide_check_packets]
o?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size
2025-04-22 15:52 ` Jakub Kicinski
@ 2025-04-22 17:27 ` Stanislav Fomichev
0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2025-04-22 17:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On 04/22, Jakub Kicinski wrote:
> On Tue, 22 Apr 2025 08:32:29 -0700 Stanislav Fomichev wrote:
> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > index b611a5ff6d3c..a86bb2ba5adb 100644
> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > @@ -3802,6 +3802,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > > pp.pool_size = bp->rx_agg_ring_size;
> > > if (BNXT_RX_PAGE_MODE(bp))
> > > pp.pool_size += bp->rx_ring_size;
> > > + pp.order = get_order(bp->rx_page_size);
> >
> > Since it's gonna be configured by the users going forward, for the
> > pps that don't have mp, we might want to check pp.order against
> > MAX_PAGE_ORDER (and/or PAGE_ALLOC_COSTLY_ORDER?) during
> > page_pool_create?
>
> Hm, interesting question. Major concern being that users will shoot
> themselves in the foot? Or that syzbot will trigger a warning?
Yeah, both, some WARN_ON in the page allocator. I did trigger one for
MAX_PAGE_ORDER at some point, but not sure about PAGE_ALLOC_COSTLY_ORDER.
Just thinking from the overall setup point of view. I'm assuming that if we
are gonna support >PAGE_SIZE devmem chunks we'll have to tune rx-buf-len
_after_ we've bound dmabuf to the queue? (otherwise we are hitting those
PAGE_ALLOC_COSTLY_ORDER if we do it before) And there is no revert back
to reasonable rx-buf-len on unbind. Or for devmem we'll have some TBD way
to communicate "preferred" rx-buf-len per queue (derived from the dmabuf
binding itself) during the bind?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
2025-04-22 16:19 ` David Wei
@ 2025-04-22 19:48 ` Joe Damato
2025-04-23 20:08 ` Mina Almasry
2 siblings, 0 replies; 66+ messages in thread
From: Joe Damato @ 2025-04-22 19:48 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
dtatulea, michael.chan
On Mon, Apr 21, 2025 at 03:28:06PM -0700, Jakub Kicinski wrote:
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
FWIW the docs added below don't explicitly mention HDS, but I
suppose that is implied from the multiple memory pools? Not sure if
it's worth elucidating that.
> The various zero-copy TCP Rx schemes we have suffer from memory
> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
I think adding something about the above into the docs could be
helpful, but then again I am always in favor of more words.
See below for a minor spelling nit.
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Documentation/networking/ethtool-netlink.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b6e9af4d0f1b..eaa9c17a3cb1 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
> driver. Driver may impose additional constraints and may not support all
> attributes.
>
> -
> ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
> Completion queue events (CQE) are the events posted by NIC to indicate the
> completion status of a packet when the packet is sent (like send success or
> @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> header / data split feature. If a received packet size is larger than this
> threshold value, header and data will be split.
>
> +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> +uses to receive packets. If the device uses different memory polls for headers
pools? ^^
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 04/22] net: clarify the meaning of netdev_config members
2025-04-21 22:28 ` [RFC net-next 04/22] net: clarify the meaning of netdev_config members Jakub Kicinski
@ 2025-04-22 19:57 ` Joe Damato
0 siblings, 0 replies; 66+ messages in thread
From: Joe Damato @ 2025-04-22 19:57 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
dtatulea, michael.chan
On Mon, Apr 21, 2025 at 03:28:09PM -0700, Jakub Kicinski wrote:
> hds_thresh and hds_config are both inside struct netdev_config
> but have quite different semantics. hds_config is the user config
> with ternary semantics (on/off/unset). hds_thresh is a straight
> up value, populated by the driver at init and only modified by
> user space. We don't expect the drivers to have to pick a special
> hds_thresh value based on other configuration.
>
> The two approaches have different advantages and downsides.
> hds_thresh ("direct value") gives core easy access to current
> device settings, but there's no way to express whether the value
> comes from the user. It also requires the initialization by
> the driver.
>
> hds_config ("user config values") tells us what user wanted, but
> doesn't give us the current value in the core.
>
> Try to explain this a bit in the comments, so that we make a conscious
> choice about which semantics we expect for new values.
>
> Move the init inside ethtool_ringparam_get_cfg() to reflect the semantics.
> Commit 216a61d33c07 ("net: ethtool: fix ethtool_ringparam_get_cfg()
> returns a hds_thresh value always as 0.") added the setting for the
> benefit of netdevsim which doesn't touch the value at all on get.
> Again, this is just to clarify the intention, shouldn't cause any
> functional change.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> include/net/netdev_queues.h | 19 +++++++++++++++++--
> net/ethtool/common.c | 3 ++-
> 2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index ea709b59d827..f4eab6fc64f4 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -6,11 +6,26 @@
>
> /**
> * struct netdev_config - queue-related configuration for a netdev
> - * @hds_thresh: HDS Threshold value.
> - * @hds_config: HDS value from userspace.
> */
> struct netdev_config {
> + /* Direct value
> + *
> + * Driver default is expected to be fixed, and set in this struct
> + * at init. From that point on user may change the value. There is
> + * no explicit way to "unset" / restore driver default.
> + */
> + /** @hds_thresh: HDS Threshold value (ETHTOOL_A_RINGS_HDS_THRESH).
> + */
> u32 hds_thresh;
> +
> + /* User config values
> + *
> + * Contain user configuration. If "set" driver must obey.
> + * If "unset" driver is free to decide, and may change its choice
> + * as other parameters change.
> + */
> + /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
> + */
> u8 hds_config;
> };
Not your fault and not exactly related to this patch, but in general
I found it a bit confusing that the two fields are different widths.
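To make the two styles concrete, here is a toy userspace model of the semantics described in the comments above; the enum values and helper are invented for illustration, not actual kernel code:

```c
#include <assert.h>

/* Toy model of the "user config value" style: 0 means unset and the
 * driver decides, anything else is a user choice the driver must obey.
 * hds_thresh, by contrast, is always a concrete value seeded at init.
 */
enum { HDS_UNSET = 0, HDS_OFF = 1, HDS_ON = 2 };

/* Resolve the effective setting a driver would act on. */
static int resolve_hds(int user_cfg, int driver_default)
{
	return user_cfg == HDS_UNSET ? driver_default : user_cfg;
}
```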
* Re: [RFC net-next 13/22] net: add queue config validation callback
2025-04-21 22:28 ` [RFC net-next 13/22] net: add queue config validation callback Jakub Kicinski
2025-04-22 15:49 ` Stanislav Fomichev
@ 2025-04-22 20:16 ` Joe Damato
1 sibling, 0 replies; 66+ messages in thread
From: Joe Damato @ 2025-04-22 20:16 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
dtatulea, michael.chan
On Mon, Apr 21, 2025 at 03:28:18PM -0700, Jakub Kicinski wrote:
> The "trick" of passing a negative index is a bit ugly, we may
> want to revisit if it causes confusion and bugs. Existing drivers
> don't care about the index so it "just works".
FWIW: I thought some drivers use the "-1 means all queues" trick for
things internally (I thought I saw at least one driver use it for
per-queue vs NIC-wide coalesce settings?) so maybe it isn't *that*
strange
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -99,3 +99,23 @@ void netdev_queue_config(struct net_device *dev, int rxq,
> __netdev_queue_config(dev, rxq, qcfg, true);
> }
> EXPORT_SYMBOL(netdev_queue_config);
> +
> +int netdev_queue_config_revalidate(struct net_device *dev,
> + struct netlink_ext_ack *extack)
> +{
> + const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> + struct netdev_queue_config qcfg;
> + int i, err;
> +
> + if (!qops || !qops->ndo_queue_cfg_validate)
> + return 0;
> +
> + for (i = -1; i < (int)dev->real_num_rx_queues; i++) {
Not gonna lie: this looks weird. But maybe all that matters is that
people following this changeset are OK with it?
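For readers following along, the -1 convention walks the NIC-wide defaults slot first and then each real queue; a standalone toy model of that loop (validate_one() is a stand-in for the driver callback, the limits are invented):

```c
#include <assert.h>

/* Toy model of the loop in netdev_queue_config_revalidate() above:
 * index -1 stands for the NIC-wide config, 0..n-1 for real queues.
 */
static int validate_one(int idx, int buf_len)
{
	/* Pretend queue 2 has pinned memory and rejects changes. */
	if (idx == 2 && buf_len != 4096)
		return -1;
	return 0;
}

static int revalidate_all(int nqueues, int buf_len)
{
	int i;

	for (i = -1; i < nqueues; i++)
		if (validate_one(i, buf_len) < 0)
			return -1;
	return 0;
}
```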
* Re: [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops
2025-04-21 22:28 ` [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops Jakub Kicinski
2025-04-22 15:50 ` Stanislav Fomichev
@ 2025-04-22 20:18 ` Joe Damato
1 sibling, 0 replies; 66+ messages in thread
From: Joe Damato @ 2025-04-22 20:18 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
dtatulea, michael.chan
On Mon, Apr 21, 2025 at 03:28:19PM -0700, Jakub Kicinski wrote:
> Core provides a centralized callback for validating per-queue settings
> but the callback is part of the queue management ops. Having the ops
> conditionally set complicates the parts of the driver which could
> otherwise lean on the core to feel it the correct settings.
feed ? ^^^^
Otherwise, I think I like Stanislav's suggestion on code
structure/flow.
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-21 22:28 ` [RFC net-next 19/22] eth: bnxt: use queue op config validate Jakub Kicinski
@ 2025-04-23 10:00 ` Dragos Tatulea
2025-04-23 13:46 ` Jakub Kicinski
2025-06-12 11:56 ` Dragos Tatulea
1 sibling, 1 reply; 66+ messages in thread
From: Dragos Tatulea @ 2025-04-23 10:00 UTC (permalink / raw)
To: Jakub Kicinski, davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato,
michael.chan
On Mon, Apr 21, 2025 at 03:28:24PM -0700, Jakub Kicinski wrote:
> Move the rx-buf-len config validation to the queue ops.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 40 +++++++++++++++++++
> .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ------
> 2 files changed, 40 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 43497b335329..a772ffaf3e5b 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -16052,8 +16052,46 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
> return 0;
> }
>
> +static int
> +bnxt_queue_cfg_validate(struct net_device *dev, int idx,
> + struct netdev_queue_config *qcfg,
> + struct netlink_ext_ack *extack)
> +{
> + struct bnxt *bp = netdev_priv(dev);
> +
> + /* Older chips need MSS calc so rx_buf_len is not supported,
> + * but we don't set queue ops for them so we should never get here.
> + */
> + if (qcfg->rx_buf_len != bp->rx_page_size &&
> + !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> + NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> + return -EINVAL;
> + }
> +
> + if (!is_power_of_2(qcfg->rx_buf_len)) {
> + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
> + return -ERANGE;
> + }
> + if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> + qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
> + return -ERANGE;
> + }
> + return 0;
> +}
> +
HDS off and rx_buf_len > 4K seems to be accepted. Is this intended?
> +static void
> +bnxt_queue_cfg_defaults(struct net_device *dev, int idx,
> + struct netdev_queue_config *qcfg)
> +{
> + qcfg->rx_buf_len = BNXT_RX_PAGE_SIZE;
> +}
> +
> static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
> .ndo_queue_mem_size = sizeof(struct bnxt_rx_ring_info),
> +
> + .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
> + .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
> .ndo_queue_mem_alloc = bnxt_queue_mem_alloc,
> .ndo_queue_mem_free = bnxt_queue_mem_free,
> .ndo_queue_start = bnxt_queue_start,
> @@ -16061,6 +16099,8 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
> };
>
> static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops_unsupp = {
> + .ndo_queue_cfg_defaults = bnxt_queue_cfg_defaults,
> + .ndo_queue_cfg_validate = bnxt_queue_cfg_validate,
> };
>
> static void bnxt_remove_one(struct pci_dev *pdev)
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index 956f51449709..8842390f687f 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -867,18 +867,6 @@ static int bnxt_set_ringparam(struct net_device *dev,
> if (!kernel_ering->rx_buf_len) /* Zero means restore default */
> kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
>
> - if (kernel_ering->rx_buf_len != bp->rx_page_size &&
> - !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> - NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> - return -EINVAL;
> - }
> - if (!is_power_of_2(kernel_ering->rx_buf_len) ||
> - kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> - kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> - NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
> - return -ERANGE;
> - }
> -
> if (netif_running(dev))
> bnxt_close_nic(bp, false, false);
>
> --
> 2.49.0
>
Thanks,
Dragos
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-23 10:00 ` Dragos Tatulea
@ 2025-04-23 13:46 ` Jakub Kicinski
2025-04-23 14:24 ` Dragos Tatulea
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-23 13:46 UTC (permalink / raw)
To: Dragos Tatulea
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Wed, 23 Apr 2025 10:00:01 +0000 Dragos Tatulea wrote:
> > +static int
> > +bnxt_queue_cfg_validate(struct net_device *dev, int idx,
> > + struct netdev_queue_config *qcfg,
> > + struct netlink_ext_ack *extack)
> > +{
> > + struct bnxt *bp = netdev_priv(dev);
> > +
> > + /* Older chips need MSS calc so rx_buf_len is not supported,
> > + * but we don't set queue ops for them so we should never get here.
> > + */
> > + if (qcfg->rx_buf_len != bp->rx_page_size &&
> > + !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> > + NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> > + return -EINVAL;
> > + }
> > +
> > + if (!is_power_of_2(qcfg->rx_buf_len)) {
> > + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
> > + return -ERANGE;
> > + }
> > + if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> > + qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> > + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
> > + return -ERANGE;
> > + }
> > + return 0;
> > +}
> > +
> > HDS off and rx_buf_len > 4K seems to be accepted. Is this intended?
For bnxt rx_buf_len only applies to the "payload buffers".
I should document that, and retest with XDP.
I posted a doc recently with a "design guide" for API interfaces,
it said:
Visibility
==========
To simplify the implementations, configuration parameters of disabled features
do not have to be hidden or inaccessible.
Which I intended to mean that configuring something that isn't enabled
is okay. IIRC we also don't reject setting hds threshold if hds is off.
Hope I understood what you're getting at.
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-23 13:46 ` Jakub Kicinski
@ 2025-04-23 14:24 ` Dragos Tatulea
2025-04-23 15:33 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Dragos Tatulea @ 2025-04-23 14:24 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Wed, Apr 23, 2025 at 06:46:53AM -0700, Jakub Kicinski wrote:
> On Wed, 23 Apr 2025 10:00:01 +0000 Dragos Tatulea wrote:
> > > +static int
> > > +bnxt_queue_cfg_validate(struct net_device *dev, int idx,
> > > + struct netdev_queue_config *qcfg,
> > > + struct netlink_ext_ack *extack)
> > > +{
> > > + struct bnxt *bp = netdev_priv(dev);
> > > +
> > > + /* Older chips need MSS calc so rx_buf_len is not supported,
> > > + * but we don't set queue ops for them so we should never get here.
> > > + */
> > > + if (qcfg->rx_buf_len != bp->rx_page_size &&
> > > + !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> > > + NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + if (!is_power_of_2(qcfg->rx_buf_len)) {
> > > + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
> > > + return -ERANGE;
> > > + }
> > > + if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> > > + qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> > > + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
> > > + return -ERANGE;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > HDS off and rx_buf_len > 4K seems to be accepted. Is this intended?
>
> For bnxt rx_buf_len only applies to the "payload buffers".
> I should document that, and retest with XDP.
>
> I posted a doc recently with a "design guide" for API interfaces,
> it said:
>
> Visibility
> ==========
>
> To simplify the implementations, configuration parameters of disabled features
> do not have to be hidden or inaccessible.
>
> Which I intended to mean that configuring something that isn't enabled
> is okay. IIRC we also don't reject setting hds threshold if hds is off.
>
> Hope I understood what you're getting at.
My bad. My question was too terse and not generic enough. What I meant
to ask was:
With this new API, should drivers be allowed to use high order pages
from the page_pool regardless of HDS mode? From your reply I understand
that it is a yes.
Thanks,
Dragos
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-23 14:24 ` Dragos Tatulea
@ 2025-04-23 15:33 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-23 15:33 UTC (permalink / raw)
To: Dragos Tatulea
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Wed, 23 Apr 2025 14:24:56 +0000 Dragos Tatulea wrote:
> With this new API, should drivers be allowed to use high order pages
> from the page_pool regardless of HDS mode? From your reply I understand
> that it is a yes.
Yes, I was trying to cover that in patch 1, I think? Up to the driver.
* Re: [RFC net-next 00/22] net: per-queue rx-buf-len configuration
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
` (21 preceding siblings ...)
2025-04-21 22:28 ` [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len Jakub Kicinski
@ 2025-04-23 20:02 ` Mina Almasry
2025-04-25 23:55 ` Jakub Kicinski
22 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 20:02 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Add support for per-queue rx-buf-len configuration.
>
> I'm sending this as RFC because I'd like to ponder the uAPI side
> a little longer but it's good enough for people to work on
> the memory provider side and support in other drivers.
>
Maybe a silly question, but I assume opting into this is optional for
queue API drivers? Or do you need GVE to implement dependencies of
this very soon, otherwise it's blocking your work?
> The direct motivation for the series is that zero-copy Rx queues would
> like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
> and can coalesce payloads into pages much larger than the MTU.
> Enabling larger buffers globally is a bit precarious as it exposes us
> to potentially very inefficient memory use. Also allocating large
> buffers may not be easy or cheap under load. Zero-copy queues service
> only select traffic and have pre-allocated memory so the concerns don't
> apply as much.
>
> The per-queue config has to address 3 problems:
> - user API
> - driver API
> - memory provider API
>
> For user API the main question is whether we expose the config via
> ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
> than extending the ethtool RINGS_GET API. I worry slightly that queue
> GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
> settings we have in ethtool which are not going via RINGS_SET is
> IRQ coalescing.
>
> My goal for the driver API was to avoid complexity in the drivers.
> The queue management API has gained two ops, responsible for preparing
> configuration for a given queue, and validating whether the config
> is supported. The validating is used both for NIC-wide and per-queue
> changes. Queue alloc/start ops have a new "config" argument which
> contains the current config for a given queue (we use queue restart
> to apply per-queue settings). Outside of queue reset paths drivers
> can call netdev_queue_config() which returns the config for an arbitrary
> queue. Long story short I anticipate it to be used during ndo_open.
>
> In the core I extended struct netdev_config with per queue settings.
> All in all this isn't too far from what was there in my "queue API
> prototype" a few years ago. One thing I was hoping to support but
> haven't gotten to is providing the settings at the RSS context level.
> Zero-copy users often depend on RSS for load spreading. It'd be more
> convenient for them to provide the settings per RSS context.
> We may be better off converting the QUEUE_SET netlink op to CONFIG_SET
> and accept multiple "scopes" (queue, rss context)?
>
> Memory provider API is a bit tricky. Initially I wasn't sure whether
> the buffer size should be a MP attribute or a device attribute.
> IOW whether it's the device that should be telling the MP what page
> size it wants, or the MP telling the device what page size it has.
I think it needs to be the former. Memory providers will have wildly
differing restrictions with regard to size. I think the dmabuf mp can
already allocate a net_iov of any byte size, and the io_uring mp any
multiple of PAGE_SIZE. MPs communicating their restrictions over a
uniform interface with the driver seems difficult to define. Better
for the driver to ask the pp/mp for what it wants, and the mp can
complain if it doesn't support it.
Also this mirrors what we do today with page_pool_params.order arg
IIRC. You probably want to piggyback off that or rework it.
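For reference, the order arithmetic itself is simple; a sketch of mapping an rx_buf_len to a page order, assuming power-of-2 lengths and 4K pages (TOY_PAGE_SHIFT is an assumption for the example; the kernel's get_order() does this for real):

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12 /* 4K pages, an assumption for the example */

/* Smallest order whose page size covers rx_buf_len. */
static unsigned int buf_len_to_order(unsigned int rx_buf_len)
{
	unsigned int order = 0;

	while ((1u << (order + TOY_PAGE_SHIFT)) < rx_buf_len)
		order++;
	return order;
}
```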
--
Thanks,
Mina
* Re: [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
2025-04-22 16:19 ` David Wei
2025-04-22 19:48 ` Joe Damato
@ 2025-04-23 20:08 ` Mina Almasry
2025-04-25 22:50 ` Jakub Kicinski
2 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 20:08 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Document the semantics of the rx_buf_len ethtool ring param.
> Clarify its meaning in case of HDS, where driver may have
> two separate buffer pools.
>
> The various zero-copy TCP Rx schemes we have suffer from memory
> management overhead. Specifically applications aren't too impressed
> with the number of 4kB buffers they have to juggle. Zero-copy
> TCP makes most sense with larger memory transfers so using
> 16kB or 32kB buffers (with the help of HW-GRO) feels more
> natural.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Documentation/networking/ethtool-netlink.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index b6e9af4d0f1b..eaa9c17a3cb1 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -957,7 +957,6 @@ Kernel checks that requested ring sizes do not exceed limits reported by
> driver. Driver may impose additional constraints and may not support all
> attributes.
>
> -
> ``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
> Completion queue events (CQE) are the events posted by NIC to indicate the
> completion status of a packet when the packet is sent (like send success or
> @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> header / data split feature. If a received packet size is larger than this
> threshold value, header and data will be split.
>
> +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> +uses to receive packets. If the device uses different memory polls for headers
> +and payload this setting may control the size of the header buffers but must
> +control the size of the payload buffers.
> +
FWIW I don't like the ambiguity that the setting may or may not apply
to header buffers. AFAIU header buffers are supposed to be in the
order of tens/hundreds of bytes while the payload buffers are 1-2
orders of magnitude larger. Why would a driver even want this setting
to apply for both? I would prefer this setting to apply to only
payload buffers.
--
Thanks,
Mina
* Re: [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct
2025-04-21 22:28 ` [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct Jakub Kicinski
@ 2025-04-23 20:35 ` Mina Almasry
2025-04-25 22:51 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 20:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
> inside struct bnxt. This will allow configuring the page size
> at runtime in subsequent patches.
>
> The MSS size calculation for older chips continues to use the constant.
> I'm intending to support the configuration only on more recent HW;
> it looks like setting this per queue won't work on older chips,
> and per-queue is the ultimate goal.
>
> This patch should not change the current behavior as value
> read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 27 ++++++++++---------
> drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 4 +--
> 3 files changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 868a2e5a5b02..158b8f96f50c 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -2358,6 +2358,7 @@ struct bnxt {
> u16 max_tpa;
> u32 rx_buf_size;
> u32 rx_buf_use_size; /* useable size */
> + u16 rx_page_size;
I think you want a hunk that sets:
rx_page_size = BNXT_RX_PAGE_SIZE;
In this patch? I could not find it, I don't know if I missed it. I
know in latery patches you're going to set this variable differently,
but for bisects and what not you may want to retain the current
behavior.
--
Thanks,
Mina
* Re: [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool
2025-04-21 22:28 ` [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool Jakub Kicinski
@ 2025-04-23 21:00 ` Mina Almasry
2025-04-25 22:58 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 21:00 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> bnxt seems to be able to aggregate data up to 32kB without any issue.
> The driver is already capable of doing this for systems with higher
> order pages. While for systems with 4k pages we historically preferred
> to stick to small buffers because they are easier to allocate, the
> zero-copy APIs remove the allocation problem. The ZC mem is
> pre-allocated and fixed size.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.h | 3 ++-
> .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 21 ++++++++++++++++++-
> 2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 158b8f96f50c..1723909bde77 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -758,7 +758,8 @@ struct nqe_cn {
> #define BNXT_RX_PAGE_SHIFT PAGE_SHIFT
> #endif
>
> -#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
> +#define BNXT_MAX_RX_PAGE_SIZE (1 << 15)
> +#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
>
> #define BNXT_MAX_MTU 9500
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index 48dd5922e4dd..956f51449709 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -835,6 +835,8 @@ static void bnxt_get_ringparam(struct net_device *dev,
> ering->rx_jumbo_pending = bp->rx_agg_ring_size;
> ering->tx_pending = bp->tx_ring_size;
>
> + kernel_ering->rx_buf_len_max = BNXT_MAX_RX_PAGE_SIZE;
> + kernel_ering->rx_buf_len = bp->rx_page_size;
> kernel_ering->hds_thresh_max = BNXT_HDS_THRESHOLD_MAX;
> }
>
> @@ -862,6 +864,21 @@ static int bnxt_set_ringparam(struct net_device *dev,
> return -EINVAL;
> }
>
> + if (!kernel_ering->rx_buf_len) /* Zero means restore default */
> + kernel_ering->rx_buf_len = BNXT_RX_PAGE_SIZE;
> +
> + if (kernel_ering->rx_buf_len != bp->rx_page_size &&
> + !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> + NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> + return -EINVAL;
> + }
> + if (!is_power_of_2(kernel_ering->rx_buf_len) ||
> + kernel_ering->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> + kernel_ering->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range, or not power of 2");
Where does the power of 2 limitation come from? bnxt itself? Or some mp issue?
dmabuf mp can do non-power of 2 rx_buf_len, I think. I haven't tested
recently. It may be good to only validate here what bnxt can't do at
all, and let a later check in the pp/mp let us know if the mp doesn't
like the size.
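A toy model of that split, with invented limits, just to illustrate who would reject what: the driver checks only the hard HW range, and the memory provider separately rejects lengths it cannot serve.

```c
#include <assert.h>

/* Driver side: only what the HW truly cannot do (range is invented). */
static int drv_validate(unsigned int len)
{
	return (len >= 4096 && len <= 32768) ? 0 : -1;
}

/* MP side: e.g. an io_uring-like provider wanting page multiples. */
static int mp_validate(unsigned int len)
{
	return (len % 4096) == 0 ? 0 : -1;
}

static int validate(unsigned int len)
{
	return drv_validate(len) ? -1 : mp_validate(len);
}
```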
--
Thanks,
Mina
* Re: [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members
2025-04-21 22:28 ` [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Jakub Kicinski
@ 2025-04-23 21:04 ` Mina Almasry
0 siblings, 0 replies; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 21:04 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Trivial change, reduce the indent. I think the original is copied
> from real NDOs.
That's exactly what happened :D
> It's unnecessarily deep, makes passing struct args
> problematic.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
--
Thanks,
Mina
* Re: [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API
2025-04-21 22:28 ` [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API Jakub Kicinski
@ 2025-04-23 21:17 ` Mina Almasry
2025-04-25 23:24 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Mina Almasry @ 2025-04-23 21:17 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Create an array of config structs to store per-queue config.
> Pass these structs in the queue API. Drivers can also retrieve
> the config for a single queue calling netdev_queue_config()
> directly.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> include/net/netdev_queues.h | 19 +++++++
> net/core/dev.h | 3 ++
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++-
> drivers/net/ethernet/google/gve/gve_main.c | 9 ++--
> drivers/net/netdevsim/netdev.c | 6 ++-
> net/core/netdev_config.c | 58 ++++++++++++++++++++++
> net/core/netdev_rx_queue.c | 11 ++--
> 7 files changed, 104 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index c50de8db72ce..1fb621a00962 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -31,6 +31,13 @@ struct netdev_config {
> /** @hds_config: HDS enabled (ETHTOOL_A_RINGS_TCP_DATA_SPLIT).
> */
> u8 hds_config;
> +
> + /** @qcfg: per-queue configuration */
> + struct netdev_queue_config *qcfg;
> +};
> +
> +/* Same semantics as fields in struct netdev_config */
> +struct netdev_queue_config {
> };
>
> /* See the netdev.yaml spec for definition of each statistic */
> @@ -129,6 +136,10 @@ struct netdev_stat_ops {
> *
> * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
> *
> + * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
> + * defaults. Queue config structs are passed to this
> + * helper before the user-requested settings are applied.
> + *
> * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
> * The new memory is written at the specified address.
> *
> @@ -146,12 +157,17 @@ struct netdev_stat_ops {
> */
> struct netdev_queue_mgmt_ops {
> size_t ndo_queue_mem_size;
> + void (*ndo_queue_cfg_defaults)(struct net_device *dev,
> + int idx,
> + struct netdev_queue_config *qcfg);
> int (*ndo_queue_mem_alloc)(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> void *per_queue_mem,
> int idx);
> void (*ndo_queue_mem_free)(struct net_device *dev,
> void *per_queue_mem);
> int (*ndo_queue_start)(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> void *per_queue_mem,
> int idx);
> int (*ndo_queue_stop)(struct net_device *dev,
Doesn't the stop call need to return the config of the queue that was
stopped? Otherwise what do you pass to the start command if you want
to restart the old queue?
In general I thought that when we extend the queue API to handle
configs, the config of each queue would be part of the per_queue_mem
attribute, which is now a void *. Because it seems to me the config needs
to be passed around with the per_queue_mem to all the functions?
Maybe.
I imagined you'd create a new wrapper struct that is the per queue
mem, and that one will contain a void * that is driver specific
memory, and a netdev_queue_config * inside of it as well, then the
queue API will use the new struct instead of void * for all the
per_queue_mem. At least that's what I was thinking.
> @@ -159,6 +175,9 @@ struct netdev_queue_mgmt_ops {
> int idx);
> };
>
> +void netdev_queue_config(struct net_device *dev, int rxq,
> + struct netdev_queue_config *qcfg);
> +
> /**
> * DOC: Lockless queue stopping / waking helpers.
> *
> diff --git a/net/core/dev.h b/net/core/dev.h
> index c8971c6f1fcd..6d7f5e920018 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -9,6 +9,7 @@
> #include <net/netdev_lock.h>
>
> struct net;
> +struct netdev_queue_config;
> struct netlink_ext_ack;
> struct cpumask;
>
> @@ -96,6 +97,8 @@ int netdev_alloc_config(struct net_device *dev);
> void __netdev_free_config(struct netdev_config *cfg);
> void netdev_free_config(struct net_device *dev);
> int netdev_reconfig_start(struct net_device *dev);
> +void __netdev_queue_config(struct net_device *dev, int rxq,
> + struct netdev_queue_config *qcfg, bool pending);
>
> /* netdev management, shared between various uAPI entry points */
> struct netdev_name_node {
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index a86bb2ba5adb..fbbe02cefdf1 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -15728,7 +15728,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
> .get_base_stats = bnxt_get_base_stats,
> };
>
> -static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
> +static int bnxt_queue_mem_alloc(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *qmem, int idx)
> {
> struct bnxt_rx_ring_info *rxr, *clone;
> struct bnxt *bp = netdev_priv(dev);
> @@ -15896,7 +15898,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
> dst->rx_agg_bmap = src->rx_agg_bmap;
> }
>
> -static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
> +static int bnxt_queue_start(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *qmem, int idx)
> {
> struct bnxt *bp = netdev_priv(dev);
> struct bnxt_rx_ring_info *rxr, *clone;
> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> index 8aaac9101377..088ae19ddb24 100644
> --- a/drivers/net/ethernet/google/gve/gve_main.c
> +++ b/drivers/net/ethernet/google/gve/gve_main.c
> @@ -2428,8 +2428,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
> gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
> }
>
> -static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
> - int idx)
> +static int gve_rx_queue_mem_alloc(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *per_q_mem, int idx)
> {
> struct gve_priv *priv = netdev_priv(dev);
> struct gve_rx_alloc_rings_cfg cfg = {0};
> @@ -2450,7 +2451,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
> return err;
> }
>
> -static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
> +static int gve_rx_queue_start(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *per_q_mem, int idx)
> {
> struct gve_priv *priv = netdev_priv(dev);
> struct gve_rx_ring *gve_per_q_mem;
> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 0e0321a7ddd7..a2aa85a85e0f 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -661,7 +661,8 @@ struct nsim_queue_mem {
> };
>
> static int
> -nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
> +nsim_queue_mem_alloc(struct net_device *dev, struct netdev_queue_config *qcfg,
> + void *per_queue_mem, int idx)
> {
> struct nsim_queue_mem *qmem = per_queue_mem;
> struct netdevsim *ns = netdev_priv(dev);
> @@ -710,7 +711,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
> }
>
> static int
> -nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
> +nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
> + void *per_queue_mem, int idx)
> {
> struct nsim_queue_mem *qmem = per_queue_mem;
> struct netdevsim *ns = netdev_priv(dev);
> diff --git a/net/core/netdev_config.c b/net/core/netdev_config.c
> index 270b7f10a192..bad2d53522f0 100644
> --- a/net/core/netdev_config.c
> +++ b/net/core/netdev_config.c
> @@ -8,18 +8,29 @@
> int netdev_alloc_config(struct net_device *dev)
> {
> struct netdev_config *cfg;
> + unsigned int maxqs;
>
> cfg = kzalloc(sizeof(*dev->cfg), GFP_KERNEL_ACCOUNT);
> if (!cfg)
> return -ENOMEM;
>
> + maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
> + cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
I just thought of this, but for devices that can add new rx/tx queues
on the fly, isn't num_rx_queues dynamically changing? How do you size
this qcfg array in this case?
Or do they go through a full driver reset every time they add a queue
which reallocates dev->num_rx_queues?
...
> +
> +void __netdev_queue_config(struct net_device *dev, int rxq,
> + struct netdev_queue_config *qcfg, bool pending)
> +{
> + memset(qcfg, 0, sizeof(*qcfg));
> +
> + /* Get defaults from the driver, in case user config not set */
> + if (dev->queue_mgmt_ops->ndo_queue_cfg_defaults)
> + dev->queue_mgmt_ops->ndo_queue_cfg_defaults(dev, rxq, qcfg);
> +}
> +
> +/**
> + * netdev_queue_config() - get configuration for a given queue
> + * @dev: net_device instance
> + * @rxq: index of the queue of interest
> + * @qcfg: queue configuration struct (output)
> + *
> + * Render the configuration for a given queue. This helper should be used
> + * by drivers which support queue configuration to retrieve config for
> + * a particular queue.
> + *
> + * @qcfg is an output parameter and is always fully initialized by this
> + * function. Some values may not be set by the user, drivers may either
> + * deal with the "unset" values in @qcfg, or provide the callback
> + * to populate defaults in queue_management_ops.
> + *
> + * Note that this helper returns pending config, as it is expected that
> + * "old" queues are retained until config is successful so they can
> + * be restored directly without asking for the config.
> + */
> +void netdev_queue_config(struct net_device *dev, int rxq,
> + struct netdev_queue_config *qcfg)
> +{
> + __netdev_queue_config(dev, rxq, qcfg, true);
> +}
> +EXPORT_SYMBOL(netdev_queue_config);
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index d126f10197bf..d8a710db21cd 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -7,12 +7,14 @@
> #include <net/netdev_rx_queue.h>
> #include <net/page_pool/memory_provider.h>
>
> +#include "dev.h"
> #include "page_pool_priv.h"
>
> int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> {
> struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
> const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
> + struct netdev_queue_config qcfg;
> void *new_mem, *old_mem;
> int err;
>
> @@ -32,7 +34,9 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> goto err_free_new_mem;
> }
>
> - err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
> + netdev_queue_config(dev, rxq_idx, &qcfg);
> +
> + err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
> if (err)
> goto err_free_old_mem;
>
> @@ -45,7 +49,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> if (err)
> goto err_free_new_queue_mem;
>
> - err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
> + err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
> if (err)
> goto err_start_queue;
> } else {
> @@ -60,6 +64,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> return 0;
>
> err_start_queue:
> + __netdev_queue_config(dev, rxq_idx, &qcfg, false);
Ah, on error, you reset the queue to its defaults, which I'm not sure
is desirable. I think we want to restart the queue with whatever
config it had before.
> /* Restarting the queue with old_mem should be successful as we haven't
> * changed any of the queue configuration, and there is not much we can
> * do to recover from a failure here.
> @@ -67,7 +72,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> * WARN if we fail to recover the old rx queue, and at least free
> * old_mem so we don't also leak that.
> */
> - if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
> + if (qops->ndo_queue_start(dev, &qcfg, old_mem, rxq_idx)) {
> WARN(1,
> "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
> rxq_idx);
> --
> 2.49.0
>
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-23 20:08 ` Mina Almasry
@ 2025-04-25 22:50 ` Jakub Kicinski
2025-04-25 23:20 ` Joe Damato
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 22:50 UTC (permalink / raw)
To: Mina Almasry, Joe Damato
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, dtatulea,
michael.chan
On Wed, 23 Apr 2025 13:08:33 -0700 Mina Almasry wrote:
> On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> > header / data split feature. If a received packet size is larger than this
> > threshold value, header and data will be split.
> >
> > +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> > +uses to receive packets. If the device uses different memory polls for headers
> > +and payload this setting may control the size of the header buffers but must
> > +control the size of the payload buffers.
>
> FWIW I don't like the ambiguity that the setting may or may not apply
> to header buffers. AFAIU header buffers are supposed to be in the
> order of tens/hundreds of bytes while the payload buffers are 1-2
> orders of magnitude larger. Why would a driver even want this setting
> to apply for both? I would prefer this setting to apply to only
> payload buffers.
Okay, I have no strong reason to leave the ambiguity.
Converging the thread with Joe:
>> Document the semantics of the rx_buf_len ethtool ring param.
>> Clarify its meaning in case of HDS, where driver may have
>> two separate buffer pools.
>
> FWIW the docs added below don't explicitly mention HDS, but I
> suppose that is implied from the multiple memory pools? Not sure if
> it's worth elucidating that.
Maybe not sufficiently. Some NICs just have buffer pools for
different-sized packets, but then the buffer size should be implied
by the size range?
How about:
``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffers driver
uses to receive packets. If the device uses different buffer pools for
headers and payload (due to HDS, HW-GRO etc.) this setting must
control the size of the payload buffers.
* Re: [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct
2025-04-23 20:35 ` Mina Almasry
@ 2025-04-25 22:51 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 22:51 UTC (permalink / raw)
To: Mina Almasry
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Wed, 23 Apr 2025 13:35:00 -0700 Mina Almasry wrote:
> On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Switch from using a constant to storing the BNXT_RX_PAGE_SIZE
> > inside struct bnxt. This will allow configuring the page size
> > at runtime in subsequent patches.
> >
> > The MSS size calculation for older chip continues to use the constant.
> > I'm intending to support the configuration only on more recent HW,
> > looks like on older chips setting this per queue won't work,
> > and that's the ultimate goal.
> >
> > This patch should not change the current behavior as value
> > read from the struct will always be BNXT_RX_PAGE_SIZE at this stage.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > ---
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 27 ++++++++++---------
> > drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 4 +--
> > 3 files changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > index 868a2e5a5b02..158b8f96f50c 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > @@ -2358,6 +2358,7 @@ struct bnxt {
> > u16 max_tpa;
> > u32 rx_buf_size;
> > u32 rx_buf_use_size; /* useable size */
> > + u16 rx_page_size;
> I think you want a hunk that sets:
>
> rx_page_size = BNXT_RX_PAGE_SIZE;
>
> In this patch? I could not find it, I don't know if I missed it. I
> know in later patches you're going to set this variable differently,
> but for bisects and what not you may want to retain the current
> behavior.
Hm, it's here, last chunk for drivers/net/ethernet/broadcom/bnxt/bnxt.c:
@@ -16486,6 +16486,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
bp = netdev_priv(dev);
bp->board_idx = ent->driver_data;
bp->msg_enable = BNXT_DEF_MSG_ENABLE;
+ bp->rx_page_size = BNXT_RX_PAGE_SIZE;
bnxt_set_max_func_irqs(bp, max_irqs);
if (bnxt_vf_pciid(bp->board_idx))
* Re: [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool
2025-04-23 21:00 ` Mina Almasry
@ 2025-04-25 22:58 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 22:58 UTC (permalink / raw)
To: Mina Almasry
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Wed, 23 Apr 2025 14:00:01 -0700 Mina Almasry wrote:
> Where does the power of 2 limitation come from? bnxt itself? Or some mp issue?
>
> dmabuf mp can do non-power of 2 rx_buf_len, I think. I haven't tested
> recently. It may be good to only validate here what bnxt can't do at
> all, and let a later check in the pp/mp let us know if the mp doesn't
> like the size.
I haven't actually tested anything else, but there's no real reason
for the restriction at this point. I was wondering if it's worth trying
to allow 64k - 1 since that'd still fit in 16 bits. But I left that for
future work, because it will make all copy offsets funny-sized. We'd
probably want something like 64k - 4k? Dunno, either way slightly
unpleasant. There's probably more that can be done from the NIC side.
* Re: [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths
2025-04-25 22:50 ` Jakub Kicinski
@ 2025-04-25 23:20 ` Joe Damato
0 siblings, 0 replies; 66+ messages in thread
From: Joe Damato @ 2025-04-25 23:20 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Mina Almasry, davem, netdev, edumazet, pabeni, andrew+netdev,
horms, donald.hunter, sdf, dw, asml.silence, ap420073, dtatulea,
michael.chan
On Fri, Apr 25, 2025 at 03:50:34PM -0700, Jakub Kicinski wrote:
> On Wed, 23 Apr 2025 13:08:33 -0700 Mina Almasry wrote:
> > On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > @@ -971,6 +970,11 @@ completion queue size can be adjusted in the driver if CQE size is modified.
> > > header / data split feature. If a received packet size is larger than this
> > > threshold value, header and data will be split.
> > >
> > > +``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffer chunks driver
> > > +uses to receive packets. If the device uses different memory polls for headers
> > > +and payload this setting may control the size of the header buffers but must
> > > +control the size of the payload buffers.
> >
> > FWIW I don't like the ambiguity that the setting may or may not apply
> > to header buffers. AFAIU header buffers are supposed to be in the
> > order of tens/hundreds of bytes while the payload buffers are 1-2
> > orders of magnitude larger. Why would a driver even want this setting
> > to apply for both? I would prefer this setting to apply to only
> > payload buffers.
>
> Okay, I have no strong reason to leave the ambiguity.
>
> Converging the thread with Joe:
>
> >> Document the semantics of the rx_buf_len ethtool ring param.
> >> Clarify its meaning in case of HDS, where driver may have
> >> two separate buffer pools.
> >
> > FWIW the docs added below don't explicitly mention HDS, but I
> > suppose that is implied from the multiple memory pools? Not sure if
> > it's worth elucidating that.
>
> Maybe not sufficiently. Some NICs just have buffer pools for different
> sized packets, but then the buffer size should be implied by the size
> range?
>
>
> How about:
>
> ``ETHTOOL_A_RINGS_RX_BUF_LEN`` controls the size of the buffers driver
> uses to receive packets. If the device uses different buffer pools for
> headers and payload (due to HDS, HW-GRO etc.) this setting must
> control the size of the payload buffers.
SGTM
* Re: [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API
2025-04-23 21:17 ` Mina Almasry
@ 2025-04-25 23:24 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 23:24 UTC (permalink / raw)
To: Mina Almasry
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Wed, 23 Apr 2025 14:17:30 -0700 Mina Almasry wrote:
> > @@ -129,6 +136,10 @@ struct netdev_stat_ops {
> > *
> > * @ndo_queue_mem_size: Size of the struct that describes a queue's memory.
> > *
> > + * @ndo_queue_cfg_defaults: (Optional) Populate queue config struct with
> > + * defaults. Queue config structs are passed to this
> > + * helper before the user-requested settings are applied.
> > + *
> > * @ndo_queue_mem_alloc: Allocate memory for an RX queue at the specified index.
> > * The new memory is written at the specified address.
> > *
> > @@ -146,12 +157,17 @@ struct netdev_stat_ops {
> > */
> > struct netdev_queue_mgmt_ops {
> > size_t ndo_queue_mem_size;
> > + void (*ndo_queue_cfg_defaults)(struct net_device *dev,
> > + int idx,
> > + struct netdev_queue_config *qcfg);
> > int (*ndo_queue_mem_alloc)(struct net_device *dev,
> > + struct netdev_queue_config *qcfg,
> > void *per_queue_mem,
> > int idx);
> > void (*ndo_queue_mem_free)(struct net_device *dev,
> > void *per_queue_mem);
> > int (*ndo_queue_start)(struct net_device *dev,
> > + struct netdev_queue_config *qcfg,
> > void *per_queue_mem,
> > int idx);
> > int (*ndo_queue_stop)(struct net_device *dev,
>
> Doesn't the stop call need to return the config of the queue that was
> stopped? Otherwise what do you pass to the start command if you want
> to restart the old queue?
As you say below, I expect the driver to retain the config within qmem.
> In general I thought what when we extend the queue API to handle
> configs, the config of each queue would be part of the per_queue_mem
> attribute, which is now a void *. Because seems to me the config needs
> to be passed around with the per_queue_mem to all the functions?
> Maybe.
>
> I imagined you'd create a new wrapper struct that is the per queue
> mem, and that one will contain a void * that is driver specific
> memory, and a netdev_queue_config * inside of it as well, then the
> queue API will use the new struct instead of void * for all the
> per_queue_mem. At least that's what I was thinking.
That sounds nice from the API perspective, but I was worried about
perf impact. Some driver folks count each cycle and we may be mixing
cache cold and cache hot data in the config. At the very least not
all drivers will support all fields. So my gut feeling was that driver
authors will want to assign the fields to their own struct members,
anyway. Perhaps something we can revisit once we have more mileage?
> > + maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
> > + cfg->qcfg = kcalloc(maxqs, sizeof(*cfg->qcfg), GFP_KERNEL_ACCOUNT);
>
> I just thought of this, but for devices that can add new rx/tx queues on
> the fly, isn't num_rx_queues dynamically changing? How do you size
> this qcfg array in this case?
>
> Or do they go through a full driver reset every time they add a queue
> which reallocates dev->num_rx_queues?
Yes, good question. The easiest way to deal with that I thought of was
to act like old school memory management: "normal" queues allocate
indexes 0 -> n, "dynamic" queues allocate n -> 0. Like stack vs heap
growth. And we need to patch up all this code that naively checks
queue ID vs num_real, we'll need a bitmap or a member in the structs.
>>> + __netdev_queue_config(dev, rxq_idx, &qcfg, false);
>
> Ah, on error, you reset the queue to its defaults, which I'm not sure
> is desirable. I think we want to restart the queue with whatever
> config it had before.
Not defaults, to be clear. The last argument here (false) tells
the config code to use the _current_ config, not the pending one.
So we should be reverting to what's currently installed.
* Re: [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue
2025-04-22 16:15 ` Stanislav Fomichev
@ 2025-04-25 23:41 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 23:41 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On Tue, 22 Apr 2025 09:15:59 -0700 Stanislav Fomichev wrote:
> > + -
> > + name: queue-set
> > + doc: Set per-queue configurable options.
> > + attribute-set: queue
> > + do:
> > + request:
> > + attributes:
> > + - ifindex
> > + - type
> > + - id
> > + - rx-buf-len
>
> Do we want some guidance going forward on what belongs to queue-set
> vs napi-set? (mostly) HW settings for the queue-set and (mostly) SW
> settings for (serving the queues) in napi-set?
One NAPI can have multiple queues hanging off, so all HW settings
with the exception of IRQ coalescing should probably be on the queue?
* Re: [RFC net-next 18/22] net: wipe the setting of deactived queues
2025-04-22 16:21 ` Stanislav Fomichev
@ 2025-04-25 23:42 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 23:42 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On Tue, 22 Apr 2025 09:21:54 -0700 Stanislav Fomichev wrote:
> > +void netdev_queue_config_update_cnt(struct net_device *dev, unsigned int txq,
> > + unsigned int rxq)
> > +{
>
> Presumably txq argument is here for some potential future use cases?
Yes, wanted to make sure we handle it all here, instead of adding
another call.
* Re: [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len
2025-04-22 17:06 ` Stanislav Fomichev
@ 2025-04-25 23:52 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 23:52 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On Tue, 22 Apr 2025 10:06:48 -0700 Stanislav Fomichev wrote:
> > +def _do_bpftrace(cfg, mul, base_size, tgt_queue=None):
> > + queue_filter = ''
> > + if tgt_queue is not None:
> > + queue_filter = 'if ($skb->queue_mapping != %d) {return;}' % (tgt_queue + 1, )
>
> if tgt_queue: should work as well?
Ha! I'm glad I'm not the only one who would fall into this trap.
tgt_queue = 0 is legit and actually used by the test.
So we'd collect the data for the whole system if the test asked
for filtering just to queue 0. This breaks the test when queue 1
has large buffers and queue 0 small ones, and we want to prove
queue 0 is not getting large buffers.
> > +
> > + t = ('tracepoint:net:netif_receive_skb { ' +
> > + '$skb = (struct sk_buff *)args->skbaddr; '+
> > + '$sh = (struct skb_shared_info *)($skb->head + $skb->end); ' +
> > + 'if ($skb->dev->ifindex != ' + str(cfg.ifindex) + ') {return;} ' +
> > + queue_filter +
> > + '@[$skb->len - $skb->data_len] = count(); ' +
> > + '@h[$skb->len - $skb->data_len] = count(); ' +
> > + 'if ($sh->nr_frags > 0) { @[$sh->frags[0].len] = count(); @d[$sh->frags[0].len] = count();} }'
> > + )
>
> Why do we have @h and @d? We seem to check only the 'sizes'/@?
:o sorry, leftover debug, thought I deleted it
> > + maps = bpftrace(t, json=True, timeout=2)
> > + # We expect one-dim array with something like:
> > + # {"type": "map", "data": {"@": {"1500": 1, "719": 1,
> > + sizes = maps["@"]
> > + h = maps["@h"]
> > + d = maps["@d"]
> > + good = 0
> > + bad = 0
> > +def nic_wide(cfg, check_geometry=False) -> None:
> > + """
> > + Apply NIC wide rx-buf-len change. Run some traffic to make sure there
> > + are no crashes. Test that setting 0 restores driver default.
> > + Assume we start with the default.
> > + """
> > + try:
> > + rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
> > + except NlError as e:
> > + rings = {}
> > + if "rx-buf-len" not in rings:
> > + raise KsftSkipEx('rx-buf-len configuration not supported by device')
> > +
> > + if rings['rx-buf-len'] * 2 <= rings['rx-buf-len-max']:
> > + mul = 2
> > + else:
> > + mul = 1/2
>
> And similarly here? (and elsewhere)
>
> def pick_buf_size(rings):
> """ pick new rx-buf-len depending on current and max settings """
> buf_len = rings['rx-buf-len']
> if buf_len * 2 <= rings['rx-buf-len-max']:
> # if can grow, try to grow
> return buf_len, buf_len * 2
> else:
> # otherwise shrink
> return buf_len, buf_len / 2
>
> old_buf_len, new_buf_len = pick_buf_size(ring)
> ...
>
> (or maybe its just me, idk, easier to think in old>new comparisons vs
> doing mul*base_size math)
I'd still keep old_buf as base_buf, because for per-queue changes
it's not really old, as it's still active on the remaining queues.
Otherwise SGTM.
* Re: [RFC net-next 00/22] net: per-queue rx-buf-len configuration
2025-04-23 20:02 ` [RFC net-next 00/22] net: per-queue rx-buf-len configuration Mina Almasry
@ 2025-04-25 23:55 ` Jakub Kicinski
0 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-04-25 23:55 UTC (permalink / raw)
To: Mina Almasry
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, dw, asml.silence, ap420073, jdamato, dtatulea,
michael.chan
On Wed, 23 Apr 2025 13:02:52 -0700 Mina Almasry wrote:
> On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > Add support for per-queue rx-buf-len configuration.
> >
> > I'm sending this as RFC because I'd like to ponder the uAPI side
> > a little longer but it's good enough for people to work on
> > the memory provider side and support in other drivers.
> >
>
> May be silly question, but I assume opting into this is optional for
> queue API drivers? Or do you need GVE to implement dependencies of
> this very soon otherwise it's blocking your work?
Completely optional, I think it has to be.
> I think it needs to be the former. Memory providers will have wildly
> differing restrictions in regards to size. I think already the dmabuf
> mp can allocate any byte size net_iov. I think the io_uring mp can
> allocate any multiple of PAGE_SIZE net_iov. MPs communicating their
> restrictions over a uniform interface with the driver seems difficult
> to define. Better for the driver to ask the pp/mp what it wants, and
> the mp can complain if it doesn't support it.
>
> Also this mirrors what we do today with page_pool_params.order arg
> IIRC. You probably want to piggy back off that or rework it.
Yup. I added the rx-buf-len-max size reporting because I suspect
applications will need to discover the NIC capabilities.
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-04-21 22:28 ` [RFC net-next 19/22] eth: bnxt: use queue op config validate Jakub Kicinski
2025-04-23 10:00 ` Dragos Tatulea
@ 2025-06-12 11:56 ` Dragos Tatulea
2025-06-12 14:10 ` Jakub Kicinski
1 sibling, 1 reply; 66+ messages in thread
From: Dragos Tatulea @ 2025-06-12 11:56 UTC (permalink / raw)
To: Jakub Kicinski, davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, donald.hunter,
sdf, almasrymina, dw, asml.silence, ap420073, jdamato,
michael.chan
On Mon, Apr 21, 2025 at 03:28:24PM -0700, Jakub Kicinski wrote:
> Move the rx-buf-len config validation to the queue ops.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 40 +++++++++++++++++++
> .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ------
> 2 files changed, 40 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 43497b335329..a772ffaf3e5b 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -16052,8 +16052,46 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
> return 0;
> }
>
> +static int
> +bnxt_queue_cfg_validate(struct net_device *dev, int idx,
> + struct netdev_queue_config *qcfg,
> + struct netlink_ext_ack *extack)
> +{
> + struct bnxt *bp = netdev_priv(dev);
> +
> + /* Older chips need MSS calc so rx_buf_len is not supported,
> + * but we don't set queue ops for them so we should never get here.
> + */
> + if (qcfg->rx_buf_len != bp->rx_page_size &&
> + !(bp->flags & BNXT_FLAG_CHIP_P5_PLUS)) {
> + NL_SET_ERR_MSG_MOD(extack, "changing rx-buf-len not supported");
> + return -EINVAL;
> + }
> +
> + if (!is_power_of_2(qcfg->rx_buf_len)) {
> + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len is not power of 2");
> + return -ERANGE;
> + }
> + if (qcfg->rx_buf_len < BNXT_RX_PAGE_SIZE ||
> + qcfg->rx_buf_len > BNXT_MAX_RX_PAGE_SIZE) {
> + NL_SET_ERR_MSG_MOD(extack, "rx-buf-len out of range");
> + return -ERANGE;
> + }
> + return 0;
> +}
> +
Consider the hypothetical situation where the user configures a buffer
larger than ring size * MTU. Should the check happen in validate, or
should the max buffer size be dynamic, depending on ring size and MTU?
It is currently hypothetical, as BNXT_MAX_RX_PAGE_SIZE is 32K. But for
reference it would be good to know/document.
Thanks,
Dragos
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-12 11:56 ` Dragos Tatulea
@ 2025-06-12 14:10 ` Jakub Kicinski
2025-06-12 15:52 ` Dragos Tatulea
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-06-12 14:10 UTC (permalink / raw)
To: Dragos Tatulea
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Thu, 12 Jun 2025 11:56:26 +0000 Dragos Tatulea wrote:
> For the hypothetical situation when the user configures a larger buffer
> than the ring size * MTU. Should the check happen in validate or should
> the max buffer size be dynamic depending on ring size and MTU?
Hm, why does the ring size come into the calculation?
I don't think it's a practical configuration, so it should be perfectly
fine for the driver to reject it. But in principle, if the user wants to
configure a 128-entry ring with 1MB buffers... I guess they must have
a lot of DRAM to waste, but other than that I don't see a reason to
stop them within the core?
Documenting sounds good, just wanna make sure I understand the potential
ambiguity.
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-12 14:10 ` Jakub Kicinski
@ 2025-06-12 15:52 ` Dragos Tatulea
2025-06-12 22:30 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Dragos Tatulea @ 2025-06-12 15:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Thu, Jun 12, 2025 at 07:10:28AM -0700, Jakub Kicinski wrote:
> On Thu, 12 Jun 2025 11:56:26 +0000 Dragos Tatulea wrote:
> > For the hypothetical situation when the user configures a larger buffer
> > than the ring size * MTU. Should the check happen in validate or should
> > the max buffer size be dynamic depending on ring size and MTU?
>
> Hm, why does the ring size come into the calculation?
>
There is a relationship between ring size, MTU and how much memory a queue
would need for a full ring, right? Even if the relationship is driver-dependent.
> I don't think it's a practical configuration, so it should be perfectly
> fine for the driver to reject it. But in principle if user wants to
> configure a 128 entry ring with 1MB buffers.. I guess they must have
> a lot of DRAM to waste, but other than that I don't see a reason to
> stop them within the core?
>
Ok, so config can be rejected. How about the driver changing the allowed
max based on the current ring size and MTU? This would allow larger
buffers on larger rings and MTUs.
There is another interesting case where the user specifies some large
buffer size which amounts to roughly the max memory for the current ring
and MTU configuration. We'd end up with a page_pool with a size of one
which is not very useful...
> Documenting sounds good, just wanna make sure I understand the potential
> ambiguity.
Is it clearer now? I was just thinking about how to add support for this
in mlx5 and stumbled into this grey area.
Thanks,
Dragos
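The page-pool-of-one corner case mentioned above can be made concrete with a toy calculation; all names and numbers here are made up for illustration, not taken from any driver:

```c
#include <stdbool.h>

/* Toy model: if a queue derives a total memory budget from ring size
 * and MTU, the pool holds budget / rx_buf_len buffers. */
static unsigned int ex_pool_entries(unsigned int budget_bytes,
				    unsigned int rx_buf_len)
{
	return budget_bytes / rx_buf_len;
}

/* The grey area discussed above: a buffer length close to the whole
 * budget leaves a pool holding a single buffer, which defeats the
 * point of a pool. */
static bool ex_pool_too_small(unsigned int budget_bytes,
			      unsigned int rx_buf_len)
{
	return ex_pool_entries(budget_bytes, rx_buf_len) <= 1;
}
```

With a hypothetical 1 MB budget, a 4 KB buffer gives a 256-entry pool, while a 1 MB buffer degenerates to a single entry.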
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-12 15:52 ` Dragos Tatulea
@ 2025-06-12 22:30 ` Jakub Kicinski
2025-06-13 19:02 ` Dragos Tatulea
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-06-12 22:30 UTC (permalink / raw)
To: Dragos Tatulea
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Thu, 12 Jun 2025 15:52:12 +0000 Dragos Tatulea wrote:
> On Thu, Jun 12, 2025 at 07:10:28AM -0700, Jakub Kicinski wrote:
> > On Thu, 12 Jun 2025 11:56:26 +0000 Dragos Tatulea wrote:
> > > For the hypothetical situation when the user configures a larger buffer
> > > than the ring size * MTU. Should the check happen in validate or should
> > > the max buffer size be dynamic depending on ring size and MTU?
> >
> > Hm, why does the ring size come into the calculation?
>
> There is a relationship between ring size, MTU and how much memory a queue
> would need for a full ring, right? Even if relationship is driver dependent.
I see, yes, I think I did something along those lines in patch 16 here.
But the range of values for bnxt is pretty limited so a lot fewer
corner cases to deal with.
Not sure about the calculation depending on MTU, tho. We're talking
about HW-GRO enabled traffic, they should be tightly packed into the
buffer, right? So MTU of chunks really doesn't matter from the buffer
sizing perspective. If they are not packed, using larger buffers is
pointless.
> > I don't think it's a practical configuration, so it should be perfectly
> > fine for the driver to reject it. But in principle if user wants to
> > configure a 128 entry ring with 1MB buffers.. I guess they must have
> > a lot of DRAM to waste, but other than that I don't see a reason to
> > stop them within the core?
> >
> Ok, so config can be rejected. How about the driver changing the allowed
> max based on the current ring size and MTU? This would allow larger
> buffers on larger rings and MTUs.
Yes.
> There is another interesting case where the user specifies some large
> buffer size which amounts to roughly the max memory for the current ring
> and MTU configuration. We'd end up with a page_pool with a size of one
> which is not very useful...
Right, we can probably save ourselves from the corner cases by capping
the allowed configuration at the max TSO size, so 512 kB? Does that help?
> > Documenting sounds good, just wanna make sure I understand the potential
> > ambiguity.
> Is it clearer now? I was just thinking about how to add support for this
> in mlx5 and stumbled into this grey area.
Yes, thanks!
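The cap suggested above could look roughly like this; `EX_MAX_TSO_SIZE` is an assumed stand-in for the 512 kB figure mentioned, not an actual kernel constant:

```c
/* Assumed stand-in for the 512 kB max-TSO cap discussed above. */
#define EX_MAX_TSO_SIZE (512u * 1024u)

/* Clamp a requested rx-buf-len to the TSO-derived ceiling, sidestepping
 * the pool-of-one corner case for oversized requests. */
static unsigned int ex_cap_rx_buf_len(unsigned int requested)
{
	return requested > EX_MAX_TSO_SIZE ? EX_MAX_TSO_SIZE : requested;
}
```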
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-12 22:30 ` Jakub Kicinski
@ 2025-06-13 19:02 ` Dragos Tatulea
2025-06-13 23:16 ` Jakub Kicinski
0 siblings, 1 reply; 66+ messages in thread
From: Dragos Tatulea @ 2025-06-13 19:02 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Thu, Jun 12, 2025 at 03:30:37PM -0700, Jakub Kicinski wrote:
> On Thu, 12 Jun 2025 15:52:12 +0000 Dragos Tatulea wrote:
> > On Thu, Jun 12, 2025 at 07:10:28AM -0700, Jakub Kicinski wrote:
> > > On Thu, 12 Jun 2025 11:56:26 +0000 Dragos Tatulea wrote:
> > > > For the hypothetical situation when the user configures a larger buffer
> > > > than the ring size * MTU. Should the check happen in validate or should
> > > > the max buffer size be dynamic depending on ring size and MTU?
> > >
> > > Hm, why does the ring size come into the calculation?
> >
> > There is a relationship between ring size, MTU and how much memory a queue
> > would need for a full ring, right? Even if relationship is driver dependent.
>
> I see, yes, I think I did something along those lines in patch 16 here.
> But the range of values for bnxt is pretty limited so a lot fewer
> corner cases to deal with.
>
Indeed.
> Not sure about the calculation depending on MTU, tho. We're talking
> about HW-GRO enabled traffic, they should be tightly packed into the
> buffer, right? So MTU of chunks really doesn't matter from the buffer
> sizing perspective. If they are not packed, using larger buffers is
> pointless.
>
But it matters from the perspective of total memory allocatable by the
queue (aka page pool size), right? A 1K ring with 1500 MTU would
need less total memory than a 1K ring with 9000 MTU to cover the full
queue.
Side note: We already have the disconnect between how much the driver
*thinks* it needs (based on ring size, MTU and other stuff) and how much
memory is given by a memory provider from the application side.
> > > I don't think it's a practical configuration, so it should be perfectly
> > > fine for the driver to reject it. But in principle if user wants to
> > > configure a 128 entry ring with 1MB buffers.. I guess they must have
> > > a lot of DRAM to waste, but other than that I don't see a reason to
> > > stop them within the core?
> > >
> > Ok, so config can be rejected. How about the driver changing the allowed
> > max based on the current ring size and MTU? This would allow larger
> > buffers on larger rings and MTUs.
>
> Yes.
>
> > There is another interesting case where the user specifies some large
> > buffer size which amounts to roughly the max memory for the current ring
> > and MTU configuration. We'd end up with a page_pool with a size of one
> > which is not very useful...
>
> Right, we can probably save ourselves from the corner cases by capping
> the allowed configuration at the max TSO size so 512kB? Does that help?
>
Yes. Sounds like a good start. We can always investigate the benefits of
bigger buffers later.
Thanks,
Dragos
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-13 19:02 ` Dragos Tatulea
@ 2025-06-13 23:16 ` Jakub Kicinski
2025-06-17 12:36 ` Dragos Tatulea
0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-06-13 23:16 UTC (permalink / raw)
To: Dragos Tatulea
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Fri, 13 Jun 2025 19:02:53 +0000 Dragos Tatulea wrote:
> > > There is a relationship between ring size, MTU and how much memory a queue
> > > would need for a full ring, right? Even if relationship is driver dependent.
> >
> > I see, yes, I think I did something along those lines in patch 16 here.
> > But the range of values for bnxt is pretty limited so a lot fewer
> > corner cases to deal with.
>
> Indeed.
>
> > Not sure about the calculation depending on MTU, tho. We're talking
> > about HW-GRO enabled traffic, they should be tightly packed into the
> > buffer, right? So MTU of chunks really doesn't matter from the buffer
> > sizing perspective. If they are not packed, using larger buffers is
> > pointless.
> >
> But it matters from the perspective of total memory allocatable by the
> queue (aka page pool size), right? A 1K ring size with 1500 MTU would
> need less total memory than for a 1K queue x 9000 MTU to cover the full
> queue.
True but that's only relevant to the "normal" buffers?
IIUC for bnxt and fbnic the ring size for rx-jumbo-pending
(which is where payloads go) is always in 4k buffer units.
Whether the MTU is 1k or 9k we'd GRO the packets together
into the 4k buffers. So I don't see why the MTU matters
for the amount of memory held on the aggregation ring.
> Side note: We already have the disconnect between how much the driver
> *thinks* it needs (based on ring size, MTU and other stuff) and how much
> memory is given by a memory provider from the application side.
True, tho, I think ideally the drivers would accept starting
with a ring that's not completely filled. I think that's better
user experience.
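The point about the aggregation ring can be expressed as arithmetic: if payload buffers are fixed 4K units, the memory pinned by the aggregation ring depends only on the entry count, not the MTU. A sketch with assumed example values:

```c
/* With fixed-size aggregation buffers (4K for bnxt/fbnic per the
 * discussion above), ring memory is entries * buffer size regardless
 * of whether the MTU is 1k or 9k. */
#define EX_AGG_BUF_SIZE 4096u

static unsigned long ex_agg_ring_bytes(unsigned int ring_entries)
{
	return (unsigned long)ring_entries * EX_AGG_BUF_SIZE;
}
```

A 1024-entry aggregation ring pins 4 MB either way, which is why the MTU drops out of the sizing for HW-GRO payload memory.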
* Re: [RFC net-next 19/22] eth: bnxt: use queue op config validate
2025-06-13 23:16 ` Jakub Kicinski
@ 2025-06-17 12:36 ` Dragos Tatulea
0 siblings, 0 replies; 66+ messages in thread
From: Dragos Tatulea @ 2025-06-17 12:36 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, michael.chan
On Fri, Jun 13, 2025 at 04:16:36PM -0700, Jakub Kicinski wrote:
> On Fri, 13 Jun 2025 19:02:53 +0000 Dragos Tatulea wrote:
> > > > There is a relationship between ring size, MTU and how much memory a queue
> > > > would need for a full ring, right? Even if relationship is driver dependent.
> > >
> > > I see, yes, I think I did something along those lines in patch 16 here.
> > > But the range of values for bnxt is pretty limited so a lot fewer
> > > corner cases to deal with.
> >
> > Indeed.
> >
> > > Not sure about the calculation depending on MTU, tho. We're talking
> > > about HW-GRO enabled traffic, they should be tightly packed into the
> > > buffer, right? So MTU of chunks really doesn't matter from the buffer
> > > sizing perspective. If they are not packed, using larger buffers is
> > > pointless.
> > >
> > But it matters from the perspective of total memory allocatable by the
> > queue (aka page pool size), right? A 1K ring size with 1500 MTU would
> > need less total memory than for a 1K queue x 9000 MTU to cover the full
> > queue.
>
> True but that's only relevant to the "normal" buffers?
> IIUC for bnxt and fbnic the ring size for rx-jumbo-pending
> (which is where payloads go) is always in 4k buffer units.
> Whether the MTU is 1k or 9k we'd GRO the packets together
> into the 4k buffers. So I don't see why the MTU matters
> for the amount of memory held on the aggregation ring.
>
I see what you mean. mlx5 sizes the memory requirements according to
ring size and MTU. That's where my misunderstanding came from.
> > Side note: We already have the disconnect between how much the driver
> > *thinks* it needs (based on ring size, MTU and other stuff) and how much
> > memory is given by a memory provider from the application side.
>
> True, tho, I think ideally the drivers would accept starting
> with a ring that's not completely filled. I think that's better
> user experience.
Agreed.
Thanks,
Dragos
* Re: [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace
2025-04-21 22:28 ` [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace Jakub Kicinski
2025-04-22 16:36 ` Stanislav Fomichev
@ 2025-06-25 12:23 ` Breno Leitao
1 sibling, 0 replies; 66+ messages in thread
From: Breno Leitao @ 2025-06-25 12:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
donald.hunter, sdf, almasrymina, dw, asml.silence, ap420073,
jdamato, dtatulea, michael.chan
On Mon, Apr 21, 2025 at 03:28:26PM -0700, Jakub Kicinski wrote:
> bpftrace is very useful for low level driver testing. perf or trace-cmd
> would also do for collecting data from tracepoints, but they require
> much more post-processing.
>
> Add a wrapper for running bpftrace and sanitizing its output.
> bpftrace has JSON output, which is great, but it prints loose objects
> and in a slightly inconvenient format. We have to read the objects
> line by line, and while at it return them indexed by the map name.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
FYI: I've stolen this patch and sent it as part of my netpoll test patchset:
https://lore.kernel.org/all/20250625-netpoll_test-v2-1-47d27775222c@debian.org/
end of thread, other threads:[~2025-06-25 12:23 UTC | newest]
Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-21 22:28 [RFC net-next 00/22] net: per-queue rx-buf-len configuration Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Jakub Kicinski
2025-04-22 16:19 ` David Wei
2025-04-22 19:48 ` Joe Damato
2025-04-23 20:08 ` Mina Almasry
2025-04-25 22:50 ` Jakub Kicinski
2025-04-25 23:20 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 02/22] net: ethtool: report max value for rx-buf-len Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 03/22] net: use zero value to restore rx_buf_len to default Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 04/22] net: clarify the meaning of netdev_config members Jakub Kicinski
2025-04-22 19:57 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 05/22] net: add rx_buf_len to netdev config Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 06/22] eth: bnxt: read the page size from the adapter struct Jakub Kicinski
2025-04-23 20:35 ` Mina Almasry
2025-04-25 22:51 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 07/22] eth: bnxt: set page pool page order based on rx_page_size Jakub Kicinski
2025-04-22 15:32 ` Stanislav Fomichev
2025-04-22 15:52 ` Jakub Kicinski
2025-04-22 17:27 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 08/22] eth: bnxt: support setting size of agg buffers via ethtool Jakub Kicinski
2025-04-23 21:00 ` Mina Almasry
2025-04-25 22:58 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 09/22] net: move netdev_config manipulation to dedicated helpers Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Jakub Kicinski
2025-04-23 21:04 ` Mina Almasry
2025-04-21 22:28 ` [RFC net-next 11/22] net: allocate per-queue config structs and pass them thru the queue API Jakub Kicinski
2025-04-23 21:17 ` Mina Almasry
2025-04-25 23:24 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 12/22] net: pass extack to netdev_rx_queue_restart() Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 13/22] net: add queue config validation callback Jakub Kicinski
2025-04-22 15:49 ` Stanislav Fomichev
2025-04-22 20:16 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 14/22] eth: bnxt: always set the queue mgmt ops Jakub Kicinski
2025-04-22 15:50 ` Stanislav Fomichev
2025-04-22 20:18 ` Joe Damato
2025-04-21 22:28 ` [RFC net-next 15/22] eth: bnxt: store the rx buf size per queue Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Jakub Kicinski
2025-04-22 16:13 ` Stanislav Fomichev
2025-04-21 22:28 ` [RFC net-next 17/22] netdev: add support for setting rx-buf-len per queue Jakub Kicinski
2025-04-22 16:15 ` Stanislav Fomichev
2025-04-25 23:41 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 18/22] net: wipe the setting of deactived queues Jakub Kicinski
2025-04-22 16:21 ` Stanislav Fomichev
2025-04-25 23:42 ` Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 19/22] eth: bnxt: use queue op config validate Jakub Kicinski
2025-04-23 10:00 ` Dragos Tatulea
2025-04-23 13:46 ` Jakub Kicinski
2025-04-23 14:24 ` Dragos Tatulea
2025-04-23 15:33 ` Jakub Kicinski
2025-06-12 11:56 ` Dragos Tatulea
2025-06-12 14:10 ` Jakub Kicinski
2025-06-12 15:52 ` Dragos Tatulea
2025-06-12 22:30 ` Jakub Kicinski
2025-06-13 19:02 ` Dragos Tatulea
2025-06-13 23:16 ` Jakub Kicinski
2025-06-17 12:36 ` Dragos Tatulea
2025-04-21 22:28 ` [RFC net-next 20/22] eth: bnxt: support per queue configuration of rx-buf-len Jakub Kicinski
2025-04-21 22:28 ` [RFC net-next 21/22] selftests: drv-net: add helper/wrapper for bpftrace Jakub Kicinski
2025-04-22 16:36 ` Stanislav Fomichev
2025-04-22 16:39 ` Stanislav Fomichev
2025-06-25 12:23 ` Breno Leitao
2025-04-21 22:28 ` [RFC net-next 22/22] selftests: drv-net: add test for rx-buf-len Jakub Kicinski
2025-04-22 17:06 ` Stanislav Fomichev
2025-04-25 23:52 ` Jakub Kicinski
2025-04-23 20:02 ` [RFC net-next 00/22] net: per-queue rx-buf-len configuration Mina Almasry
2025-04-25 23:55 ` Jakub Kicinski