Netdev List

* Re: [PATCH net v2] mac802154: llsec: add skb_cow_data() before in-place crypto
From: Stefan Schmidt @ 2026-06-19 20:47 UTC
  To: alex.aring, miquel.raynal, Doruk Tan Ozturk
  Cc: Stefan Schmidt, aleksander.lobakin, linux-wpan, netdev, security,
	stable
In-Reply-To: <20260526183726.56100-1-doruk@0sec.ai>

Hello Doruk Tan Ozturk.

On Tue, 26 May 2026 20:37:26 +0200, Doruk Tan Ozturk wrote:
> llsec_do_encrypt_unauth(), llsec_do_encrypt_auth(),
> llsec_do_decrypt_unauth(), and llsec_do_decrypt_auth() all perform
> in-place cryptographic transformations on skb data.  They build a
> scatterlist with sg_init_one() pointing into the skb's linear data area
> and then pass the same scatterlist as both src and dst to the crypto API
> (e.g. crypto_skcipher_encrypt/decrypt, crypto_aead_encrypt/decrypt).
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/1] mac802154: llsec: add skb_cow_data() before in-place crypto
      https://git.kernel.org/wpan/wpan-next/c/84a04eb5b210

regards,
Stefan Schmidt

* Re: [PATCH net v3 0/3] Avoid calling WARN_ON() on allocation failure in cfg802154_switch_netns()
From: Stefan Schmidt @ 2026-06-19 20:29 UTC
  To: Alexander Aring, Ivan Abramov
  Cc: Stefan Schmidt, Miquel Raynal, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-wpan, netdev,
	linux-kernel, lvc-project
In-Reply-To: <20250403101935.991385-1-i.abramov@mt-integration.ru>

Hello Ivan Abramov.

On Thu, 03 Apr 2025 13:19:31 +0300, Ivan Abramov wrote:
> This series was inspired by Syzkaller report on warning in
> cfg802154_switch_netns().
> 
> WARNING: CPU: 0 PID: 5837 at net/ieee802154/core.c:258 cfg802154_switch_netns+0x3c7/0x3d0 net/ieee802154/core.c:258
> Modules linked in:
> CPU: 0 UID: 0 PID: 5837 Comm: syz-executor125 Not tainted 6.13.0-rc6-syzkaller-00918-g7b24f164cf00 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> RIP: 0010:cfg802154_switch_netns+0x3c7/0x3d0 net/ieee802154/core.c:258
> Call Trace:
>  <TASK>
>  nl802154_wpan_phy_netns+0x13d/0x210 net/ieee802154/nl802154.c:1292
>  genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
>  genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
>  genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
>  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
>  genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
>  netlink_unicast_kernel net/netlink/af_netlink.c:1322 [inline]
>  netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1348
>  netlink_sendmsg+0x8e4/0xcb0 net/netlink/af_netlink.c:1892
>  sock_sendmsg_nosec net/socket.c:711 [inline]
>  __sock_sendmsg+0x221/0x270 net/socket.c:726
>  ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2594
>  ___sys_sendmsg net/socket.c:2648 [inline]
>  __sys_sendmsg+0x269/0x350 net/socket.c:2680
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/3] ieee802154: Restore initial state on failed device_rename() in cfg802154_switch_netns()
      https://git.kernel.org/wpan/wpan-next/c/a2e06b4bef20
[2/3] ieee802154: Avoid calling WARN_ON() on -ENOMEM in cfg802154_switch_netns()
      https://git.kernel.org/wpan/wpan-next/c/0569f67ed6a7
[3/3] ieee802154: Remove WARN_ON() in cfg802154_pernet_exit()
      https://git.kernel.org/wpan/wpan-next/c/e69ed6fc9fb3

regards,
Stefan Schmidt

* Re: [PATCH] ieee802154: ca8210: fix cas_ctl leak on spi_async failure
From: Stefan Schmidt @ 2026-06-19 20:29 UTC
  To: alex.aring, miquel.raynal, Shitalkumar Gandhi
  Cc: Stefan Schmidt, andrew+netdev, davem, edumazet, kuba, pabeni,
	linux-wpan, netdev, linux-kernel, stable, Shitalkumar Gandhi
In-Reply-To: <20260421073259.2259783-1-shitalkumar.gandhi@cambiumnetworks.com>

Hello Shitalkumar Gandhi.

On Tue, 21 Apr 2026 13:02:59 +0530, Shitalkumar Gandhi wrote:
> ca8210_spi_transfer() allocates cas_ctl with kzalloc_obj(GFP_ATOMIC)
> and relies entirely on the SPI completion callback
> ca8210_spi_transfer_complete() to free it.
> 
> The spi_async() API only invokes the completion callback on successful
> submission.  On failure it returns a negative error code without ever
> queuing the callback, which leaves cas_ctl and its embedded spi_message
> and spi_transfer orphaned.  Every kfree(cas_ctl) in the driver is
> inside the completion callback, so there is no other reclamation path.
> 
> [...]
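
The ownership rule described in the quoted text can be sketched in plain
user-space C. This is only an illustration under stated assumptions, not
the driver's code: submit_async(), send_request() and struct request are
hypothetical stand-ins, and the stub runs the callback immediately
where a real asynchronous API would run it later.

```c
#include <errno.h>
#include <stdlib.h>

struct request {
	void (*complete)(struct request *req);
	int payload;
};

static int completions;	/* number of times the callback has run */

/* Completion callback: the only place that frees on the success path. */
static void request_complete(struct request *req)
{
	completions++;
	free(req);
}

/*
 * Hypothetical stand-in for an asynchronous submit call. On failure it
 * returns a negative errno and never invokes the callback.
 */
static int submit_async(struct request *req, int fail)
{
	if (fail)
		return -EIO;
	req->complete(req);	/* a real API would call this later */
	return 0;
}

static int send_request(int fail)
{
	struct request *req = calloc(1, sizeof(*req));
	int ret;

	if (!req)
		return -ENOMEM;
	req->complete = request_complete;

	ret = submit_async(req, fail);
	if (ret < 0)
		free(req);	/* callback will never run, so free here */
	return ret;
}
```

Calling send_request(1) takes the failure path and releases the
allocation in the caller; send_request(0) leaves the free to the
callback.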

Applied to wpan/wpan-next.git, thanks!

[1/1] ieee802154: ca8210: fix cas_ctl leak on spi_async failure
      https://git.kernel.org/wpan/wpan-next/c/e09390e439bd

regards,
Stefan Schmidt

* Re: [PATCH wpan v3] ieee802154: ca8210: fix pointer truncation in kfifo on 64-bit
From: Stefan Schmidt @ 2026-06-19 20:37 UTC
  To: Miquel Raynal, Alexander Aring, Shitalkumar Gandhi
  Cc: Stefan Schmidt, Simon Horman, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-wpan, netdev,
	linux-kernel, stable, Shitalkumar Gandhi
In-Reply-To: <20260520105750.30144-1-shitalkumar.gandhi@cambiumnetworks.com>

Hello Shitalkumar Gandhi.

On Wed, 20 May 2026 16:27:50 +0530, Shitalkumar Gandhi wrote:
> ca8210_test_int_driver_write() and ca8210_test_int_user_read() exchange
> a kmalloc'd buffer pointer through a struct kfifo, but pass a literal
> '4' as the byte count to kfifo_in()/kfifo_out().
> 
> This is correct on 32-bit (pointer = 4 bytes), but on 64-bit only the
> low 4 bytes of the 8-byte pointer are written into the FIFO. The reader
> then reads back 4 bytes into an 8-byte local pointer variable, leaving
> the upper 4 bytes uninitialized stack data. The first dereference of
> the reconstructed pointer (fifo_buffer[1]) accesses an arbitrary kernel
> address and generally results in an oops.
> 
> [...]
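
The truncation described in the quoted text can be reproduced in user
space with a byte buffer standing in for the FIFO. This is a minimal
sketch, not the driver's code: roundtrip_preserves() is a hypothetical
helper, a uintptr_t plays the role of the buffer pointer, and the 0xAA
fill stands in for whatever stale data the reader's local variable
would hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy nbytes of a pointer-sized value into a byte "FIFO" and back out
 * into a pointer-sized destination that starts with stale contents.
 * Returns 1 if the value read back equals the value written.
 */
static int roundtrip_preserves(uintptr_t value, size_t nbytes)
{
	unsigned char fifo[sizeof(uintptr_t)];
	uintptr_t out;

	assert(nbytes <= sizeof(uintptr_t));

	memset(&out, 0xAA, sizeof(out));	/* models stale stack data */
	memcpy(fifo, &value, nbytes);		/* writer side */
	memcpy(&out, fifo, nbytes);		/* reader side */

	return out == value;
}
```

With nbytes set to sizeof(uintptr_t) the value survives on any
architecture. With a literal 4 it survives only where pointers are
4 bytes wide; on a 64-bit build the remaining bytes keep the stale fill.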

Applied to wpan/wpan-next.git, thanks!

[1/1] ieee802154: ca8210: fix pointer truncation in kfifo on 64-bit
      https://git.kernel.org/wpan/wpan-next/c/6d7f7bcf225b

regards,
Stefan Schmidt

* [PATCH rdma-next v8] RDMA: Change capability fields in ib_device_attr from int to u32
From: Erni Sri Satya Vennela @ 2026-06-19 20:30 UTC
  To: Jason Gunthorpe, Leon Romanovsky, mkalderon, zyjzyj2000, sagi,
	mgurtovoy, haris.iqbal, jinpu.wang, bvanassche, kbusch,
	Jens Axboe, Christoph Hellwig, kch, smfrench, linkinjeon, metze,
	tom, trondmy, anna, chuck.lever, jlayton, neil, okorniev, Dai.Ngo,
	achender, davem, edumazet, kuba, pabeni, horms, kees, markzhang,
	andriy.shevchenko, ebadger, linux-rdma, linux-kernel,
	target-devel, linux-nvme, linux-cifs, samba-technical, linux-nfs,
	netdev, rds-devel
  Cc: Erni Sri Satya Vennela, Jason Gunthorpe

The capability counter fields in struct ib_device_attr are declared
as signed int, but these values are inherently non-negative. Drivers
maintain their cached caps as u32 and assign them directly into these
int fields; if a cap exceeds INT_MAX the implicit narrowing yields a
negative value visible to the IB core.

Change the signed int capability fields to u32 to match the
underlying nature of the data. Also update consumers across the IB
core, ULPs, NVMe-oF target, RDS, and NFS/RDMA so the new u32 values
are not forced back through signed int or u8 via min()/min_t() or
narrowing local variables.
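
To make the narrowing concrete, here is a minimal user-space sketch, not
kernel code. The helper names are hypothetical and u32 is a local
typedef standing in for the kernel type. It assumes a compiler such as
GCC or Clang, where converting an out-of-range u32 to int wraps modulo
2^32.

```c
#include <limits.h>
#include <stdint.h>

typedef uint32_t u32;	/* local stand-in for the kernel's u32 */

/* Hypothetical: models assigning a driver's u32 cap into an int field. */
static int cached_cap_to_int(u32 cap)
{
	return (int)cap;
}

/* Hypothetical consumer check against the int field. */
static int request_fits_int(int wanted, u32 hw_cap)
{
	return wanted <= cached_cap_to_int(hw_cap);
}

/* The same check once the field is u32 and nothing is narrowed. */
static int request_fits_u32(u32 wanted, u32 hw_cap)
{
	return wanted <= hw_cap;
}
```

Under that assumption, a cap of 0x80000000 reads back as a negative
limit through the int path, so even a request of 16 is rejected, while
the u32 comparison accepts it.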

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Acked-by: Stefan Metzmacher <metze@samba.org> # smbdirect
---
Changes in v8:
* Convert the remaining non-negative counter fields max_ee_rd_atom,
  max_ee_init_rd_atom, max_ee, max_rdd, max_raw_ipv6_qp and max_srq_wr
  to u32; keep max_srq as int (its consumer compares it against
  ib_device.num_comp_vectors, still int).
* Drop all remaining min_t() where plain min() now works.
* Make the srq_size module parameters unsigned int so the srq_size min()
  stays a plain min().
* Replace the ternary-inside-min() with the simpler "if (x) x--;".
* Reorder the send_queue_depth min() to min(value, CONST) to match the
  sibling site.
* Restore reverse xmas-tree declaration order.
* Collapse the min()/min3() assignments that now fit onto a single line
  within 100 columns.
* Print the now-u32 fields with %u instead of %d.
Changes in v7:
* Drop min_t() in all sites where a plain min() (or min3()) works
  cleanly
* Guard nvme/host/rdma.c num_inline_segments computation against a
  device reporting max_send_sge == 0, so the u32 subtract
  cannot wrap to UINT_MAX.
* Use %u when printing the newly-u32 capability fields
  in diagnostic messages.
Changes in v6:
* Fix subject prefix: net-next -> rdma-next.
Changes in v5:
* Add U8_MAX clamps in iser_verbs, nvme/host, nvme/target, isert,
  rds/ib_cm, smbdirect/connect and smbdirect/accept where u32 capability
  fields were directly narrowed into u8 rdma_conn_param fields without
  clamping.
* Guard the inline_sge_count calculation in nvmet_rdma_find_get_device()
  to prevent u32 underflow when both max_sge_rd and max_recv_sge are zero.
* Expand type migration to 9 additional fields (max_mw, max_raw_ethy_qp,
  max_mcast_grp, max_mcast_qp_attach, max_total_mcast_qp_attach, max_ah,
  max_srq, max_srq_wr, max_srq_sge)
* Fix min_t(int,...) in svc_rdma_transport; min_t(u32,...) in ipoib,
  srpt, nvme/target, rds/ib, rtrs-clt, rtrs-srv, xprtrdma/verbs.
* Fix frwr_ops.c u32 underflow guard (reorder check before subtraction)
* Change sc_max_send_sges to unsigned int, inline_sge_count to u32
* Fix %d -> %u in rxe_qp, rxe_srq, ipoib_cm, ib_isert, svc_rdma_transport
* Update commit message.
Changes in v4:
* Drop clamping the values in mana_ib_query_device, instead update
  the props values from int to u32.
Changes in v3:
* Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
  caps cache does not need to be clamped.
* Move all clamping exclusively to mana_ib_query_device(), which is the
  only place the cached u32 values are narrowed into the signed int
  fields of struct ib_device_attr.
* Reframe commit message: this is a u32-to-int type boundary fix, not a
  CVM/untrusted-hardware hardening patch.
Changes in v2:
* Update patch title.
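
Two of the guards mentioned in the changelog above, the "if (x) x--;"
decrement and the U8_MAX clamp, can be sketched in user-space C as
follows. The helper names are hypothetical, the typedefs are local
stand-ins for the kernel types, and UINT8_MAX plays the role of U8_MAX.

```c
#include <stdint.h>

typedef uint32_t u32;	/* local stand-ins for the kernel types */
typedef uint8_t u8;

/*
 * Hypothetical helper: subtract one from an unsigned count without
 * letting 0 wrap around to UINT32_MAX.
 */
static u32 dec_unless_zero(u32 v)
{
	if (v)
		v--;
	return v;
}

/*
 * Hypothetical helper: clamp a u32 capability before storing it in a
 * u8 field, so values above 255 saturate instead of being truncated.
 */
static u8 clamp_to_u8(u32 v)
{
	return (u8)(v < UINT8_MAX ? v : UINT8_MAX);
}
```

Without the guard, (u32)0 - 1 evaluates to UINT32_MAX. Without the
clamp, a plain cast of 300 to u8 keeps only the low byte, which is 44.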
---
 drivers/infiniband/core/cq.c               |  3 +-
 drivers/infiniband/hw/qedr/verbs.c         |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c         | 22 +++++-----
 drivers/infiniband/sw/rxe/rxe_srq.c        | 16 +++----
 drivers/infiniband/ulp/ipoib/ipoib_cm.c    | 10 ++---
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |  5 +--
 drivers/infiniband/ulp/isert/ib_isert.c    |  7 ++-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c     | 11 ++---
 drivers/infiniband/ulp/rtrs/rtrs-srv.c     | 11 ++---
 drivers/infiniband/ulp/srp/ib_srp.c        |  2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c      | 21 +++++----
 drivers/nvme/host/rdma.c                   |  8 ++--
 drivers/nvme/target/rdma.c                 | 13 +++---
 fs/smb/smbdirect/accept.c                  |  5 ++-
 fs/smb/smbdirect/connect.c                 |  5 ++-
 fs/smb/smbdirect/connection.c              |  8 ++--
 include/linux/sunrpc/svc_rdma.h            |  4 +-
 include/rdma/ib_verbs.h                    | 50 +++++++++++-----------
 net/rds/ib.c                               | 10 ++---
 net/rds/ib_cm.c                            | 10 ++---
 net/sunrpc/xprtrdma/frwr_ops.c             |  7 +--
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  5 +--
 net/sunrpc/xprtrdma/verbs.c                |  2 +-
 24 files changed, 117 insertions(+), 123 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 3d7b6cddd131..ee98188e57fb 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -393,8 +393,7 @@ static int ib_alloc_cqs(struct ib_device *dev, unsigned int nr_cqes,
 	 * a reasonable batch size so that we can share CQs between
 	 * multiple users instead of allocating a larger number of CQs.
 	 */
-	nr_cqes = min_t(unsigned int, dev->attrs.max_cqe,
-			max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
+	nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
 	nr_cqs = min_t(unsigned int, dev->num_comp_vectors, num_online_cpus());
 	for (i = 0; i < nr_cqs; i++) {
 		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 679aa6f3a63b..a85ad0171134 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -151,7 +151,7 @@ int qedr_query_device(struct ib_device *ibdev,
 	attr->max_qp_init_rd_atom =
 	    1 << (fls(qattr->max_qp_req_rd_atomic_resc) - 1);
 	attr->max_qp_rd_atom =
-	    min(1 << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
+	    min(1U << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
 		attr->max_qp_init_rd_atom);
 
 	attr->max_srq = qattr->max_srq;
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index f3dff1aea96a..7a0529a17992 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -67,27 +67,27 @@ static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap,
 			  int has_srq)
 {
 	if (cap->max_send_wr > rxe->attr.max_qp_wr) {
-		rxe_dbg_dev(rxe, "invalid send wr = %u > %d\n",
-			 cap->max_send_wr, rxe->attr.max_qp_wr);
+		rxe_dbg_dev(rxe, "invalid send wr = %u > %u\n",
+			    cap->max_send_wr, rxe->attr.max_qp_wr);
 		goto err1;
 	}
 
 	if (cap->max_send_sge > rxe->attr.max_send_sge) {
-		rxe_dbg_dev(rxe, "invalid send sge = %u > %d\n",
-			 cap->max_send_sge, rxe->attr.max_send_sge);
+		rxe_dbg_dev(rxe, "invalid send sge = %u > %u\n",
+			    cap->max_send_sge, rxe->attr.max_send_sge);
 		goto err1;
 	}
 
 	if (!has_srq) {
 		if (cap->max_recv_wr > rxe->attr.max_qp_wr) {
-			rxe_dbg_dev(rxe, "invalid recv wr = %u > %d\n",
-				 cap->max_recv_wr, rxe->attr.max_qp_wr);
+			rxe_dbg_dev(rxe, "invalid recv wr = %u > %u\n",
+				    cap->max_recv_wr, rxe->attr.max_qp_wr);
 			goto err1;
 		}
 
 		if (cap->max_recv_sge > rxe->attr.max_recv_sge) {
-			rxe_dbg_dev(rxe, "invalid recv sge = %u > %d\n",
-				 cap->max_recv_sge, rxe->attr.max_recv_sge);
+			rxe_dbg_dev(rxe, "invalid recv sge = %u > %u\n",
+				    cap->max_recv_sge, rxe->attr.max_recv_sge);
 			goto err1;
 		}
 	}
@@ -537,9 +537,9 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	if (mask & IB_QP_MAX_QP_RD_ATOMIC) {
 		if (attr->max_rd_atomic > rxe->attr.max_qp_rd_atom) {
-			rxe_dbg_qp(qp, "invalid max_rd_atomic %d > %d\n",
-				 attr->max_rd_atomic,
-				 rxe->attr.max_qp_rd_atom);
+			rxe_dbg_qp(qp, "invalid max_rd_atomic %u > %u\n",
+				   attr->max_rd_atomic,
+				   rxe->attr.max_qp_rd_atom);
 			goto err1;
 		}
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index c9a7cd38953d..74904a6fdf2b 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -13,8 +13,8 @@ int rxe_srq_chk_init(struct rxe_dev *rxe, struct ib_srq_init_attr *init)
 	struct ib_srq_attr *attr = &init->attr;
 
 	if (attr->max_wr > rxe->attr.max_srq_wr) {
-		rxe_dbg_dev(rxe, "max_wr(%d) > max_srq_wr(%d)\n",
-			attr->max_wr, rxe->attr.max_srq_wr);
+		rxe_dbg_dev(rxe, "max_wr(%u) > max_srq_wr(%u)\n",
+			    attr->max_wr, rxe->attr.max_srq_wr);
 		goto err1;
 	}
 
@@ -27,8 +27,8 @@ int rxe_srq_chk_init(struct rxe_dev *rxe, struct ib_srq_init_attr *init)
 		attr->max_wr = RXE_MIN_SRQ_WR;
 
 	if (attr->max_sge > rxe->attr.max_srq_sge) {
-		rxe_dbg_dev(rxe, "max_sge(%d) > max_srq_sge(%d)\n",
-			attr->max_sge, rxe->attr.max_srq_sge);
+		rxe_dbg_dev(rxe, "max_sge(%u) > max_srq_sge(%u)\n",
+			    attr->max_sge, rxe->attr.max_srq_sge);
 		goto err1;
 	}
 
@@ -107,8 +107,8 @@ int rxe_srq_chk_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 
 	if (mask & IB_SRQ_MAX_WR) {
 		if (attr->max_wr > rxe->attr.max_srq_wr) {
-			rxe_dbg_srq(srq, "max_wr(%d) > max_srq_wr(%d)\n",
-				attr->max_wr, rxe->attr.max_srq_wr);
+			rxe_dbg_srq(srq, "max_wr(%u) > max_srq_wr(%u)\n",
+				    attr->max_wr, rxe->attr.max_srq_wr);
 			goto err1;
 		}
 
@@ -129,8 +129,8 @@ int rxe_srq_chk_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 
 	if (mask & IB_SRQ_LIMIT) {
 		if (attr->srq_limit > rxe->attr.max_srq_wr) {
-			rxe_dbg_srq(srq, "srq_limit(%d) > max_srq_wr(%d)\n",
-				attr->srq_limit, rxe->attr.max_srq_wr);
+			rxe_dbg_srq(srq, "srq_limit(%u) > max_srq_wr(%u)\n",
+				    attr->srq_limit, rxe->attr.max_srq_wr);
 			goto err1;
 		}
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 57fec88a1629..ed0592898384 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -1071,8 +1071,7 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 	struct ib_qp *tx_qp;
 
 	if (dev->features & NETIF_F_SG)
-		attr.cap.max_send_sge = min_t(u32, priv->ca->attrs.max_send_sge,
-					      MAX_SKB_FRAGS + 1);
+		attr.cap.max_send_sge = min(priv->ca->attrs.max_send_sge, MAX_SKB_FRAGS + 1);
 
 	tx_qp = ib_create_qp(priv->pd, &attr);
 	tx->max_send_sge = attr.cap.max_send_sge;
@@ -1582,7 +1581,8 @@ static void ipoib_cm_create_srq(struct net_device *dev, int max_sge)
 int ipoib_cm_dev_init(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = ipoib_priv(dev);
-	int max_srq_sge, i;
+	u32 max_srq_sge;
+	int i;
 	u8 addr;
 
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
@@ -1600,9 +1600,9 @@ int ipoib_cm_dev_init(struct net_device *dev)
 
 	skb_queue_head_init(&priv->cm.skb_queue);
 
-	ipoib_dbg(priv, "max_srq_sge=%d\n", priv->ca->attrs.max_srq_sge);
+	ipoib_dbg(priv, "max_srq_sge=%u\n", priv->ca->attrs.max_srq_sge);
 
-	max_srq_sge = min_t(int, IPOIB_CM_RX_SG, priv->ca->attrs.max_srq_sge);
+	max_srq_sge = min(priv->ca->attrs.max_srq_sge, IPOIB_CM_RX_SG);
 	ipoib_cm_create_srq(dev, max_srq_sge);
 	if (ipoib_cm_has_srq(dev)) {
 		priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x10;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 3ed1ea566690..2490696a1aab 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -147,8 +147,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		.cap = {
 			.max_send_wr  = ipoib_sendq_size,
 			.max_recv_wr  = ipoib_recvq_size,
-			.max_send_sge = min_t(u32, priv->ca->attrs.max_send_sge,
-					      MAX_SKB_FRAGS + 1),
+			.max_send_sge = min(priv->ca->attrs.max_send_sge, MAX_SKB_FRAGS + 1),
 			.max_recv_sge = IPOIB_UD_RX_SG
 		},
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index f03b3bb3c0c4..55fe68e5b837 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -244,8 +244,7 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 		max_send_wr = ISER_QP_SIG_MAX_REQ_DTOS + 1;
 	else
 		max_send_wr = ISER_QP_MAX_REQ_DTOS + 1;
-	max_send_wr = min_t(unsigned int, max_send_wr,
-			    (unsigned int)ib_dev->attrs.max_qp_wr);
+	max_send_wr = min(max_send_wr, ib_dev->attrs.max_qp_wr);
 
 	cq_size = max_send_wr + ISER_QP_MAX_RECV_DTOS;
 	ib_conn->cq = ib_cq_pool_get(ib_dev, cq_size, -1, IB_POLL_SOFTIRQ);
@@ -589,7 +588,7 @@ static void iser_route_handler(struct rdma_cm_id *cma_id)
 		goto failure;
 
 	memset(&conn_param, 0, sizeof conn_param);
-	conn_param.responder_resources = ib_dev->attrs.max_qp_rd_atom;
+	conn_param.responder_resources = min(ib_dev->attrs.max_qp_rd_atom, U8_MAX);
 	conn_param.initiator_depth = 1;
 	conn_param.retry_count = 7;
 	conn_param.rnr_retry_count = 6;
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 1015a51f750a..4691845bf815 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -214,9 +214,9 @@ isert_create_device_ib_res(struct isert_device *device)
 	struct ib_device *ib_dev = device->ib_device;
 	int ret;
 
-	isert_dbg("devattr->max_send_sge: %d devattr->max_recv_sge %d\n",
+	isert_dbg("devattr->max_send_sge: %u devattr->max_recv_sge %u\n",
 		  ib_dev->attrs.max_send_sge, ib_dev->attrs.max_recv_sge);
-	isert_dbg("devattr->max_sge_rd: %d\n", ib_dev->attrs.max_sge_rd);
+	isert_dbg("devattr->max_sge_rd: %u\n", ib_dev->attrs.max_sge_rd);
 
 	device->pd = ib_alloc_pd(ib_dev, 0);
 	if (IS_ERR(device->pd)) {
@@ -381,8 +381,7 @@ isert_set_nego_params(struct isert_conn *isert_conn,
 	struct ib_device_attr *attr = &isert_conn->device->ib_device->attrs;
 
 	/* Set max inflight RDMA READ requests */
-	isert_conn->initiator_depth = min_t(u8, param->initiator_depth,
-				attr->max_qp_init_rd_atom);
+	isert_conn->initiator_depth = min(param->initiator_depth, attr->max_qp_init_rd_atom);
 	isert_dbg("Using initiator_depth: %u\n", isert_conn->initiator_depth);
 
 	if (param->private_data) {
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index e351552733df..80b08697f96b 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1681,8 +1681,7 @@ static int create_con_cq_qp(struct rtrs_clt_con *con)
 		 * + 2 for drain and heartbeat
 		 * in case qp gets into error state.
 		 */
-		max_send_wr =
-			min_t(int, wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
+		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
 		max_recv_wr = max_send_wr;
 	} else {
 		/*
@@ -1698,11 +1697,9 @@ static int create_con_cq_qp(struct rtrs_clt_con *con)
 		wr_limit = clt_path->s.dev->ib_dev->attrs.max_qp_wr;
 		/* Shared between connections */
 		clt_path->s.dev_ref++;
-		max_send_wr = min_t(int, wr_limit,
-			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
-			      clt_path->queue_depth * 4 + 1);
-		max_recv_wr = min_t(int, wr_limit,
-			      clt_path->queue_depth * 3 + 1);
+		/* QD * (REQ + RSP + FR REGS or INVS) + drain */
+		max_send_wr = min(wr_limit, clt_path->queue_depth * 4 + 1);
+		max_recv_wr = min(wr_limit, clt_path->queue_depth * 3 + 1);
 		max_send_sge = 2;
 	}
 	atomic_set(&con->c.sq_wr_avail, max_send_wr);
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 6482ad859bd1..f5a6890235bc 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -1731,21 +1731,16 @@ static int create_con(struct rtrs_srv_path *srv_path,
 		 * All receive and all send (each requiring invalidate)
 		 * + 2 for drain and heartbeat
 		 */
-		max_send_wr = min_t(int, wr_limit,
-				    SERVICE_CON_QUEUE_DEPTH * 2 + 2);
+		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
 		max_recv_wr = max_send_wr;
 		s->signal_interval = min_not_zero(srv->queue_depth,
 						  (size_t)SERVICE_CON_QUEUE_DEPTH);
 	} else {
 		/* when always_invlaidate enalbed, we need linv+rinv+mr+imm */
 		if (always_invalidate)
-			max_send_wr =
-				min_t(int, wr_limit,
-				      srv->queue_depth * (1 + 4) + 1);
+			max_send_wr = min(wr_limit, srv->queue_depth * (1 + 4) + 1);
 		else
-			max_send_wr =
-				min_t(int, wr_limit,
-				      srv->queue_depth * (1 + 2) + 1);
+			max_send_wr = min(wr_limit, srv->queue_depth * (1 + 2) + 1);
 
 		max_recv_wr = srv->queue_depth + 1;
 	}
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index acbd787de265..0caebbc2810f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -557,7 +557,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
 	init_attr->cap.max_send_wr     = m * target->queue_size;
 	init_attr->cap.max_recv_wr     = target->queue_size + 1;
 	init_attr->cap.max_recv_sge    = 1;
-	init_attr->cap.max_send_sge    = min(SRP_MAX_SGE, attr->max_send_sge);
+	init_attr->cap.max_send_sge    = min(attr->max_send_sge, SRP_MAX_SGE);
 	init_attr->sq_sig_type         = IB_SIGNAL_REQ_WR;
 	init_attr->qp_type             = IB_QPT_RC;
 	init_attr->send_cq             = send_cq;
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 9aec5d80117f..a4e4feba4a02 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -77,8 +77,8 @@ module_param(srp_max_req_size, int, 0444);
 MODULE_PARM_DESC(srp_max_req_size,
 		 "Maximum size of SRP request messages in bytes.");
 
-static int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
-module_param(srpt_srq_size, int, 0444);
+static unsigned int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
+module_param(srpt_srq_size, uint, 0444);
 MODULE_PARM_DESC(srpt_srq_size,
 		 "Shared receive queue (SRQ) size.");
 
@@ -405,8 +405,7 @@ static void srpt_get_ioc(struct srpt_port *sport, u32 slot,
 	if (sdev->use_srq)
 		send_queue_depth = sdev->srq_size;
 	else
-		send_queue_depth = min(MAX_SRPT_RQ_SIZE,
-				       sdev->device->attrs.max_qp_wr);
+		send_queue_depth = min(sdev->device->attrs.max_qp_wr, MAX_SRPT_RQ_SIZE);
 
 	memset(iocp, 0, sizeof(*iocp));
 	strcpy(iocp->id_string, SRPT_ID_STRING);
@@ -1850,7 +1849,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 	struct srpt_port *sport = ch->sport;
 	struct srpt_device *sdev = sport->sdev;
 	const struct ib_device_attr *attrs = &sdev->device->attrs;
-	int sq_size = sport->port_attrib.srp_sq_size;
+	u32 sq_size = sport->port_attrib.srp_sq_size;
 	int i, ret;
 
 	WARN_ON(ch->rq_size < 1);
@@ -1911,13 +1910,13 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 		bool retry = sq_size > MIN_SRPT_SQ_SIZE;
 
 		if (retry) {
-			pr_debug("failed to create queue pair with sq_size = %d (%d) - retrying\n",
+			pr_debug("failed to create queue pair with sq_size = %u (%d) - retrying\n",
 				 sq_size, ret);
 			ib_cq_pool_put(ch->cq, ch->cq_size);
 			sq_size = max(sq_size / 2, MIN_SRPT_SQ_SIZE);
 			goto retry;
 		} else {
-			pr_err("failed to create queue pair with sq_size = %d (%d)\n",
+			pr_err("failed to create queue pair with sq_size = %u (%d)\n",
 			       sq_size, ret);
 			goto err_destroy_cq;
 		}
@@ -1925,7 +1924,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 
 	atomic_set(&ch->sq_wr_avail, qp_init->cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d ch= %p\n",
+	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %u ch= %p\n",
 		 __func__, ch->cq->cqe, qp_init->cap.max_send_sge,
 		 qp_init->cap.max_send_wr, ch);
 
@@ -2298,7 +2297,7 @@ static int srpt_cm_req_recv(struct srpt_device *const sdev,
 	 * depth to avoid that the initiator driver has to report QUEUE_FULL
 	 * to the SCSI mid-layer.
 	 */
-	ch->rq_size = min(MAX_SRPT_RQ_SIZE, sdev->device->attrs.max_qp_wr);
+	ch->rq_size = min(sdev->device->attrs.max_qp_wr, MAX_SRPT_RQ_SIZE);
 	spin_lock_init(&ch->spinlock);
 	ch->state = CH_CONNECTING;
 	INIT_LIST_HEAD(&ch->cmd_wait_list);
@@ -3136,7 +3135,7 @@ static int srpt_alloc_srq(struct srpt_device *sdev)
 		return PTR_ERR(srq);
 	}
 
-	pr_debug("create SRQ #wr= %d max_allow=%d dev= %s\n", sdev->srq_size,
+	pr_debug("create SRQ #wr= %d max_allow=%u dev= %s\n", sdev->srq_size,
 		 sdev->device->attrs.max_srq_wr, dev_name(&device->dev));
 
 	sdev->req_buf_cache = srpt_cache_get(srp_max_req_size);
@@ -3951,7 +3950,7 @@ static int __init srpt_init_module(void)
 
 	if (srpt_srq_size < MIN_SRPT_SRQ_SIZE
 	    || srpt_srq_size > MAX_SRPT_SRQ_SIZE) {
-		pr_err("invalid value %d for kernel module parameter srpt_srq_size -- must be in the range [%d..%d].\n",
+		pr_err("invalid value %u for kernel module parameter srpt_srq_size -- must be in the range [%d..%d].\n",
 		       srpt_srq_size, MIN_SRPT_SRQ_SIZE, MAX_SRPT_SRQ_SIZE);
 		goto out;
 	}
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 6909e3542794..56cd228af1d5 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -394,8 +394,10 @@ nvme_rdma_find_get_device(struct rdma_cm_id *cm_id)
 		goto out_free_pd;
 	}
 
-	ndev->num_inline_segments = min(NVME_RDMA_MAX_INLINE_SEGMENTS,
-					ndev->dev->attrs.max_send_sge - 1);
+	ndev->num_inline_segments = ndev->dev->attrs.max_send_sge;
+	if (ndev->num_inline_segments)
+		ndev->num_inline_segments--;
+	ndev->num_inline_segments = min(ndev->num_inline_segments, NVME_RDMA_MAX_INLINE_SEGMENTS);
 	list_add(&ndev->entry, &device_list);
 out_unlock:
 	mutex_unlock(&device_list_mutex);
@@ -1847,7 +1849,7 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
 	param.qp_num = queue->qp->qp_num;
 	param.flow_control = 1;
 
-	param.responder_resources = queue->device->dev->attrs.max_qp_rd_atom;
+	param.responder_resources = min(queue->device->dev->attrs.max_qp_rd_atom, U8_MAX);
 	/* maximum retry count */
 	param.retry_count = 7;
 	param.rnr_retry_count = 7;
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ac26f4f774c4..1c332d66222a 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -152,7 +152,7 @@ static const struct kernel_param_ops srq_size_ops = {
 	.get = param_get_int,
 };
 
-static int nvmet_rdma_srq_size = 1024;
+static unsigned int nvmet_rdma_srq_size = 1024;
 module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
 MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 1024)");
 
@@ -1197,7 +1197,7 @@ nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
 	struct nvmet_port *nport = port->nport;
 	struct nvmet_rdma_device *ndev;
 	int inline_page_count;
-	int inline_sge_count;
+	u32 inline_sge_count;
 	int ret;
 
 	mutex_lock(&device_list_mutex);
@@ -1213,7 +1213,9 @@ nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
 
 	inline_page_count = num_pages(nport->inline_data_size);
 	inline_sge_count = max(cm_id->device->attrs.max_sge_rd,
-				cm_id->device->attrs.max_recv_sge) - 1;
+				cm_id->device->attrs.max_recv_sge);
+	if (inline_sge_count)
+		inline_sge_count--;
 	if (inline_page_count > inline_sge_count) {
 		pr_warn("inline_data_size %d cannot be supported by device %s. Reducing to %lu.\n",
 			nport->inline_data_size, cm_id->device->name,
@@ -1553,8 +1555,9 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 
 	param.rnr_retry_count = 7;
 	param.flow_control = 1;
-	param.initiator_depth = min_t(u8, p->initiator_depth,
-		queue->dev->device->attrs.max_qp_init_rd_atom);
+	param.initiator_depth = min3(p->initiator_depth,
+				     queue->dev->device->attrs.max_qp_init_rd_atom,
+				     U8_MAX);
 	param.private_data = &priv;
 	param.private_data_len = sizeof(priv);
 	priv.recfmt = cpu_to_le16(NVME_RDMA_CM_FMT_1_0);
diff --git a/fs/smb/smbdirect/accept.c b/fs/smb/smbdirect/accept.c
index 529740005838..44b681a20725 100644
--- a/fs/smb/smbdirect/accept.c
+++ b/fs/smb/smbdirect/accept.c
@@ -32,8 +32,9 @@ int smbdirect_accept_connect_request(struct smbdirect_socket *sc,
 	/*
 	 * First set what the we as server are able to support
 	 */
-	sp->initiator_depth = min_t(u8, sp->initiator_depth,
-				    sc->ib.dev->attrs.max_qp_rd_atom);
+	sp->initiator_depth = min3(sp->initiator_depth,
+				   sc->ib.dev->attrs.max_qp_rd_atom,
+				   U8_MAX);
 
 	peer_initiator_depth = param->initiator_depth;
 	peer_responder_resources = param->responder_resources;
diff --git a/fs/smb/smbdirect/connect.c b/fs/smb/smbdirect/connect.c
index cd726b399afe..34a3e72c38fb 100644
--- a/fs/smb/smbdirect/connect.c
+++ b/fs/smb/smbdirect/connect.c
@@ -182,8 +182,9 @@ static int smbdirect_connect_rdma_connect(struct smbdirect_socket *sc)
 	if (sc->ib.dev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
 		sc->mr_io.type = IB_MR_TYPE_SG_GAPS;
 
-	sp->responder_resources = min_t(u8, sp->responder_resources,
-					sc->ib.dev->attrs.max_qp_rd_atom);
+	sp->responder_resources = min3(sp->responder_resources,
+				       sc->ib.dev->attrs.max_qp_rd_atom,
+				       U8_MAX);
 	smbdirect_log_rdma_mr(sc, SMBDIRECT_LOG_INFO,
 		"responder_resources=%d\n",
 		sp->responder_resources);
diff --git a/fs/smb/smbdirect/connection.c b/fs/smb/smbdirect/connection.c
index 8adf58097534..690acb84e1b5 100644
--- a/fs/smb/smbdirect/connection.c
+++ b/fs/smb/smbdirect/connection.c
@@ -287,7 +287,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	    qp_cap.max_send_wr > sc->ib.dev->attrs.max_qp_wr) {
 		pr_err("Possible CQE overrun: max_send_wr %d\n",
 		       qp_cap.max_send_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -302,7 +302,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	     max_send_wr >= sc->ib.dev->attrs.max_qp_wr)) {
 		pr_err("Possible CQE overrun: rdma_send_wr %d + max_send_wr %d = %d\n",
 		       rdma_send_wr, qp_cap.max_send_wr, max_send_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -316,7 +316,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	    qp_cap.max_recv_wr > sc->ib.dev->attrs.max_qp_wr) {
 		pr_err("Possible CQE overrun: max_recv_wr %d\n",
 		       qp_cap.max_recv_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -328,7 +328,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 
 	if (qp_cap.max_send_sge > sc->ib.dev->attrs.max_send_sge ||
 	    qp_cap.max_recv_sge > sc->ib.dev->attrs.max_recv_sge) {
-		pr_err("device %.*s max_send_sge/max_recv_sge = %d/%d too small\n",
+		pr_err("device %.*s max_send_sge/max_recv_sge = %u/%u too small\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_send_sge,
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad57..217f000be5d6 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -78,8 +78,8 @@ struct svcxprt_rdma {
 	struct rdma_cm_id    *sc_cm_id;		/* RDMA connection id */
 	struct list_head     sc_accept_q;	/* Conn. waiting accept */
 	struct rpcrdma_notification sc_rn;	/* removal notification */
-	int		     sc_ord;		/* RDMA read limit */
-	int                  sc_max_send_sges;
+	u32		     sc_ord;		/* RDMA read limit */
+	unsigned int         sc_max_send_sges;
 	bool		     sc_snd_w_inv;	/* OK to use Send With Invalidate */
 
 	atomic_t             sc_sq_avail;	/* SQEs ready to be consumed */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0b..b8b221b5f564 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -406,36 +406,36 @@ struct ib_device_attr {
 	u32			vendor_id;
 	u32			vendor_part_id;
 	u32			hw_ver;
-	int			max_qp;
-	int			max_qp_wr;
+	u32			max_qp;
+	u32			max_qp_wr;
 	u64			device_cap_flags;
 	u64			kernel_cap_flags;
-	int			max_send_sge;
-	int			max_recv_sge;
-	int			max_sge_rd;
-	int			max_cq;
-	int			max_cqe;
-	int			max_mr;
-	int			max_pd;
-	int			max_qp_rd_atom;
-	int			max_ee_rd_atom;
-	int			max_res_rd_atom;
-	int			max_qp_init_rd_atom;
-	int			max_ee_init_rd_atom;
+	u32			max_send_sge;
+	u32			max_recv_sge;
+	u32			max_sge_rd;
+	u32			max_cq;
+	u32			max_cqe;
+	u32			max_mr;
+	u32			max_pd;
+	u32			max_qp_rd_atom;
+	u32			max_ee_rd_atom;
+	u32			max_res_rd_atom;
+	u32			max_qp_init_rd_atom;
+	u32			max_ee_init_rd_atom;
 	enum ib_atomic_cap	atomic_cap;
 	enum ib_atomic_cap	masked_atomic_cap;
-	int			max_ee;
-	int			max_rdd;
-	int			max_mw;
-	int			max_raw_ipv6_qp;
-	int			max_raw_ethy_qp;
-	int			max_mcast_grp;
-	int			max_mcast_qp_attach;
-	int			max_total_mcast_qp_attach;
-	int			max_ah;
+	u32			max_ee;
+	u32			max_rdd;
+	u32			max_mw;
+	u32			max_raw_ipv6_qp;
+	u32			max_raw_ethy_qp;
+	u32			max_mcast_grp;
+	u32			max_mcast_qp_attach;
+	u32			max_total_mcast_qp_attach;
+	u32			max_ah;
 	int			max_srq;
-	int			max_srq_wr;
-	int			max_srq_sge;
+	u32			max_srq_wr;
+	u32			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
 	unsigned int		max_pi_fast_reg_page_list_len;
 	u16			max_pkeys;
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 39f87272e071..c62684d4259c 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -162,12 +162,12 @@ static int rds_ib_add_one(struct ib_device *device)
 		   IB_ODP_SUPPORT_READ);
 
 	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
-		min_t(unsigned int, (device->attrs.max_mr / 2),
-		      rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
+		min(device->attrs.max_mr / 2,
+		    rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
 	rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
-		min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE),
-		      rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
+		min((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE,
+		    rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
 	rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
 	rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -204,7 +204,7 @@ static int rds_ib_add_one(struct ib_device *device)
 		goto put_dev;
 	}
 
-	rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
+	rdsdebug("RDS/IB: max_mr = %u, max_wrs = %d, max_sge = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 		 device->attrs.max_mr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
 		 rds_ibdev->max_1m_mrs, rds_ibdev->max_8k_mrs);
 
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5667f0173b47..17e587c30076 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -173,11 +173,11 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 
 	memset(conn_param, 0, sizeof(struct rdma_conn_param));
 
-	conn_param->responder_resources =
-		min_t(u32, rds_ibdev->max_responder_resources, max_responder_resources);
-	conn_param->initiator_depth =
-		min_t(u32, rds_ibdev->max_initiator_depth, max_initiator_depth);
-	conn_param->retry_count = min_t(unsigned int, rds_ib_retry_count, 7);
+	conn_param->responder_resources = min3(rds_ibdev->max_responder_resources,
+					       max_responder_resources, U8_MAX);
+	conn_param->initiator_depth = min3(rds_ibdev->max_initiator_depth,
+					   max_initiator_depth, U8_MAX);
+	conn_param->retry_count = min(rds_ib_retry_count, 7U);
 	conn_param->rnr_retry_count = 7;
 
 	if (dp) {
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 7f79a0a2601e..b2e437afe09d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -172,8 +172,9 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
 int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
 {
 	const struct ib_device_attr *attrs = &device->attrs;
-	int max_qp_wr, depth, delta;
 	unsigned int max_sge;
+	u32 max_qp_wr;
+	int depth, delta;
 
 	if (!(attrs->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) ||
 	    attrs->max_fast_reg_page_list_len == 0) {
@@ -229,10 +230,10 @@ int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
 	}
 
 	max_qp_wr = attrs->max_qp_wr;
+	if (max_qp_wr < RPCRDMA_BACKWARD_WRS + 1 + RPCRDMA_MIN_SLOT_TABLE)
+		return -ENOMEM;
 	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
 	max_qp_wr -= 1;
-	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
-		return -ENOMEM;
 	if (ep->re_max_requests > max_qp_wr)
 		ep->re_max_requests = max_qp_wr;
 	ep->re_attr.cap.max_send_wr = ep->re_max_requests * depth;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4f..c768cda2e544 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -544,8 +544,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
 	memset(&conn_param, 0, sizeof conn_param);
 	conn_param.responder_resources = 0;
-	conn_param.initiator_depth = min_t(int, newxprt->sc_ord,
-					   dev->attrs.max_qp_init_rd_atom);
+	conn_param.initiator_depth = min(newxprt->sc_ord, dev->attrs.max_qp_init_rd_atom);
 	if (!conn_param.initiator_depth) {
 		ret = -EINVAL;
 		trace_svcrdma_initdepth_err(newxprt, ret);
@@ -570,7 +569,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		dprintk("    local address   : %pIS:%u\n", sap, rpc_get_port(sap));
 		sap = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
 		dprintk("    remote address  : %pIS:%u\n", sap, rpc_get_port(sap));
-		dprintk("    max_sge         : %d\n", newxprt->sc_max_send_sges);
+		dprintk("    max_sge         : %u\n", newxprt->sc_max_send_sges);
 		dprintk("    sq_depth        : %d\n", newxprt->sc_sq_depth);
 		dprintk("    rdma_rw_ctxs    : %d\n", ctxts);
 		dprintk("    max_requests    : %d\n", newxprt->sc_max_requests);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aecf9c0a153f..8ed9da6d2d2f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -453,7 +453,7 @@ static int rpcrdma_ep_create(struct rpcrdma_xprt *r_xprt)
 	/* Client offers RDMA Read but does not initiate */
 	ep->re_remote_cma.initiator_depth = 0;
 	ep->re_remote_cma.responder_resources =
-		min_t(int, U8_MAX, device->attrs.max_qp_rd_atom);
+		min(device->attrs.max_qp_rd_atom, U8_MAX);
 
 	/* Limit transport retries so client can detect server
 	 * GID changes quickly. RPC layer handles re-establishing
-- 
2.34.1


^ permalink raw reply related

* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: Thomas Gleixner @ 2026-06-19 20:21 UTC (permalink / raw)
  To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <02564e5f0b6be4aeb6198af87b46269963985768.camel@infradead.org>

On Fri, Jun 19 2026 at 16:34, David Woodhouse wrote:
> On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
>> 
>> This formatting makes my brain hurt. Can you please split that out into
>> a separate function?
>
> Yep. There's also a potential error there — an *additional* discrepancy
> comes from the enforced monotonicity that timekeeping_cycles_to_ns()
> applies (the case where it just returns tkr->xtime_nsec >> tkr_shift).
>
> I couldn't work out if I cared about the clocksource-is-non-monotonic
> case, and even if I did, what I should do about it.

I think the right thing is just to ignore it.

The problem is very narrow and mostly related to the historically badly
synchronized TSC between sockets. The TSC_ADJUST fixup is obviously
error-prone as it adjusts only to the point where the error is no
longer observable. But in the update transition phase it can result in
time going backwards because the readout on the other CPU is slightly
behind tk::tkr_mono::cycles_last. That happens only once in a while and
we talk about a very low single digit number of TSC cycles.
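
For readers following along, the clamp under discussion can be modelled
roughly as below. This is a simplified user-space sketch, not the kernel's
timekeeping_cycles_to_ns(): struct read_base and cycles_to_ns_clamped() are
hypothetical names, and the signed-delta test is an assumption standing in
for the kernel's mask-based check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the timekeeper read base. */
struct read_base {
	uint64_t cycle_last;	/* counter value at the last update */
	uint64_t xtime_nsec;	/* base time, shifted left by 'shift' */
	uint32_t mult;		/* cycles -> shifted-ns multiplier */
	uint32_t shift;
};

/*
 * Convert a counter readout to nanoseconds. If the readout is slightly
 * behind cycle_last (for example a marginally unsynchronized counter on
 * another CPU), return the base time instead of going backwards.
 */
uint64_t cycles_to_ns_clamped(const struct read_base *b, uint64_t cycles)
{
	int64_t delta = (int64_t)(cycles - b->cycle_last);

	if (delta < 0)
		return b->xtime_nsec >> b->shift;

	return ((uint64_t)delta * b->mult + b->xtime_nsec) >> b->shift;
}
```

In this model a readout three cycles behind cycle_last simply reports the
base time, so the error introduced by the clamp is bounded by those few
cycles.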

> I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
> or something like that, such that e.g. PTP clients could *ask* for it.

Hell no!

> It's all very well hard-coding it in pps_get_ts() and unconditionally
> changing the behaviour... I *think* we could justify that. But the
> example I actually used in the patch was PTP, and that's slightly
> harder to justify the behavioural change.

Just leave it alone.

If the TSCs between sockets are slightly out of [mostly unobservable]
sync then if you don't hit this corner case at the edge of the update
then you have to live with that discrepancy anyway as you don't know
about it at all. So making a magic extra case for this unlikely event is
overkill. Due to speculation, caches etc. pp the snapshot is anyway in
that low single digit TSC cycles margin of inaccuracy.

Don't try to defeat reality and the underlying physics. Perfect is the
enemy of good.

Thanks,

        tglx

^ permalink raw reply

* [PATCH 2/2] selftests/bpf: validate rx_queue_index in xdp_metadata
From: Siddharth_Cibi @ 2026-06-19 19:57 UTC (permalink / raw)
  To: ast
  Cc: Siddharth_Cibi, Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, open list:XDP (eXpress Data Path),
	open list:XDP (eXpress Data Path),
	open list:KERNEL SELFTEST FRAMEWORK, open list
In-Reply-To: <20260619195759.41254-1-siddharthcibi@icloud.com>

Extend xdp_metadata selftest coverage to validate that
ctx->rx_queue_index is preserved and observable after XDP redirect
execution.

Capture rx_queue_index in metadata and assert that it matches the
expected queue during packet verification.

Signed-off-by: Siddharth_Cibi <siddharthcibi@icloud.com>
---
 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 ++-
 tools/testing/selftests/bpf/progs/xdp_metadata.c      | 2 +-
 tools/testing/selftests/bpf/xdp_metadata.h            | 1 +
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 5c31054ad4a4..f8cabbbe7bb7 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -309,7 +309,8 @@ static int verify_xsk_metadata(struct xsk *xsk, bool sent_from_af_xdp)
 
 	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
 		return -1;
-
+	if (!ASSERT_EQ(meta->rx_queue_index, QUEUE_ID, "rx_queue_index"))
+		return -1;
 	if (!sent_from_af_xdp) {
 		if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
 			return -1;
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index 09bb8a038d52..62ae83860d7f 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -98,7 +98,7 @@ int rx(struct xdp_md *ctx)
 	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
 	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_proto,
 				     &meta->rx_vlan_tci);
-
+	meta->rx_queue_index = ctx->rx_queue_index;
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 87318ad1117a..1f0ae4c00091 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -49,4 +49,5 @@ struct xdp_meta {
 		__s32 rx_vlan_tag_err;
 	};
 	enum xdp_meta_field hint_valid;
+	__u32 rx_queue_index;
 };
-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/2] bpf: preserve rx_queue_index across XDP redirects
From: Siddharth_Cibi @ 2026-06-19 19:57 UTC (permalink / raw)
  To: ast
  Cc: Siddharth C, Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Eric Dumazet, Paolo Abeni, Simon Horman, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	open list:XDP (eXpress Data Path),
	open list:XDP (eXpress Data Path), open list
In-Reply-To: <20260619195759.41254-1-siddharthcibi@icloud.com>

From: Siddharth C <siddharthcibi@icloud.com>

Store rx_queue_index in struct xdp_frame during xdp_buff to
xdp_frame conversion and restore it when rebuilding xdp_rxq_info
for cpumap and devmap execution paths.

This preserves ingress RX queue information for XDP programs
executed after redirect, allowing access to the original
rx_queue_index instead of losing queue context.

Also propagate rx_queue_index for zero-copy XDP frame conversion.

Signed-off-by: Siddharth_Cibi <siddharthcibi@icloud.com>
---
 include/net/xdp.h   | 2 ++
 kernel/bpf/cpumap.c | 2 +-
 kernel/bpf/devmap.c | 5 ++++-
 net/core/xdp.c      | 1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index aa742f413c35..90318b2b76dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -301,6 +301,7 @@ struct xdp_frame {
 	 */
 	enum xdp_mem_type mem_type:32;
 	struct net_device *dev_rx; /* used by cpumap */
+	u32 rx_queue_index;
 	u32 frame_sz;
 	u32 flags; /* supported values defined in xdp_buff_flags */
 };
@@ -441,6 +442,7 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
 
 	/* rxq only valid until napi_schedule ends, convert to xdp_mem_type */
 	xdp_frame->mem_type = xdp->rxq->mem.type;
+	xdp_frame->rx_queue_index = xdp->rxq->queue_index;
 
 	return xdp_frame;
 }
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 5e59ab896f05..8f2d7013620f 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -197,7 +197,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 
 		rxq.dev = xdpf->dev_rx;
 		rxq.mem.type = xdpf->mem_type;
-		/* TODO: report queue_index to xdp_rxq_info */
+		rxq.queue_index = xdpf->rx_queue_index;
 
 		xdp_convert_frame_to_buff(xdpf, &xdp);
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index dc7b859e8bbf..f419fa0e53e5 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -339,7 +339,7 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
 				struct net_device *rx_dev)
 {
 	struct xdp_txq_info txq = { .dev = tx_dev };
-	struct xdp_rxq_info rxq = { .dev = rx_dev };
+	struct xdp_rxq_info rxq = { };
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
@@ -349,6 +349,9 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
 		int err;
 
 		xdp_convert_frame_to_buff(xdpf, &xdp);
+		rxq.dev = rx_dev;
+		rxq.mem.type = xdpf->mem_type;
+		rxq.queue_index = xdpf->rx_queue_index;
 		xdp.txq = &txq;
 		xdp.rxq = &rxq;
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9890a30584ba..9691d8dfadf3 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -606,6 +606,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	xdpf->metasize = metasize;
 	xdpf->frame_sz = PAGE_SIZE;
 	xdpf->mem_type = MEM_TYPE_PAGE_ORDER0;
+	xdpf->rx_queue_index = xdp->rxq->queue_index;
 
 	xsk_buff_free(xdp);
 	return xdpf;
-- 
2.53.0


^ permalink raw reply related

* (no subject)
From: Siddharth_Cibi @ 2026-06-19 19:57 UTC (permalink / raw)
  To: ast
  Cc: Siddharth_Cibi, Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
	open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)

Subject: [PATCH 0/2] preserve rx_queue_index across XDP redirects

XDP programs executed after redirect through cpumap and devmap
currently lose ingress RX queue information because rx_queue_index
is not preserved across xdp_buff to xdp_frame conversion.

Preserve rx_queue_index in struct xdp_frame and restore it when
rebuilding xdp_rxq_info for redirected execution paths.

Add a selftest validating that ctx->rx_queue_index remains available
through xdp_metadata after redirect.
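
The round trip described above can be sketched in user space as follows.
The structures and helpers are hypothetical, heavily trimmed stand-ins for
xdp_buff, xdp_frame and xdp_rxq_info, meant only to show where the queue
index is saved and where it is restored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real structures carry many more fields. */
struct rxq_info { uint32_t queue_index; };
struct buff     { const struct rxq_info *rxq; };
struct frame    { uint32_t rx_queue_index; };

/* buff -> frame: the rxq pointer does not outlive NAPI, so copy the index. */
void buff_to_frame(const struct buff *b, struct frame *f)
{
	f->rx_queue_index = b->rxq->queue_index;
}

/* frame -> rebuilt rxq on the redirect target (cpumap/devmap style). */
void frame_to_rxq(const struct frame *f, struct rxq_info *rxq)
{
	rxq->queue_index = f->rx_queue_index;
}
```

Without the copy in buff_to_frame(), the rebuilt rxq would report whatever
it was initialized to (typically 0) regardless of the ingress queue.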

Testing:

* Built modified kernel objects
* Ran tools/testing/selftests/bpf/test_progs -t xdp_metadata -v
* Verified xdp_metadata passes
* Added explicit rx_queue_index assertion

Siddharth C (1):
  bpf: preserve rx_queue_index across XDP redirects

Siddharth_Cibi (1):
  selftests/bpf: validate rx_queue_index in xdp_metadata

 include/net/xdp.h                                     | 2 ++
 kernel/bpf/cpumap.c                                   | 2 +-
 kernel/bpf/devmap.c                                   | 5 ++++-
 net/core/xdp.c                                        | 1 +
 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 ++-
 tools/testing/selftests/bpf/progs/xdp_metadata.c      | 2 +-
 tools/testing/selftests/bpf/xdp_metadata.h            | 1 +
 7 files changed, 12 insertions(+), 4 deletions(-)

-- 
2.53.0


^ permalink raw reply

* Re: [PATCH net] ipv6: ioam: fix type confusion of dst_entry
From: Justin Iurman @ 2026-06-19 19:42 UTC (permalink / raw)
  To: Jiayuan Chen, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, linux-kernel
In-Reply-To: <20260618104336.48934-1-jiayuan.chen@linux.dev>

On 6/18/26 12:43, Jiayuan Chen wrote:
> IOAM uses a dummy dst_entry(null_dst) to mark that the destination should
> not be changed after the transformation. This dst is stored in the IOAM lwt
> state and may be passed to dst_cache_set_ip6().
> 
> However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which
> treats the dst_entry as part of a struct rt6_info. Since the null_dst was
> embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted
> in an invalid cast and rt6_get_cookie() reading fields from the wrong
> object.
> 
> In practice, the wrong cookie is not used while dst->obsolete is zero, but
> rt6_get_cookie() may also access per-cpu value when rt->sernum is
> zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which
> can become zero, making this a potential invalid pointer access.
> 
> Fix this by embedding a full struct rt6_info for the dummy IPv6 route and
> passing its dst member to the dst APIs.

Good catch, thanks!

> Fixes: 47ce7c854563 ("net: ipv6: ioam6: fix double reallocation")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Reviewed-by: Justin Iurman <justin.iurman@gmail.com>

^ permalink raw reply

* Re: [PATCH rdma-next v7] RDMA: Change capability fields in ib_device_attr from int to u32
From: Erni Sri Satya Vennela @ 2026-06-19 19:32 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: mkalderon, Jason Gunthorpe, Leon Romanovsky, zyjzyj2000, sagi,
	mgurtovoy, haris.iqbal, jinpu.wang, bvanassche, kbusch,
	Jens Axboe, Christoph Hellwig, kch, smfrench, linkinjeon, metze,
	tom, trondmy, anna, chuck.lever, jlayton, neil, okorniev, Dai.Ngo,
	achender, davem, edumazet, kuba, pabeni, horms, kees, ebadger,
	linux-rdma, linux-kernel, target-devel, linux-nvme, linux-cifs,
	samba-technical, linux-nfs, netdev, rds-devel, Jason Gunthorpe
In-Reply-To: <aigwONAwxQx6rLef@ashevche-desk.local>

Hi Andy,

Sorry for the delayed response.

> >  	attr->max_qp_init_rd_atom =
> >  	    1 << (fls(qattr->max_qp_req_rd_atomic_resc) - 1);
> 
> FWIW, this one and the one below look like a reinvention of rounddown_pow_of_two().

Acked.
> 
> >  	attr->max_qp_rd_atom =
> > -	    min(1 << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
> > +	    min(1U << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
> >  		attr->max_qp_init_rd_atom);
> 
> ...
> 
> >  int ipoib_cm_dev_init(struct net_device *dev)
> >  {
> >  	struct ipoib_dev_priv *priv = ipoib_priv(dev);
> > -	int max_srq_sge, i;
> > +	int i;
> > +	u32 max_srq_sge;
> >  	u8 addr;
> 
> It seems the order is reversed xmas tree, why not preserve it?
> 
Right. I'll fix it in the next version.
> ...
> 
> > --- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
> > +++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
> 
> >  		max_send_wr =
> > -			min_t(int, wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> > +			min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> 
> Now this fits perfectly on a single line
> 
> 		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> 
> >  		max_recv_wr = max_send_wr;
> 
> ...
> 
> > -		max_send_wr = min_t(int, wr_limit,
> > -			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
> > -			      clt_path->queue_depth * 4 + 1);
> > -		max_recv_wr = min_t(int, wr_limit,
> > -			      clt_path->queue_depth * 3 + 1);
> > +		max_send_wr = min_t(u32, wr_limit,
> > +				    /* QD * (REQ + RSP + FR REGS or INVS) + drain */
> > +				    clt_path->queue_depth * 4 + 1);
> > +		max_recv_wr = min_t(u32, wr_limit,
> > +				    clt_path->queue_depth * 3 + 1);
> 
> Can we rather update the type of one of them and use min() instead?
> 
I'll remove all the min_t usages in the next version.
> ...
> 
> > --- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
> > +++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
> 
> Ditto.
> 
> ...
> 
> > -static int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
> > -module_param(srpt_srq_size, int, 0444);
> > +static unsigned int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
> > +module_param(srpt_srq_size, uint, 0444);
> 
> Theoretically this might break ABI (if somebody uses negative values for
> anything). I don't think it's the case, but just be informed.
> 
Okay. Thank you for the information.

> >  MODULE_PARM_DESC(srpt_srq_size,
> >  		 "Shared receive queue (SRQ) size.");
> 
> ...
> 
> > --- a/drivers/nvme/target/rdma.c
> > +++ b/drivers/nvme/target/rdma.c
> 
> > -	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
> > -			     nvmet_rdma_srq_size);
> > -	ndev->srq_count = min(ndev->device->num_comp_vectors,
> > -			      ndev->device->attrs.max_srq);
> > +	ndev->srq_size = min_t(u32, ndev->device->attrs.max_srq_wr,
> > +			       nvmet_rdma_srq_size);
> > +	ndev->srq_count = min_t(u32, ndev->device->num_comp_vectors,
> > +				ndev->device->attrs.max_srq);
> 
> Same question: can we change the type of the variables instead?
>
Yes. I'll be doing it in the next version.
 
> >  	mutex_lock(&device_list_mutex);
> 
> ...
> 
> >  	inline_page_count = num_pages(nport->inline_data_size);
> >  	inline_sge_count = max(cm_id->device->attrs.max_sge_rd,
> > -				cm_id->device->attrs.max_recv_sge) - 1;
> > +				cm_id->device->attrs.max_recv_sge);
> > +	inline_sge_count = inline_sge_count ? inline_sge_count - 1 : 0;
> 
> Simple conditional might be better
> 
> 	if (inline_sge_count)
> 		inline_sge_count--;
> 	OR
> 		inline_sge_count -= 1;
Okay. I'll update all such instances.

> 
> ...
> 
> > +++ b/include/rdma/ib_verbs.h
> 
> > -	int			max_qp;
> > -	int			max_qp_wr;
> > +	u32			max_qp;
> > +	u32			max_qp_wr;
> 
> Nice, but please check that none of these (and beyond) were used in signed
> multiplication or (which is more disastrous) division. Otherwise there might be
> subtle issues that will be hard to debug.
Yes I have checked that for all the variables I updated.
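
As an illustration of the kind of pattern that changes meaning when a field
goes from int to u32, here is a minimal user-space sketch of a
subtract-then-compare check, loosely modelled on the max_qp_wr handling in
frwr_query_device(). The function names and the reserved/min_slots
parameters are hypothetical stand-ins for the constants the caller uses.

```c
#include <assert.h>
#include <stdint.h>

/* Check before subtracting: safe with unsigned arithmetic. */
int usable_wrs(uint32_t max_qp_wr, uint32_t reserved,
	       uint32_t min_slots, uint32_t *out)
{
	if (max_qp_wr < reserved + min_slots)
		return -1;

	*out = max_qp_wr - reserved;
	return 0;
}

/*
 * Subtract first, compare afterwards: fine with a signed int, but with
 * u32 the subtraction wraps to a huge value and the check never fires.
 */
int usable_wrs_subtract_first(uint32_t max_qp_wr, uint32_t reserved,
			      uint32_t min_slots, uint32_t *out)
{
	max_qp_wr -= reserved;		/* wraps when max_qp_wr < reserved */
	if (max_qp_wr < min_slots)
		return -1;

	*out = max_qp_wr;
	return 0;
}
```

With max_qp_wr = 4 and reserved = 9, the first helper rejects the device
while the second one accepts it and reports a budget close to 2^32.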

> 
> ...
> 
> >  	conn_param->responder_resources =
> > -		min_t(u32, rds_ibdev->max_responder_resources, max_responder_resources);
> > +		min3(rds_ibdev->max_responder_resources,
> > +		     max_responder_resources, U8_MAX);
> >  	conn_param->initiator_depth =
> > -		min_t(u32, rds_ibdev->max_initiator_depth, max_initiator_depth);
> > +		min3(rds_ibdev->max_initiator_depth,
> > +		     max_initiator_depth, U8_MAX);
> 
> I believe we can go a few characters over and leave them to be single lines.
> 
Okay.
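
For context, a minimal user-space sketch of why the min3(..., U8_MAX) form
matters once the attributes are u32 and the destination is a u8. The helper
names are hypothetical stand-ins, not kernel APIs: min_cast_u8() models
min_t(u8, a, b), which casts both operands before comparing, and
min3_clamp_u8() models comparing at full width and only then capping at
U8_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Models min_t(u8, a, b): both operands are truncated before the compare. */
uint8_t min_cast_u8(uint32_t a, uint32_t b)
{
	uint8_t x = (uint8_t)a;
	uint8_t y = (uint8_t)b;

	return x < y ? x : y;
}

/* Models min3(a, b, U8_MAX): compare at full width, then narrow safely. */
uint8_t min3_clamp_u8(uint32_t a, uint32_t b)
{
	uint32_t m = a < b ? a : b;

	if (m > UINT8_MAX)
		m = UINT8_MAX;

	return (uint8_t)m;
}
```

With a device limit of 256, min_cast_u8(16, 256) yields 0 because 256
truncates to 0 in a u8, whereas min3_clamp_u8(16, 256) yields 16.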

> >  	conn_param->retry_count = min_t(unsigned int, rds_ib_retry_count, 7);
> 
> What about this one?
Sorry, I missed this one. I'll update it.

> 
> >  	conn_param->rnr_retry_count = 7;
> 
> ...
> 
> >  int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
> >  {
> >  	const struct ib_device_attr *attrs = &device->attrs;
> > -	int max_qp_wr, depth, delta;
> > +	u32 max_qp_wr;
> > +	int depth, delta;
> >  	unsigned int max_sge;
> 
> Reversed xmas tree order.
Okay

Thank you for all your suggestions.
The next version will incorporate all of these changes.

- Vennela
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 

^ permalink raw reply

* [PATCH net v2] eth: bnxt: improve the timing of stats
From: Jakub Kicinski @ 2026-06-19 19:15 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	michael.chan, pavan.chebbi

Kernel selftests wait 1.25x of the promised stats refresh time
(as read from ethtool -c). bnxt reports 1sec by default, but
the stats update process has two steps. First the device DMAs the
new values, then the service task performs the update in full-width
SW counters. So the worst case delay is actually 2x.

Note that the behavior is different for ring stats and port stats.
Port stats are fetched synchronously by the service worker, so
there's no risk of doubling up the delay there.

The problem of stale stats impacts not only tests but real workloads
which monitor egress bandwidth of a NIC. The inaccuracy causes double
counting in the next cycle and spurious overload alarms.

Try to read from the DMA buffer more aggressively, to mitigate
timing issues between DMA and service task. The SW update should
be cheap.

Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: michael.chan@broadcom.com
CC: pavan.chebbi@broadcom.com

v2:
 - split the accumulate into port and ring
 - make the sync only cover rings
 - remove sync from callbacks which use port stats (which are fetched
   synchronously by the service worker)
v1: https://lore.kernel.org/20260618181358.3037661-1-kuba@kernel.org

With this patch I had 50 clean runs of ntuple.py in a row.
Previously it'd fail within 5 runs at most.

Hopefully this is good enough. In the past I sent an RFC to
convert the driver to use SW stats for everything. That felt
a little drastic.
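
A rough user-space model of the staleness test, for illustration only. The
helper names are hypothetical, and everything is kept in a single tick unit,
whereas the driver converts the microsecond coalescing period to jiffies
first. ticks_after_eq() is modelled on the kernel's time_after_eq() and
relies on the usual two's-complement conversion to a signed type.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is at or after b" for a free-running tick counter. */
bool ticks_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Stats count as stale once a fifth of the promised refresh period has
 * passed since the last fold from the DMA buffer into the SW counters.
 */
bool stats_are_stale(unsigned long now, unsigned long updated,
		     unsigned long period)
{
	unsigned long stale = period / 5;

	return ticks_after_eq(now, updated + stale);
}
```

With a period of 1000 ticks, a reader arriving 100 ticks after the last
update uses the cached counters, while one arriving 200 ticks later triggers
a re-accumulate. The comparison stays correct when the counter wraps.
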
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  5 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 48 ++++++++++++++++++-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  1 +
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 6d312259f852..6335dfc14c98 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2620,6 +2620,10 @@ struct bnxt {
 #define BNXT_MIN_STATS_COAL_TICKS	  250000
 #define BNXT_MAX_STATS_COAL_TICKS	 1000000
 
+	/* Protects stats_updated_jiffies and writes to sw_stats */
+	spinlock_t		stats_lock;
+	unsigned long		stats_updated_jiffies;
+
 	struct work_struct	sp_task;
 	unsigned long		sp_event;
 #define BNXT_RX_NTP_FLTR_SP_EVENT	1
@@ -3027,6 +3031,7 @@ void bnxt_reenable_sriov(struct bnxt *bp);
 void bnxt_close_nic(struct bnxt *, bool, bool);
 void bnxt_get_ring_drv_stats(struct bnxt *bp,
 			     struct bnxt_total_ring_drv_stats *stats);
+void bnxt_sync_ring_stats(struct bnxt *bp);
 bool bnxt_rfs_capable(struct bnxt *bp, bool new_rss_ctx);
 int bnxt_dbg_hwrm_rd_reg(struct bnxt *bp, u32 reg_off, u16 num_words,
 			 u32 *reg_buf);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 055e93a417b6..7513618793da 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10530,7 +10530,7 @@ static void bnxt_accumulate_stats(struct bnxt_stats_mem *stats)
 				stats->hw_masks, stats->len / 8, false);
 }
 
-static void bnxt_accumulate_all_stats(struct bnxt *bp)
+static void bnxt_accumulate_ring_stats(struct bnxt *bp)
 {
 	struct bnxt_stats_mem *ring0_stats;
 	bool ignore_zero = false;
@@ -10553,6 +10553,10 @@ static void bnxt_accumulate_all_stats(struct bnxt *bp)
 					ring0_stats->hw_masks,
 					ring0_stats->len / 8, ignore_zero);
 	}
+}
+
+static void bnxt_accumulate_port_stats(struct bnxt *bp)
+{
 	if (bp->flags & BNXT_FLAG_PORT_STATS) {
 		struct bnxt_stats_mem *stats = &bp->port_stats;
 		__le64 *hw_stats = stats->hw_stats;
@@ -10575,6 +10579,41 @@ static void bnxt_accumulate_all_stats(struct bnxt *bp)
 	}
 }
 
+static void bnxt_accumulate_all_stats(struct bnxt *bp)
+{
+	bnxt_accumulate_ring_stats(bp);
+	bnxt_accumulate_port_stats(bp);
+}
+
+/* Re-accumulate ring stats from DMA buffers if stale.
+ * uAPIs for reading sw_stats should call this first.
+ *
+ * We promise user space update frequency of bp->stats_coal_ticks but
+ * the update is a two step process - first device updates the DMA buffer,
+ * then we have to update from that buffer to driver stats in the service work.
+ * Worst case we would be 2x off from the desired frequency.
+ * Sync the stats sooner, if stale. The 20% threshold was chosen arbitrarily.
+ *
+ * Ideally we would split the user-configured time into two portions,
+ * i.e. also lower the DMA period by the 20%. But the DMA timer seems to have
+ * too coarse granularity to play such tricks.
+ */
+void bnxt_sync_ring_stats(struct bnxt *bp)
+{
+	unsigned long stale;
+
+	if (!netif_running(bp->dev) || !bp->stats_coal_ticks)
+		return;
+
+	spin_lock(&bp->stats_lock);
+	stale = usecs_to_jiffies(bp->stats_coal_ticks / 5);
+	if (time_after_eq(jiffies, bp->stats_updated_jiffies + stale)) {
+		bnxt_accumulate_ring_stats(bp);
+		bp->stats_updated_jiffies = jiffies;
+	}
+	spin_unlock(&bp->stats_lock);
+}
+
 static int bnxt_hwrm_port_qstats(struct bnxt *bp, u8 flags)
 {
 	struct hwrm_port_qstats_input *req;
@@ -13577,6 +13616,7 @@ bnxt_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 		return;
 	}
 
+	bnxt_sync_ring_stats(bp);
 	bnxt_get_ring_stats(bp, stats);
 	bnxt_add_prev_stats(bp, stats);
 
@@ -14753,7 +14793,10 @@ static void bnxt_sp_task(struct work_struct *work)
 	if (test_and_clear_bit(BNXT_PERIODIC_STATS_SP_EVENT, &bp->sp_event)) {
 		bnxt_hwrm_port_qstats(bp, 0);
 		bnxt_hwrm_port_qstats_ext(bp, 0);
+		spin_lock(&bp->stats_lock);
 		bnxt_accumulate_all_stats(bp);
+		bp->stats_updated_jiffies = jiffies;
+		spin_unlock(&bp->stats_lock);
 	}
 
 	if (test_and_clear_bit(BNXT_LINK_CHNG_SP_EVENT, &bp->sp_event)) {
@@ -15488,6 +15531,7 @@ static int bnxt_init_board(struct pci_dev *pdev, struct net_device *dev)
 	INIT_DELAYED_WORK(&bp->fw_reset_task, bnxt_fw_reset_task);
 
 	spin_lock_init(&bp->ntp_fltr_lock);
+	spin_lock_init(&bp->stats_lock);
 #if BITS_PER_LONG == 32
 	spin_lock_init(&bp->db_lock);
 #endif
@@ -16056,6 +16100,7 @@ static void bnxt_get_queue_stats_rx(struct net_device *dev, int i,
 	if (!bp->bnapi)
 		return;
 
+	bnxt_sync_ring_stats(bp);
 	cpr = &bp->bnapi[i]->cp_ring;
 	sw = cpr->stats.sw_stats;
 
@@ -16084,6 +16129,7 @@ static void bnxt_get_queue_stats_tx(struct net_device *dev, int i,
 	if (!bp->tx_ring)
 		return;
 
+	bnxt_sync_ring_stats(bp);
 	bnapi = bp->tx_ring[bp->tx_ring_map[i]].bnapi;
 	sw = bnapi->cp_ring.stats.sw_stats;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 56d74a3c24b7..62bc9cae613c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -606,6 +606,7 @@ static void bnxt_get_ethtool_stats(struct net_device *dev,
 		goto skip_ring_stats;
 	}
 
+	bnxt_sync_ring_stats(bp);
 	tpa_stats = bnxt_get_num_tpa_ring_stats(bp);
 	for (i = 0; i < bp->cp_nr_rings; i++) {
 		struct bnxt_napi *bnapi = bp->bnapi[i];
-- 
2.54.0


^ permalink raw reply related

* Re: [ANN] Google's Netdev-CI for IDPF and GVE
From: Jakub Kicinski @ 2026-06-19 18:59 UTC (permalink / raw)
  To: Sheena Mohan
  Cc: netdev, andrew+netdev, davem, Eric Dumazet, pabeni, horms,
	Willem de Bruijn, Max Yuan, Pin-yen Lin, Harshitha Ramamurthy,
	Joshua Washington, Danny Gonzalez, David Decotigny, Brian Vazquez
In-Reply-To: <CADWJPTsg5G21=hybo81+QHv0+g64d3a+6gGUaJSm1i7EttCUcw@mail.gmail.com>

On Fri, 29 May 2026 13:44:48 -0700 Sheena Mohan wrote:
> Hi everyone,
> 
> We are happy to share that Netdev-CI testing on both IDPF (running on
> Google Bare Metal) and GVE (running on Google Virtual Machines) is now
> up and running.
> This NIPA integration work enables executing kselftests against the
> current proposed net-next kernel branch on real hardware.
> 
> Thanks to Danny, Max, and Pin-yen for their contributions!
> 
> The test results and logs are available in:
> 
> IDPF Results: https://idpf-netdev-nipa.static.usercontent.goog/json/results.json
> GVE Results: https://gve-netdev-nipa.static.usercontent.goog/json/results.json

Hi Sheena!

The Google runners do not report device info. The results should
contain a "device" object that identifies external components that
may cause regressions (like device FW version), see:
https://github.com/linux-netdev/nipa/wiki/Netdev-CI-system/#device-information
In practice the main use we currently have for it is to auto-categorize
the results as executing on a real driver rather than netdevsim.

^ permalink raw reply

* Re: Ethtool : PRBS feature
From: Andrew Lunn @ 2026-06-19 18:37 UTC (permalink / raw)
  To: Das, Shubham
  Cc: Alexander H Duyck, lee@trager.us, netdev@vger.kernel.org,
	mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
	Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <SN7PR11MB8109C173933D08F994FBB084FFE22@SN7PR11MB8109.namprd11.prod.outlook.com>

> The host driver does not directly access any registers but requests
> the PHY FW to manage PRBS on behalf of it.

Maybe a dumb question. Why?

Can you change the firmware to expose the 802.3 registers for PRBS?
You can then write a library which both phylib and your driver can
use.

	Andrew

^ permalink raw reply

* Re: [PATCH] net: add sock_open() for unified socket creation
From: Alex Goltsev @ 2026-06-19 17:54 UTC (permalink / raw)
  To: Al Viro; +Cc: davem, netdev, linux-kernel
In-Reply-To: <20260619163421.GD2636677@ZenIV>

On Fri, 19 Jun 2026 at 19:34, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Jun 19, 2026 at 01:35:56PM +0300, Alex Goltsev wrote:
> > > What's the point (and why not make it inline, while we are at it)?
> >
> > > Are there really callers that would pass a non-constant value as the last argument,
> > > and if so, what are they doing next?
> >
> >
> > As for `inline`: in this case, it would have no practical significance.
> > The compiler already treats a simple inline function as a regular
> > symbol within the `EXPORT_SYMBOL` context, whereas a static inline
> > function (the standard kernel template for helper functions) would
> > completely break the export to the LKM.
>
> How so?  All three underlying primitives are exported, so static inline
> in whatever include/*/*.h you put it in would work just fine.
>
> > As for the last argument, yes, today it is usually a constant,
> > but that’s not the point. The purpose of the enumeration is to provide
> > a unified, explicit control interface. It’s important that if, in the future,
> > someone adds a new type of socket creation, existing calling programs won’t
> > panic or throw a compilation error, but will smoothly fall back to
> > the default case and return -EINVAL, which is a safe failure mode.
>
> Collapsing several functions together is worthless unless the combination
> can be _used_ other than a (questionable) syntax sugar.  kmalloc() can;
> something that would only result in trading multiple identifiers for
> functions for multiple identifiers for "which function to call" is not
> an improvement.

Thank you for the detailed overview. I understand your point of view:
standardization without adding new features isn’t an improvement. I’ll
consider a v2 in which flags can be combined to produce unique
behavior, so that the API offers more than just syntactic sugar.

^ permalink raw reply

* Re: [PATCH net-next v7 04/11] net: Enable BIG TCP with partial GSO
From: Alice Mikityanska @ 2026-06-19 17:21 UTC (permalink / raw)
  To: Paolo Abeni, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Xin Long, Willem de Bruijn, Willem de Bruijn,
	David Ahern, Nikolay Aleksandrov
  Cc: Shuah Khan, Stanislav Fomichev, Andrew Lunn, Simon Horman,
	Florian Westphal, netdev, Alice Mikityanska
In-Reply-To: <554ff2bd-e4d7-4e64-8ec4-86dc7da85992@redhat.com>

On Sun, Jun 14, 2026, at 14:19, Paolo Abeni wrote:
> On 6/11/26 9:29 PM, Alice Mikityanska wrote:
>> From: Alice Mikityanska <alice@isovalent.com>
>> 
>> skb_segment is called for partial GSO, when netif_needs_gso returns true
>> in validate_xmit_skb. Partial GSO is needed, for example, when
>> segmentation of tunneled traffic is offloaded to a NIC that only
>> supports inner checksum offload.
>> 
>> Currently, skb_segment clamps the segment length to 65534 bytes, because
>> gso_size == 65535 is a special value GSO_BY_FRAGS, and we don't want
>> to accidentally assign mss = 65535, as it would fall into the
>> GSO_BY_FRAGS check further in the function.
>> 
>> This implementation, however, artificially blocks len > 65534, which is
>> possible since the introduction of BIG TCP. To allow bigger lengths and
>> avoid resegmentation of BIG TCP packets, store the gso_by_frags flag in
>> the beginning and don't use a special value of mss for this purpose
>> after mss was modified.
>> 
>> Signed-off-by: Alice Mikityanska <alice@isovalent.com>
>> Reviewed-by: Willem de Bruijn <willemb@google.com>
>> ---
>>  drivers/net/netdevsim/psp.c |  2 +-
>>  net/core/skbuff.c           | 10 +++++-----
>>  2 files changed, 6 insertions(+), 6 deletions(-)
>> 
>> diff --git a/drivers/net/netdevsim/psp.c b/drivers/net/netdevsim/psp.c
>> index d3e36c74be62..6b3532b5e360 100644
>> --- a/drivers/net/netdevsim/psp.c
>> +++ b/drivers/net/netdevsim/psp.c
>> @@ -92,7 +92,7 @@ nsim_do_psp(struct sk_buff *skb, struct netdevsim *ns,
>>  		 * provide a valid checksum here, so the skb isn't dropped.
>>  		 */
>>  		uh = udp_hdr(skb);
>> -		udplen = ntohs(uh->len) ?: skb->len - skb_transport_offset(skb);
>> +		udplen = udp_get_len(skb, uh, skb_transport_offset(skb));
>>  		csum = skb_checksum(skb, skb_transport_offset(skb),
>>  				    udplen, 0);
>>  
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index c64693fcb2d1..5dcee79df8cf 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4773,6 +4773,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	struct sk_buff *tail = NULL;
>>  	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>  	unsigned int mss = skb_shinfo(head_skb)->gso_size;
>> +	bool gso_by_frags = mss == GSO_BY_FRAGS;
>>  	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>>  	unsigned int offset = doffset;
>>  	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
>> @@ -4788,7 +4789,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	int nfrags, pos;
>>  
>>  	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
>> -	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
>> +	    !gso_by_frags && mss != skb_headlen(head_skb)) {
>>  		struct sk_buff *check_skb;
>>  
>>  		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
>> @@ -4816,7 +4817,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  	sg = !!(features & NETIF_F_SG);
>>  	csum = !!can_checksum_protocol(features, proto);
>>  
>> -	if (sg && csum && (mss != GSO_BY_FRAGS))  {
>> +	if (sg && csum && !gso_by_frags)  {
>>  		if (!(features & NETIF_F_GSO_PARTIAL)) {
>>  			struct sk_buff *iter;
>>  			unsigned int frag_len;
>> @@ -4850,9 +4851,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>  		/* GSO partial only requires that we trim off any excess that
>>  		 * doesn't fit into an MSS sized block, so take care of that
>>  		 * now.
>> -		 * Cap len to not accidentally hit GSO_BY_FRAGS.
>>  		 */
>> -		partial_segs = min(len, GSO_BY_FRAGS - 1) / mss;
>> +		partial_segs = len / mss;
>
> Sashiko/gemini says the above can lead to hit BUG_ON() later.
>
> I *think* it's not a false positive, as it looks like skb_segment()
> assumes an skb can hold `mss` bytes without resorting to frag_list
> usage, and mss > MAX_SKB_FRAGS * PAGE_SIZE  breaks such assumption.
>
> I think handling this case correctly will require some non-trivial
> surgery to skb_segment: both `while (pos < offset + len) {` loops must
> be updated to feed data from `frags` as needed instead of
> BUG_ON()/net_warn_ratelimited(skb_shinfo(nskb)->nr_frags >= MAX_SKB_FRAGS);

Thanks, these are valid points. I analyzed the code in skb_segment more
closely, and from what I can tell, this issue existed before my changes,
even without BIG TCP (I have a reproduction). Moreover, I don't think
that my changes make the situation worse, because what truly matters is
the geometry of the SKB coming into skb_segment, not the sizes of the
frags as such.

The root of the issue is that skb_segment attempts to produce SKBs of
the same size and without frag_list, but the incoming SKB can have a
frag_list and more than 17 frags much smaller than PAGE_SIZE. For
example, if the incoming SKB has 30 frags (using frag_list) of 500 bytes
each, it's just 15000 bytes (much smaller than 64k or MAX_SKB_FRAGS *
PAGE_SIZE), but the partial GSO flow will try to fit almost all of them
(mss will be almost 15000) into a single non-frag_list SKB. Since
skb_segment just reuses the existing frags without combining them in any
way, this will end up with "too many frags".

This means that the assumption that an SKB can hold mss bytes without
frag_list is wrong, and skb_segment is already broken as it relies on
this assumption.

As for the solution, we can't just break the loop early before we put
len bytes into an output SKB, because it will break the guarantee that
the output SKBs have the same size. The only way I can imagine is to
dry-run the loop over all frags in advance and determine the smallest
length that any 17 consecutive frags can take together, and limit len
to that value.
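
To make the dry-run idea above concrete, here is a rough userspace
sketch (not kernel code, and not a proposed patch). It assumes the frag
sizes of the head SKB and its frag_list have already been flattened
into one array; min_window_len() and SKETCH_MAX_FRAGS are hypothetical
names that exist only for this illustration, with SKETCH_MAX_FRAGS
standing in for MAX_SKB_FRAGS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_SKB_FRAGS (17 in the logs below). */
#define SKETCH_MAX_FRAGS 17

/*
 * Return the smallest total size of any `window` consecutive frags.
 * If there are no more than `window` frags, all of them fit into one
 * output SKB, so the total size is returned.
 */
unsigned int min_window_len(const unsigned int *sizes, size_t n,
			    size_t window)
{
	unsigned int sum = 0, min;
	size_t i;

	assert(window > 0);

	/* Sum of the first window (or of everything, if n <= window). */
	for (i = 0; i < n && i < window; i++)
		sum += sizes[i];
	min = sum;

	/* Slide the window forward one frag at a time. */
	for (; i < n; i++) {
		sum += sizes[i];
		sum -= sizes[i - window];
		if (sum < min)
			min = sum;
	}

	return min;
}
```

For the example above (30 frags of 500 bytes each), this returns
17 * 500 = 8500, so len would be limited to 8500 instead of almost
15000. A real implementation would have more to consider, for instance
linear head data, frags that are split across a segment boundary, and
presumably rounding the result down to a multiple of gso_size; none of
that is modeled here.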

Why doesn't adding BIG TCP to the picture make the situation worse, in
my opinion? There are two main practical ways to get into skb_segment
with BIG TCP:

1. Sending a TCP stream from an application. tcp_sendmsg_locked doesn't
   build frag_list SKBs, stopping at sysctl_max_skb_frags. If a non-
   frag_list SKB enters skb_segment, it can handle it even if it's well
   above 64k. From my observation, 32768-byte high order pages are
   allocated to back the frags.

2. Receiving a TCP stream with GRO, then forwarding it out of another
   network interface, e.g., that uses GSO partial. This is indeed broken
   if GRO resorts to frag_list, but, as shown above, it's broken even
   without BIG TCP.

Below, I'm attaching a reproducer patch (mostly generated with AI)
that adds debug prints to skb_segment and a script that sets up
forwarding to a GSO partial VXLAN netdev. Setting
net.core.high_order_alloc_disable is not even necessary: I can still
observe a 29026-byte SKB with 21 frags and frag_list (most frags are
1448 bytes), it's just that these frags are allocated from two order-3 pages.

Here are example logs that show the geometry of an SKB that comes from
regular GRO (no BIG TCP) and breaks skb_segment:

skbuff: skb_segment: start head=ffff888101008a00 len=29026 doffset=66 mss=28960 gso_by_frags=0 features=0x0000010e501d0049
skbuff: skb_segment: input head[0] skb=ffff888101008a00 len=29026 data_len=28960 headlen=66 nr_frags=16 frag_list=ffff888104b7e900 head_frag=0 gso_size=1448 gso_segs=20 gso_type=0x801
skbuff: skb_segment: input head[0] frag[0] page=ffffea000420f7c0 head=ffffea000420f7c0 order=0 off=2480 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[1] page=ffffea000420f7c0 head=ffffea000420f7c0 order=0 off=3928 size=168 base_pages=1
skbuff: skb_segment: input head[0] frag[2] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=0 size=1280 base_pages=1
skbuff: skb_segment: input head[0] frag[3] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=1280 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[4] page=ffffea00048261c0 head=ffffea00048261c0 order=0 off=2728 size=1368 base_pages=1
skbuff: skb_segment: input head[0] frag[5] page=ffffea0004886400 head=ffffea0004886400 order=0 off=0 size=80 base_pages=1
skbuff: skb_segment: input head[0] frag[6] page=ffffea0004886400 head=ffffea0004886400 order=0 off=80 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[7] page=ffffea0004886400 head=ffffea0004886400 order=0 off=1528 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[8] page=ffffea0004886400 head=ffffea0004886400 order=0 off=2976 size=1120 base_pages=1
skbuff: skb_segment: input head[0] frag[9] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=0 size=328 base_pages=1
skbuff: skb_segment: input head[0] frag[10] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=328 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[11] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=1776 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[12] page=ffffea00048883c0 head=ffffea00048883c0 order=0 off=3224 size=872 base_pages=1
skbuff: skb_segment: input head[0] frag[13] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=0 size=576 base_pages=1
skbuff: skb_segment: input head[0] frag[14] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=576 size=1448 base_pages=1
skbuff: skb_segment: input head[0] frag[15] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=2024 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] skb=ffff888104b7e900 len=11584 data_len=11584 headlen=0 nr_frags=11 frag_list=0000000000000000 head_frag=0 gso_size=0 gso_segs=0 gso_type=0x0
skbuff: skb_segment: input frag_list[0] frag[0] page=ffffea00041bf840 head=ffffea00041bf840 order=0 off=3472 size=624 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[1] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=0 size=824 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[2] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=824 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[3] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=2272 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[4] page=ffffea000413ae40 head=ffffea000413ae40 order=0 off=3720 size=376 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[5] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=0 size=1072 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[6] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=1072 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[7] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=2520 size=1448 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[8] page=ffffea000421d8c0 head=ffffea000421d8c0 order=0 off=3968 size=128 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[9] page=ffffea0004826440 head=ffffea0004826440 order=0 off=0 size=1320 base_pages=1
skbuff: skb_segment: input frag_list[0] frag[10] page=ffffea0004826440 head=ffffea0004826440 order=0 off=1320 size=1448 base_pages=1
skbuff: skb_segment: setup after push head=ffff888101008a00 len=29026 payload=28960 offset=66 mss=28960 partial_segs=20 sg=1 csum=1 list_skb=ffff888104b7e900 features=0x0000010e501d0049
skbuff: skb_segment: segment begin offset=66 len=28960 end=29026 pos=66 i=0 nfrags=16 hsize=0 list_skb=ffff888104b7e900 frag_skb=ffff888101008a00
skbuff: skb_segment: alloc nskb=ffff888101cf5100 hsize=0 doffset=66 headroom=190
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=0 source_skb=ffff888101008a00 source=frag[0] page=ffffea000420f7c0 off=2480 size=1448 pos=66 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=1 pos=66 next_pos=1514 next_i=1
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=1 source_skb=ffff888101008a00 source=frag[1] page=ffffea000420f7c0 off=3928 size=168 pos=1514 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=2 pos=1514 next_pos=1682 next_i=2
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=2 source_skb=ffff888101008a00 source=frag[2] page=ffffea00048261c0 off=0 size=1280 pos=1682 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=3 pos=1682 next_pos=2962 next_i=3
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=3 source_skb=ffff888101008a00 source=frag[3] page=ffffea00048261c0 off=1280 size=1448 pos=2962 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=4 pos=2962 next_pos=4410 next_i=4
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=4 source_skb=ffff888101008a00 source=frag[4] page=ffffea00048261c0 off=2728 size=1368 pos=4410 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=5 pos=4410 next_pos=5778 next_i=5
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=5 source_skb=ffff888101008a00 source=frag[5] page=ffffea0004886400 off=0 size=80 pos=5778 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=6 pos=5778 next_pos=5858 next_i=6
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=6 source_skb=ffff888101008a00 source=frag[6] page=ffffea0004886400 off=80 size=1448 pos=5858 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=7 pos=5858 next_pos=7306 next_i=7
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=7 source_skb=ffff888101008a00 source=frag[7] page=ffffea0004886400 off=1528 size=1448 pos=7306 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=8 pos=7306 next_pos=8754 next_i=8
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=8 source_skb=ffff888101008a00 source=frag[8] page=ffffea0004886400 off=2976 size=1120 pos=8754 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=9 pos=8754 next_pos=9874 next_i=9
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=9 source_skb=ffff888101008a00 source=frag[9] page=ffffea00048883c0 off=0 size=328 pos=9874 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=10 pos=9874 next_pos=10202 next_i=10
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=10 source_skb=ffff888101008a00 source=frag[10] page=ffffea00048883c0 off=328 size=1448 pos=10202 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=11 pos=10202 next_pos=11650 next_i=11
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=11 source_skb=ffff888101008a00 source=frag[11] page=ffffea00048883c0 off=1776 size=1448 pos=11650 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=12 pos=11650 next_pos=13098 next_i=12
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=12 source_skb=ffff888101008a00 source=frag[12] page=ffffea00048883c0 off=3224 size=872 pos=13098 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=13 pos=13098 next_pos=13970 next_i=13
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=13 source_skb=ffff888101008a00 source=frag[13] page=ffffea00041bf840 off=0 size=576 pos=13970 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=14 pos=13970 next_pos=14546 next_i=14
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=14 source_skb=ffff888101008a00 source=frag[14] page=ffffea00041bf840 off=576 size=1448 pos=14546 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=15 pos=14546 next_pos=15994 next_i=15
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=15 source_skb=ffff888101008a00 source=frag[15] page=ffffea00041bf840 off=2024 size=1448 pos=15994 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=16 pos=15994 next_pos=17442 next_i=16
skbuff: skb_segment: sg source exhausted pos=17442 target_end=29026 old_frag_skb=ffff888101008a00 old_nfrags=16 next_list=ffff888104b7e900
skbuff: skb_segment: sg entered frag_list skb=ffff888104b7e900 len=11584 headlen=0 nr_frags=11 head_frag=0 start_i=0 next_list=0000000000000000
skbuff: skb_segment: sg add nskb=ffff888101cf5100 out_frag=16 source_skb=ffff888104b7e900 source=frag[0] page=ffffea00041bf840 off=3472 size=624 pos=17442 target=[66,29026)
skbuff: skb_segment: sg consumed source nskb=ffff888101cf5100 out_nr_frags=17 pos=17442 next_pos=18066 next_i=1
skbuff: skb_segment: sg output full nskb=ffff888101cf5100 nr_frags=17 max=17 pos=18066 target_end=29026 source_i=1 source_nfrags=11 source_skb=ffff888104b7e900
skbuff: skb_segment: too many frags: 18066 28960
skbuff: skb_segment: error err=-22 segs=ffff888101cf5100 tail=0000000019d2a099 offset=66 len=28960 pos=18066 i=1 nfrags=11 list_skb=0000000000000000 frag_skb=ffff888104b7e900

Here goes the reproducer patch (applies on top of net-next):

--cut--
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 18dabb4e9cfa..f23aff92f857 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -65,6 +65,7 @@
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
 #include <linux/crc32.h>
+#include <linux/ratelimit.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -4759,6 +4760,77 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_segment_list);
 
+static DEFINE_RATELIMIT_STATE(skb_segment_trace_ratelimit, 5 * HZ, 1);
+
+//#define SKB_SEGMENT_TRACE_THRESHOLD GSO_LEGACY_MAX_SIZE
+#define SKB_SEGMENT_TRACE_THRESHOLD 20000
+
+static bool skb_segment_trace(const struct sk_buff *skb,
+			      unsigned int mss)
+{
+	if (!mss || mss == GSO_BY_FRAGS)
+		return false;
+
+	if (mss <= SKB_SEGMENT_TRACE_THRESHOLD && skb->len <= SKB_SEGMENT_TRACE_THRESHOLD)
+		return false;
+
+	return __ratelimit(&skb_segment_trace_ratelimit);
+}
+
+static void skb_segment_dbg_dump_skb(const char *stage, const char *role,
+				     unsigned int idx,
+				     const struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	unsigned int nr_frags = shinfo->nr_frags;
+	unsigned int i;
+
+	pr_info("skb_segment: %s %s[%u] skb=%px len=%u data_len=%u headlen=%u nr_frags=%u frag_list=%px head_frag=%u gso_size=%u gso_segs=%u gso_type=0x%x\n",
+		stage, role, idx, skb, skb->len, skb->data_len,
+		skb_headlen(skb), nr_frags, shinfo->frag_list,
+		skb->head_frag, shinfo->gso_size, shinfo->gso_segs,
+		shinfo->gso_type);
+
+	for (i = 0; i < nr_frags; i++) {
+		const skb_frag_t *frag = &shinfo->frags[i];
+
+		struct page *page = skb_frag_page(frag);
+		struct page *head = compound_head(page);
+		unsigned int off = skb_frag_off(frag);
+		unsigned int size = skb_frag_size(frag);
+		unsigned int base_pages;
+
+		base_pages = ((off + size - 1) >> PAGE_SHIFT) - (off >> PAGE_SHIFT) + 1;
+
+		pr_info("skb_segment: %s %s[%u] frag[%u] page=%px head=%px order=%u off=%u size=%u base_pages=%u\n",
+			stage, role, idx, i, skb_frag_page(frag),
+			head, compound_order(head),
+			skb_frag_off(frag), skb_frag_size(frag),
+			base_pages);
+	}
+}
+
+static void skb_segment_dbg_dump_tree(const char *stage,
+				      const struct sk_buff *skb)
+{
+	struct sk_buff *iter;
+	unsigned int i = 0;
+
+	skb_segment_dbg_dump_skb(stage, "head", 0, skb);
+	skb_walk_frags(skb, iter)
+		skb_segment_dbg_dump_skb(stage, "frag_list", i++, iter);
+}
+
+static void skb_segment_dbg_dump_list(const char *stage,
+				      const struct sk_buff *skb)
+{
+	const struct sk_buff *iter;
+	unsigned int i = 0;
+
+	for (iter = skb; iter; iter = iter->next)
+		skb_segment_dbg_dump_skb(stage, "out", i++, iter);
+}
+
 /**
  *	skb_segment - Perform protocol segmentation on skb.
  *	@head_skb: buffer to segment
@@ -4775,6 +4847,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *tail = NULL;
 	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
 	unsigned int mss = skb_shinfo(head_skb)->gso_size;
+	bool gso_by_frags = mss == GSO_BY_FRAGS;
 	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
 	unsigned int offset = doffset;
 	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
@@ -4784,6 +4857,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *frag_skb;
 	skb_frag_t *frag;
 	__be16 proto;
+	bool trace = false;
 	bool csum, sg;
 	int err = -ENOMEM;
 	int i = 0;
@@ -4861,6 +4935,18 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			partial_segs = 0;
 	}
 
+	trace = skb_segment_trace(head_skb, mss);
+	if (trace) {
+		pr_info("skb_segment: start head=%px len=%u doffset=%u mss=%u gso_by_frags=%u features=%pNF\n",
+			head_skb, head_skb->len, doffset, mss, gso_by_frags,
+			&features);
+		skb_segment_dbg_dump_tree("input", head_skb);
+		pr_info("skb_segment: setup after push head=%px len=%u payload=%u offset=%u mss=%u partial_segs=%u sg=%u csum=%u list_skb=%px features=%pNF\n",
+			head_skb, head_skb->len, head_skb->len - offset,
+			offset, mss, partial_segs, sg, csum, list_skb,
+			&features);
+	}
+
 normal:
 	headroom = skb_headroom(head_skb);
 	pos = skb_headlen(head_skb);
@@ -4888,10 +4974,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		hsize = skb_headlen(head_skb) - offset;
 
+		if (trace)
+			pr_info("skb_segment: segment begin offset=%u len=%u end=%u pos=%d i=%d nfrags=%d hsize=%d list_skb=%px frag_skb=%px\n",
+				offset, len, offset + len, pos, i, nfrags,
+				hsize, list_skb, frag_skb);
+
 		if (hsize <= 0 && i >= nfrags && skb_headlen(list_skb) &&
 		    (skb_headlen(list_skb) == len || sg)) {
 			BUG_ON(skb_headlen(list_skb) > len);
 
+			if (trace)
+				pr_info("skb_segment: clone frag_list skb=%px headlen=%u len=%u nr_frags=%u pos=%d target_end=%u sg=%u\n",
+					list_skb, skb_headlen(list_skb),
+					list_skb->len,
+					skb_shinfo(list_skb)->nr_frags,
+					pos, offset + len, sg);
+
 			nskb = skb_clone(list_skb, GFP_ATOMIC);
 			if (unlikely(!nskb))
 				goto err;
@@ -4903,9 +5001,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			pos += skb_headlen(list_skb);
 
 			while (pos < offset + len) {
+				if (trace)
+					pr_info("skb_segment: clone walk pos=%d target_end=%u i=%d nfrags=%d frag_skb=%px\n",
+						pos, offset + len, i, nfrags,
+						frag_skb);
+				if (trace && i >= nfrags)
+					pr_info("skb_segment: clone walk would BUG: pos=%d target_end=%u i=%d nfrags=%d list_skb=%px\n",
+						pos, offset + len, i, nfrags,
+						list_skb);
 				BUG_ON(i >= nfrags);
 
 				size = skb_frag_size(frag);
+				if (trace)
+					pr_info("skb_segment: clone walk frag[%d] page=%px off=%u size=%d pos=%d next=%d target_end=%u\n",
+						i, skb_frag_page(frag),
+						skb_frag_off(frag), size, pos,
+						pos + size, offset + len);
 				if (pos + size > offset + len)
 					break;
 
@@ -4916,6 +5027,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			list_skb = list_skb->next;
 
+			if (trace)
+				pr_info("skb_segment: clone walk done nskb=%px pos=%d i=%d next_list=%px\n",
+					nskb, pos, i, list_skb);
+
 			if (unlikely(pskb_trim(nskb, len))) {
 				kfree_skb(nskb);
 				goto err;
@@ -4945,6 +5060,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			skb_reserve(nskb, headroom);
 			__skb_put(nskb, doffset);
+
+			if (trace)
+				pr_info("skb_segment: alloc nskb=%px hsize=%d doffset=%u headroom=%u\n",
+					nskb, hsize, doffset, headroom);
 		}
 
 		if (segs)
@@ -4997,6 +5116,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
+				if (trace)
+					pr_info("skb_segment: sg source exhausted pos=%d target_end=%u old_frag_skb=%px old_nfrags=%d next_list=%px\n",
+						pos, offset + len, frag_skb,
+						nfrags, list_skb);
 				if (skb_orphan_frags(list_skb, GFP_ATOMIC) ||
 				    skb_zerocopy_clone(nskb, list_skb,
 						       GFP_ATOMIC))
@@ -5019,11 +5142,24 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 					frag--;
 				}
 
+				if (trace)
+					pr_info("skb_segment: sg entered frag_list skb=%px len=%u headlen=%u nr_frags=%d head_frag=%u start_i=%d next_list=%px\n",
+						frag_skb, frag_skb->len,
+						skb_headlen(frag_skb), nfrags,
+						frag_skb->head_frag, i,
+						list_skb->next);
+
 				list_skb = list_skb->next;
 			}
 
 			if (unlikely(skb_shinfo(nskb)->nr_frags >=
 				     MAX_SKB_FRAGS)) {
+				if (trace)
+					pr_info("skb_segment: sg output full nskb=%px nr_frags=%u max=%u pos=%d target_end=%u source_i=%d source_nfrags=%d source_skb=%px\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						MAX_SKB_FRAGS, pos,
+						offset + len, i, nfrags,
+						frag_skb);
 				net_warn_ratelimited(
 					"skb_segment: too many frags: %u %u\n",
 					pos, mss);
@@ -5035,18 +5171,46 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			__skb_frag_ref(nskb_frag);
 			size = skb_frag_size(nskb_frag);
 
+			if (trace)
+				pr_info("skb_segment: sg add nskb=%px out_frag=%u source_skb=%px source=%s[%d] page=%px off=%u size=%d pos=%d target=[%u,%u)\n",
+					nskb, skb_shinfo(nskb)->nr_frags,
+					frag_skb, i < 0 ? "head_frag" : "frag",
+					i, skb_frag_page(nskb_frag),
+					skb_frag_off(nskb_frag), size, pos,
+					offset, offset + len);
+
 			if (pos < offset) {
+				if (trace)
+					pr_info("skb_segment: sg trim front nskb=%px out_frag=%u trim=%u old_off=%u old_size=%d\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						offset - pos,
+						skb_frag_off(nskb_frag), size);
 				skb_frag_off_add(nskb_frag, offset - pos);
 				skb_frag_size_sub(nskb_frag, offset - pos);
+				if (trace)
+					pr_info("skb_segment: sg trim front done nskb=%px out_frag=%u new_off=%u new_size=%u\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						skb_frag_off(nskb_frag),
+						skb_frag_size(nskb_frag));
 			}
 
 			skb_shinfo(nskb)->nr_frags++;
 
 			if (pos + size <= offset + len) {
+				if (trace)
+					pr_info("skb_segment: sg consumed source nskb=%px out_nr_frags=%u pos=%d next_pos=%d next_i=%d\n",
+						nskb, skb_shinfo(nskb)->nr_frags,
+						pos, pos + size, i + 1);
 				i++;
 				frag++;
 				pos += size;
 			} else {
+				if (trace)
+					pr_info("skb_segment: sg trim tail nskb=%px out_frag=%u trim=%u pos=%d size=%d target_end=%u\n",
+						nskb,
+						skb_shinfo(nskb)->nr_frags - 1,
+						pos + size - (offset + len),
+						pos, size, offset + len);
 				skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
 				goto skip_fraglist;
 			}
@@ -5059,6 +5223,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		nskb->len += nskb->data_len;
 		nskb->truesize += nskb->data_len;
 
+		if (trace)
+			pr_info("skb_segment: segment built nskb=%px len=%u data_len=%u nr_frags=%u hsize=%d offset=%u seg_len=%u pos=%d i=%d next_offset=%u\n",
+				nskb, nskb->len, nskb->data_len,
+				skb_shinfo(nskb)->nr_frags, hsize, offset,
+				len, pos, i, offset + len);
+
 perform_csum_check:
 		if (!csum) {
 			if (skb_has_shared_frag(nskb) &&
@@ -5106,6 +5276,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size);
 	}
 
+	if (trace)
+		skb_segment_dbg_dump_list("output", segs);
+
 	/* Following permits correct backpressure, for protocols
 	 * using skb_set_owner_w().
 	 * Idea is to tranfert ownership from head_skb to last segment.
@@ -5118,6 +5291,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	return segs;
 
 err:
+	if (trace)
+		pr_info("skb_segment: error err=%d segs=%px tail=%px offset=%u len=%u pos=%d i=%d nfrags=%d list_skb=%px frag_skb=%px\n",
+			err, segs, tail, offset, len, pos, i, nfrags,
+			list_skb, frag_skb);
 	kfree_skb_list(segs);
 	return ERR_PTR(err);
 }
diff --git a/tools/testing/selftests/net/big_tcp_repro.sh b/tools/testing/selftests/net/big_tcp_repro.sh
new file mode 100755
index 000000000000..7c465938526d
--- /dev/null
+++ b/tools/testing/selftests/net/big_tcp_repro.sh
@@ -0,0 +1,226 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Reproducer for GRO frag_list failure in skb_segment.
+#
+# Topology:
+#
+#   client ns                 router ns                         server ns
+#   c0 10.0.0.1  <--veth-->  r0 10.0.0.2
+#                             r1 10.0.1.1  <--veth-->          s1 10.0.1.2
+#                             ptun0 10.0.2.1 <--outer vxlan--> ptun1 10.0.2.2
+#                             tun0 192.0.2.1 <--inner vxlan--> tun1 192.0.2.2
+#
+# The client emits non-GSO TCP packets. The router receives them on r0 with
+# plain GRO enabled, forwards the GRO skb into the inner VXLAN tunnel, then
+# sends the already-encapsulated skb through the outer VXLAN tunnel with
+# tx-gso-partial enabled. The outer tunnel forces the SKB into skb_segment for
+# partial GSO, where "skb_segment: too many frags" can be caught.
+
+set -euo pipefail
+
+CLIENT_NS=$(mktemp -u btcp-client-XXXXXXXX)
+ROUTER_NS=$(mktemp -u btcp-router-XXXXXXXX)
+SERVER_NS=$(mktemp -u btcp-server-XXXXXXXX)
+
+CLIENT_IP=10.0.0.1
+ROUTER_CLIENT_IP=10.0.0.2
+ROUTER_UNDERLAY_IP=10.0.1.1
+SERVER_UNDERLAY_IP=10.0.1.2
+ROUTER_OUTER_IP=10.0.2.1
+SERVER_OUTER_IP=10.0.2.2
+ROUTER_INNER_IP=192.0.2.1
+SERVER_INNER_IP=192.0.2.2
+
+NETPERF_TIME=${NETPERF_TIME:-10}
+NETPERF_WRITE=${NETPERF_WRITE:-262144}
+LOWER_MTU=${LOWER_MTU:-9000}
+SHOW_DMESG=${SHOW_DMESG:-1}
+CLEAR_DMESG=${CLEAR_DMESG:-0}
+
+OLD_HIGH_ORDER_ALLOC_DISABLE=
+
+require_command()
+{
+	if ! command -v "$1" >/dev/null 2>&1; then
+		echo "SKIP: missing $1"
+		exit 4
+	fi
+}
+
+cleanup()
+{
+	for ns in "$SERVER_NS" "$ROUTER_NS" "$CLIENT_NS"; do
+		ip netns pids "$ns" 2>/dev/null | xargs -r kill 2>/dev/null || true
+	done
+
+	ip netns del "$SERVER_NS" 2>/dev/null || true
+	ip netns del "$ROUTER_NS" 2>/dev/null || true
+	ip netns del "$CLIENT_NS" 2>/dev/null || true
+
+	if [ -n "$OLD_HIGH_ORDER_ALLOC_DISABLE" ]; then
+		sysctl -qw "net.core.high_order_alloc_disable=$OLD_HIGH_ORDER_ALLOC_DISABLE" || true
+	fi
+}
+
+ethtool_must()
+{
+	local ns=$1
+	local dev=$2
+
+	shift 2
+	ip netns exec "$ns" ethtool -K "$dev" "$@"
+}
+
+ethtool_try()
+{
+	local ns=$1
+	local dev=$2
+
+	shift 2
+	ip netns exec "$ns" ethtool -K "$dev" "$@" >/dev/null 2>&1 || true
+}
+
+setup_namespaces()
+{
+	ip netns add "$CLIENT_NS"
+	ip netns add "$ROUTER_NS"
+	ip netns add "$SERVER_NS"
+
+	for ns in "$CLIENT_NS" "$ROUTER_NS" "$SERVER_NS"; do
+		ip -n "$ns" link set lo up
+		ip netns exec "$ns" sysctl -qw net.ipv4.conf.all.rp_filter=0
+		ip netns exec "$ns" sysctl -qw net.ipv4.conf.default.rp_filter=0
+	done
+	ip netns exec "$ROUTER_NS" sysctl -qw net.ipv4.ip_forward=1
+	ip netns exec "$CLIENT_NS" sysctl -qw net.ipv4.tcp_wmem="4096 4194304 4194304"
+	ip netns exec "$SERVER_NS" sysctl -qw net.ipv4.tcp_rmem="4096 4194304 4194304"
+
+	ip -n "$CLIENT_NS" link add c0 type veth peer name r0 netns "$ROUTER_NS"
+	ip -n "$ROUTER_NS" link add r1 type veth peer name s1 netns "$SERVER_NS"
+
+	ip -n "$CLIENT_NS" addr add "$CLIENT_IP/24" dev c0
+	ip -n "$ROUTER_NS" addr add "$ROUTER_CLIENT_IP/24" dev r0
+	ip -n "$ROUTER_NS" addr add "$ROUTER_UNDERLAY_IP/24" dev r1
+	ip -n "$SERVER_NS" addr add "$SERVER_UNDERLAY_IP/24" dev s1
+
+	ip -n "$ROUTER_NS" link set dev r1 mtu "$LOWER_MTU"
+	ip -n "$SERVER_NS" link set dev s1 mtu "$LOWER_MTU"
+
+	ip -n "$CLIENT_NS" link set c0 up
+	ip -n "$ROUTER_NS" link set r0 up
+	ip -n "$ROUTER_NS" link set r1 up
+	ip -n "$SERVER_NS" link set s1 up
+
+	ip -n "$CLIENT_NS" route add "$SERVER_INNER_IP/32" via "$ROUTER_CLIENT_IP" dev c0
+}
+
+setup_tunnels()
+{
+	ip -n "$ROUTER_NS" link add ptun0 type vxlan \
+		id 100 local "$ROUTER_UNDERLAY_IP" remote "$SERVER_UNDERLAY_IP" dev r1 dstport 4790
+	ip -n "$SERVER_NS" link add ptun1 type vxlan \
+		id 100 local "$SERVER_UNDERLAY_IP" remote "$ROUTER_UNDERLAY_IP" dev s1 dstport 4790
+
+	ip -n "$ROUTER_NS" addr add "$ROUTER_OUTER_IP/24" dev ptun0
+	ip -n "$SERVER_NS" addr add "$SERVER_OUTER_IP/24" dev ptun1
+	ip -n "$ROUTER_NS" link set ptun0 up
+	ip -n "$SERVER_NS" link set ptun1 up
+
+	ip -n "$ROUTER_NS" link add tun0 type vxlan \
+		id 200 local "$ROUTER_OUTER_IP" remote "$SERVER_OUTER_IP" dev ptun0 dstport 4789
+	ip -n "$SERVER_NS" link add tun1 type vxlan \
+		id 200 local "$SERVER_OUTER_IP" remote "$ROUTER_OUTER_IP" dev ptun1 dstport 4789
+
+	ip -n "$ROUTER_NS" addr add "$ROUTER_INNER_IP/24" dev tun0
+	ip -n "$SERVER_NS" addr add "$SERVER_INNER_IP/24" dev tun1
+	ip -n "$ROUTER_NS" link set tun0 up
+	ip -n "$SERVER_NS" link set tun1 up
+
+	ip -n "$SERVER_NS" route add "$CLIENT_IP/32" via "$ROUTER_INNER_IP" dev tun1
+}
+
+setup_offloads()
+{
+	# Client must put non-GSO packets onto c0 so router-side r0 can GRO them.
+	ethtool_must "$CLIENT_NS" c0 tso off gso off
+	ethtool_try "$CLIENT_NS" c0 tx-gso-partial off
+	ethtool_try "$CLIENT_NS" c0 tx-udp_tnl-segmentation off
+	ethtool_try "$CLIENT_NS" c0 tx-udp_tnl-csum-segmentation off
+
+	# Router ingress: normal GRO, not rx-gro-list.  We want skb_gro_receive()
+	# to fill order-0 frags and then use frag_list as the fallback.
+	ethtool_must "$ROUTER_NS" r0 gro on
+	ethtool_try "$ROUTER_NS" r0 rx-gro-list off
+
+	# Outer tunnel: this is the partial-GSO-capable software egress.
+	ethtool_must "$ROUTER_NS" ptun0 \
+		tx-gso-partial on \
+		tx-udp_tnl-segmentation on \
+		tx-udp_tnl-csum-segmentation on
+
+	# Lower veth must not absorb the whole outer tunnel GSO packet.
+	ethtool_must "$ROUTER_NS" r1 tso off gso off
+	ethtool_try "$ROUTER_NS" r1 tx-gso-partial off
+	ethtool_try "$ROUTER_NS" r1 tx-udp_tnl-segmentation off
+	ethtool_try "$ROUTER_NS" r1 tx-udp_tnl-csum-segmentation off
+}
+
+show_relevant_state()
+{
+	echo "Client TX offloads:"
+	ip netns exec "$CLIENT_NS" ethtool -k c0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|tx-gso-partial|tx-udp_tnl' || true
+	echo
+	echo "Router GRO ingress:"
+	ip -n "$ROUTER_NS" -d link show r0 | grep -E 'gso|gro' || true
+	ip netns exec "$ROUTER_NS" ethtool -k r0 | grep -E 'generic-receive-offload|rx-gro-list' || true
+	echo
+	echo "Router partial-GSO outer tunnel:"
+	ip netns exec "$ROUTER_NS" ethtool -k ptun0 | grep -E 'tx-gso-partial|tx-udp_tnl' || true
+}
+
+run_traffic()
+{
+	ip netns exec "$SERVER_NS" netserver >/dev/null
+	sleep 1
+
+	if [ "$CLEAR_DMESG" = 1 ]; then
+		dmesg -C || true
+	fi
+
+	echo "Running TCP_STREAM from $CLIENT_NS to $SERVER_INNER_IP for ${NETPERF_TIME}s"
+	ip netns exec "$CLIENT_NS" netperf -H "$SERVER_INNER_IP" \
+		-t TCP_STREAM -l "$NETPERF_TIME" -- -m "$NETPERF_WRITE" >/dev/null
+
+	if [ "$SHOW_DMESG" = 1 ]; then
+		echo
+		echo "Recent skb_segment logs:"
+		dmesg | grep 'skb_segment' || true
+	fi
+}
+
+require_command ip
+require_command ethtool
+require_command netperf
+require_command netserver
+
+if [ "$(id -u)" -ne 0 ]; then
+	echo "SKIP: must be run as root"
+	exit 4
+fi
+
+if ! { ip link help 2>&1 || :; } | grep -q gso_ipv4_max_size; then
+	echo "SKIP: iproute2 does not support gso/gro IPv4 max size knobs"
+	exit 4
+fi
+
+trap cleanup EXIT
+
+OLD_HIGH_ORDER_ALLOC_DISABLE=$(sysctl -n net.core.high_order_alloc_disable 2>/dev/null || true)
+#sysctl -qw net.core.high_order_alloc_disable=1
+
+setup_namespaces
+setup_tunnels
+setup_offloads
+show_relevant_state
+run_traffic
--cut--

Thanks,
Alice

^ permalink raw reply related

* [PATCH net v3 2/2] selftests/bpf: Add LWT encap tests for skb metadata
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team
In-Reply-To: <20260619-bpf-lwt-drop-skb-metadata-v3-0-71d6a33ab76b@cloudflare.com>

Test that an LWT encapsulation does not silently corrupt XDP metadata
sitting in the skb headroom. Exercise all three LWT dispatch paths
(.xmit, .input, .output) with four encap types:

- BPF LWT xmit prog reserves headroom on the LWT .xmit redirect,
- mpls pushes an MPLS label on the LWT .xmit redirect,
- seg6 in encap mode runs on the LWT .input redirect,
- ioam6 encap inserts an IOAM Hop-by-Hop option on the LWT .output redirect.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/config                 |   3 +
 .../bpf/prog_tests/xdp_context_test_run.c          | 175 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  | 123 +++++++++------
 3 files changed, 249 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index bac60b444551..adb25146e88c 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -45,13 +45,16 @@ CONFIG_IPV6=y
 CONFIG_IPV6_FOU=y
 CONFIG_IPV6_FOU_TUNNEL=y
 CONFIG_IPV6_GRE=y
+CONFIG_IPV6_IOAM6_LWTUNNEL=y
 CONFIG_IPV6_SEG6_BPF=y
+CONFIG_IPV6_SEG6_LWTUNNEL=y
 CONFIG_IPV6_SIT=y
 CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LIVEPATCH=y
 CONFIG_LWTUNNEL=y
+CONFIG_LWTUNNEL_BPF=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
index 26159e0499c7..448807676176 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include <network_helpers.h>
+#include <linux/ipv6.h>
+#include <arpa/inet.h>
 #include "test_xdp_context_test_run.skel.h"
 #include "test_xdp_meta.skel.h"
 
@@ -8,9 +10,12 @@
 #define TX_NAME "veth1"
 #define TX_NETNS "xdp_context_tx"
 #define RX_NETNS "xdp_context_rx"
+#define RX_MAC "02:00:00:00:00:01"
+#define TX_MAC "02:00:00:00:00:02"
 #define TAP_NAME "tap0"
 #define DUMMY_NAME "dum0"
 #define TAP_NETNS "xdp_context_tuntap"
+#define LWT_NETNS "xdp_context_lwt"
 
 #define TEST_PAYLOAD_LEN 32
 static const __u8 test_payload[TEST_PAYLOAD_LEN] = {
@@ -187,6 +192,42 @@ static int write_test_packet(int tap_fd)
 	return 0;
 }
 
+/* Inject Ethernet+IPv6+UDP frame into TAP */
+static int write_test_packet_udp(int tap_fd)
+{
+	__u8 pkt[sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		 sizeof(struct udphdr) + TEST_PAYLOAD_LEN] = {};
+	struct ethhdr *eth = (void *)pkt;
+	struct ipv6hdr *ip6 = (void *)(eth + 1);
+	struct udphdr *udp = (void *)(ip6 + 1);
+	__u8 *payload = (void *)(udp + 1);
+	const __u8 tap_mac[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x01 };
+	int n;
+
+	memcpy(eth->h_dest, tap_mac, ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IPV6);
+
+	ip6->version = 6;
+	ip6->hop_limit = 64;
+	ip6->nexthdr = IPPROTO_UDP;
+	ip6->payload_len = htons(sizeof(*udp) + TEST_PAYLOAD_LEN);
+	inet_pton(AF_INET6, "fd00::2", &ip6->saddr);
+	inet_pton(AF_INET6, "fd00:1::1", &ip6->daddr);
+
+	udp->source = htons(42);
+	udp->dest = htons(42);
+	udp->len = htons(sizeof(*udp) + TEST_PAYLOAD_LEN);
+	/* UDP checksum is not validated on the forwarding path. */
+
+	memcpy(payload, test_payload, TEST_PAYLOAD_LEN);
+
+	n = write(tap_fd, pkt, sizeof(pkt));
+	if (!ASSERT_EQ(n, sizeof(pkt), "write frame"))
+		return -1;
+
+	return 0;
+}
+
 static void dump_err_stream(const struct bpf_program *prog)
 {
 	char buf[512];
@@ -518,3 +559,137 @@ void test_xdp_context_tuntap(void)
 
 	test_xdp_meta__destroy(skel);
 }
+
+/*
+ * Test topology:
+ *
+ *	tap0 fd00::1
+ *	  RX:  injected IPv6 UDP frame, XDP ingress sets metadata
+ *	  fwd: encap route prepends outer header(s)
+ *	  TX:  TC egress validates metadata
+ *
+ * A routable IPv6 UDP frame is written into the tap fd, so it enters the RX
+ * path where XDP stores metadata. Routing then forwards it back out the same
+ * tap through an encapsulating route that prepends outer header(s). The TC
+ * egress program checks that the pushed header did not silently corrupt
+ * metadata.
+ */
+#define LWT_PIN_PATH "/sys/fs/bpf/xdp_context_lwt_xmit"
+
+enum lwt_encap_type {
+	LWT_ENCAP_BPF,
+	LWT_ENCAP_MPLS,
+	LWT_ENCAP_SEG6,
+	LWT_ENCAP_IOAM6,
+};
+
+static void test_lwt_encap(struct test_xdp_meta *skel,
+			   enum lwt_encap_type type)
+{
+	LIBBPF_OPTS(bpf_tc_hook, tc_hook, .attach_point = BPF_TC_EGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	struct bpf_program *lwt_prog = NULL;
+	struct netns_obj *ns = NULL;
+	const char *encap;
+	bool pinned = false;
+	int tap_ifindex;
+	int tap_fd = -1;
+	int ret;
+
+	skel->bss->test_pass = false;
+
+	switch (type) {
+	case LWT_ENCAP_BPF:
+		encap = "encap bpf xmit pinned " LWT_PIN_PATH " via fd00::2";
+		lwt_prog = skel->progs.dummy_lwt_xmit;
+		break;
+	case LWT_ENCAP_MPLS:
+		encap = "encap mpls 100 via inet6 fd00::2";
+		break;
+	case LWT_ENCAP_SEG6:
+		encap = "encap seg6 mode encap segs fd00::2";
+		break;
+	case LWT_ENCAP_IOAM6:
+		encap = "encap ioam6 mode encap tundst fd00::2 "
+			"trace prealloc type 0x800000 ns 0 size 4 via fd00::2";
+		break;
+	default:
+		return;
+	}
+
+	if (lwt_prog) {
+		unlink(LWT_PIN_PATH);
+		ret = bpf_program__pin(lwt_prog, LWT_PIN_PATH);
+		if (!ASSERT_OK(ret, "pin lwt prog"))
+			return;
+		pinned = true;
+	}
+
+	ns = netns_new(LWT_NETNS, true);
+	if (!ASSERT_OK_PTR(ns, "netns_new"))
+		goto close;
+
+	tap_fd = open_tuntap(TAP_NAME, true);
+	if (!ASSERT_GE(tap_fd, 0, "open_tuntap"))
+		goto close;
+
+	SYS(close, "ip link set dev " TAP_NAME " address " RX_MAC);
+	SYS(close, "sysctl -wq net.ipv6.conf.all.forwarding=1");
+	SYS(close, "ip addr add fd00::1/64 dev " TAP_NAME " nodad");
+	SYS(close, "ip link set dev " TAP_NAME " up");
+	SYS(close, "ip neigh add fd00::2 lladdr " TX_MAC " nud permanent dev " TAP_NAME);
+	SYS(close, "ip -6 route add fd00:1::/64 %s dev %s", encap, TAP_NAME);
+
+	tap_ifindex = if_nametoindex(TAP_NAME);
+	if (!ASSERT_GT(tap_ifindex, 0, "if_nametoindex"))
+		goto close;
+
+	ret = bpf_xdp_attach(tap_ifindex, bpf_program__fd(skel->progs.ing_xdp),
+			     0, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto close;
+
+	tc_hook.ifindex = tap_ifindex;
+	ret = bpf_tc_hook_create(&tc_hook);
+	if (!ASSERT_OK(ret, "bpf_tc_hook_create"))
+		goto close;
+
+	tc_opts.prog_fd = bpf_program__fd(skel->progs.tc_is_meta_empty);
+	ret = bpf_tc_attach(&tc_hook, &tc_opts);
+	if (!ASSERT_OK(ret, "bpf_tc_attach"))
+		goto close;
+
+	ret = write_test_packet_udp(tap_fd);
+	if (!ASSERT_OK(ret, "write_test_packet_udp"))
+		goto close;
+
+	if (!ASSERT_TRUE(skel->bss->test_pass, "test_pass"))
+		dump_err_stream(skel->progs.tc_is_meta_empty);
+
+close:
+	if (tap_fd >= 0)
+		close(tap_fd);
+	netns_free(ns);
+	if (pinned)
+		unlink(LWT_PIN_PATH);
+}
+
+void test_xdp_context_lwt_encap(void)
+{
+	struct test_xdp_meta *skel;
+
+	skel = test_xdp_meta__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open and load skeleton"))
+		return;
+
+	if (test__start_subtest("bpf_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_BPF);
+	if (test__start_subtest("mpls_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_MPLS);
+	if (test__start_subtest("seg6_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_SEG6);
+	if (test__start_subtest("ioam6_encap"))
+		test_lwt_encap(skel, LWT_ENCAP_IOAM6);
+
+	test_xdp_meta__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_meta.c b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
index fa73b17cb999..08b03be0b891 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_meta.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
@@ -21,10 +21,6 @@
 
 bool test_pass;
 
-static const __u8 smac_want[ETH_ALEN] = {
-	0x12, 0x34, 0xDE, 0xAD, 0xBE, 0xEF,
-};
-
 static const __u8 meta_want[META_SIZE] = {
 	0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
 	0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,
@@ -32,11 +28,6 @@ static const __u8 meta_want[META_SIZE] = {
 	0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38,
 };
 
-static bool check_smac(const struct ethhdr *eth)
-{
-	return !__builtin_memcmp(eth->h_source, smac_want, ETH_ALEN);
-}
-
 static bool check_metadata(const char *file, int line, __u8 *meta_have)
 {
 	if (!__builtin_memcmp(meta_have, meta_want, META_SIZE))
@@ -280,18 +271,47 @@ int ing_cls_dynptr_offset_oob(struct __sk_buff *ctx)
 	return TC_ACT_SHOT;
 }
 
+/* Test packets carry test metadata pattern as payload. */
+static bool is_test_packet_xdp(struct xdp_md *ctx)
+{
+	__u8 meta_have[META_SIZE];
+	__u32 len;
+
+	len = bpf_xdp_get_buff_len(ctx);
+	if (len < META_SIZE)
+		return false;
+	if (bpf_xdp_load_bytes(ctx, len - META_SIZE, meta_have, META_SIZE))
+		return false;
+	if (__builtin_memcmp(meta_have, meta_want, META_SIZE))
+		return false;
+
+	return true;
+}
+
+/* Test packets carry test metadata pattern as payload. */
+static bool is_test_packet_tc(struct __sk_buff *ctx)
+{
+	__u8 meta_have[META_SIZE];
+
+	if (ctx->len < META_SIZE)
+		return false;
+	if (bpf_skb_load_bytes(ctx, ctx->len - META_SIZE, meta_have, META_SIZE))
+		return false;
+	if (__builtin_memcmp(meta_have, meta_want, META_SIZE))
+		return false;
+
+	return true;
+}
+
 /* Reserve and clear space for metadata but don't populate it */
 SEC("xdp")
 int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 {
-	struct ethhdr *eth = ctx_ptr(ctx, data);
 	__u8 *meta;
 	int ret;
 
 	/* Drop any non-test packets */
-	if (eth + 1 > ctx_ptr(ctx, data_end))
-		return XDP_DROP;
-	if (!check_smac(eth))
+	if (!is_test_packet_xdp(ctx))
 		return XDP_DROP;
 
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
@@ -310,33 +330,24 @@ int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 SEC("xdp")
 int ing_xdp(struct xdp_md *ctx)
 {
-	__u8 *data, *data_meta, *data_end, *payload;
-	struct ethhdr *eth;
+	__u8 *data, *data_meta;
 	int ret;
 
+	/* Drop any non-test packets */
+	if (!is_test_packet_xdp(ctx))
+		return XDP_DROP;
+
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
 	if (ret < 0)
 		return XDP_DROP;
 
 	data_meta = ctx_ptr(ctx, data_meta);
-	data_end  = ctx_ptr(ctx, data_end);
 	data      = ctx_ptr(ctx, data);
 
-	eth = (struct ethhdr *)data;
-	payload = data + sizeof(struct ethhdr);
-
-	if (payload + META_SIZE > data_end ||
-	    data_meta + META_SIZE > data)
+	if (data_meta + META_SIZE > data)
 		return XDP_DROP;
 
-	/* The Linux networking stack may send other packets on the test
-	 * interface that interfere with the test. Just drop them.
-	 * The test packets can be recognized by their source MAC address.
-	 */
-	if (!check_smac(eth))
-		return XDP_DROP;
-
-	__builtin_memcpy(data_meta, payload, META_SIZE);
+	__builtin_memcpy(data_meta, meta_want, META_SIZE);
 	return XDP_PASS;
 }
 
@@ -353,7 +364,7 @@ int clone_data_meta_survives_data_write(struct __sk_buff *ctx)
 	if (eth + 1 > ctx_ptr(ctx, data_end))
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	if (meta_have + META_SIZE > eth)
@@ -383,7 +394,7 @@ int clone_data_meta_survives_meta_write(struct __sk_buff *ctx)
 	if (eth + 1 > ctx_ptr(ctx, data_end))
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	if (meta_have + META_SIZE > eth)
@@ -416,7 +427,7 @@ int clone_meta_dynptr_survives_data_slice_write(struct __sk_buff *ctx)
 	if (!eth)
 		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
@@ -436,16 +447,11 @@ int clone_meta_dynptr_survives_data_slice_write(struct __sk_buff *ctx)
 SEC("tc")
 int clone_meta_dynptr_survives_meta_slice_write(struct __sk_buff *ctx)
 {
-	struct bpf_dynptr data, meta;
-	const struct ethhdr *eth;
+	struct bpf_dynptr meta;
 	__u8 *meta_have;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
@@ -471,15 +477,10 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 {
 	struct bpf_dynptr data, meta;
 	__u8 meta_have[META_SIZE];
-	const struct ethhdr *eth;
 	int err;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	/* Expect read-write metadata before unclone */
@@ -492,6 +493,7 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 		goto out;
 
 	/* Helper write to payload will unclone the packet */
+	bpf_dynptr_from_skb(ctx, 0, &data);
 	bpf_dynptr_write(&data, offsetof(struct ethhdr, h_proto), "x", 1, 0);
 
 	err = bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0);
@@ -511,17 +513,12 @@ int clone_meta_dynptr_rw_before_data_dynptr_write(struct __sk_buff *ctx)
 SEC("tc")
 int clone_meta_dynptr_rw_before_meta_dynptr_write(struct __sk_buff *ctx)
 {
-	struct bpf_dynptr data, meta;
+	struct bpf_dynptr meta;
 	__u8 meta_have[META_SIZE];
-	const struct ethhdr *eth;
 	int err;
 
-	bpf_dynptr_from_skb(ctx, 0, &data);
-	eth = bpf_dynptr_slice(&data, 0, NULL, sizeof(*eth));
-	if (!eth)
-		goto out;
 	/* Ignore non-test packets */
-	if (!check_smac(eth))
+	if (!is_test_packet_tc(ctx))
 		goto out;
 
 	/* Expect read-write metadata before unclone */
@@ -545,6 +542,28 @@ int clone_meta_dynptr_rw_before_meta_dynptr_write(struct __sk_buff *ctx)
 	return TC_ACT_SHOT;
 }
 
+SEC("lwt_xmit")
+int dummy_lwt_xmit(struct __sk_buff *ctx)
+{
+	if (bpf_skb_change_head(ctx, sizeof(struct ipv6hdr), 0))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+SEC("tc")
+int tc_is_meta_empty(struct __sk_buff *ctx)
+{
+	if (!is_test_packet_tc(ctx))
+		return TC_ACT_OK;
+
+	if (ctx->data_meta != ctx->data)
+		return TC_ACT_OK;
+
+	test_pass = true;
+	return TC_ACT_OK;
+}
+
 SEC("tc")
 int helper_skb_vlan_push_pop(struct __sk_buff *ctx)
 {

-- 
2.43.0


^ permalink raw reply related

* [PATCH net v3 1/2] net: lwtunnel: Drop skb metadata before LWT encapsulation
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team
In-Reply-To: <20260619-bpf-lwt-drop-skb-metadata-v3-0-71d6a33ab76b@cloudflare.com>

skb metadata is meant for passing information between XDP and TC. It lives
in the skb headroom, immediately before skb->data. LWT programs cannot
access the __sk_buff->data_meta pseudo-pointer to metadata.

However, LWT encapsulation prepends outer headers, moving skb->data back
over the headroom where the metadata sits. On an RX-originated (forwarded)
packet that still carries XDP metadata this goes wrong in two different
ways, depending on the encap type:

1. Non-BPF LWT encaps (mpls, seg6, ioam6 ...) call skb_push()/skb_pull()
   and silently overwrite the metadata that sits in the headroom.

2. BPF LWT xmit calls bpf_skb_change_head(), which uses skb_data_move().
   That helper expects metadata immediately before skb->data. But since
   the IP output path runs LWT xmit before neighbour output has built
   the outgoing L2 header, for forwarded packets skb->data points at the
   L3 header while skb_mac_header() still points at the old L2 header.
   skb_data_move() sees metadata ending at skb_mac_header(), not before
   skb->data, warns and clears metadata:

  WARNING: CPU: 21 PID: 454557 at include/linux/skbuff.h:4609 skb_data_move+0x47/0x90
  CPU: 21 UID: 0 PID: 454557 Comm: napi/iconduit-g Tainted: G           O        6.18.21 #1
  RIP: 0010:skb_data_move+0x47/0x90
  Call Trace:
   <IRQ>
   bpf_skb_change_head+0xe6/0x1a0
   bpf_prog_...+0x213/0x2e3
   run_lwt_bpf.isra.0+0x1d3/0x360
   bpf_xmit+0x46/0xe0
   lwtunnel_xmit+0xa1/0xf0
   ip_finish_output2+0x1e7/0x5e0
   ip_output+0x63/0x100
   __netif_receive_skb_one_core+0x85/0xa0
   process_backlog+0x9c/0x150
   __napi_poll+0x2b/0x190
   net_rx_action+0x40b/0x7f0
   handle_softirqs+0xd2/0x270
   do_softirq+0x3f/0x60
   </IRQ>
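
Case 1 above can be illustrated with a userspace toy model. This is only
a sketch under stated assumptions: struct toy_skb and every toy_*() helper
are hypothetical stand-ins, not kernel code, and the real sk_buff layout is
far richer. It shows that pushing an outer header into the headroom
overwrites bytes that the skb still advertises as metadata, unless the
metadata is dropped first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE  256
#define TOY_META_SIZE 32

/* Hypothetical stand-in for the sk_buff fields that matter here. */
struct toy_skb {
	uint8_t buf[TOY_BUF_SIZE];
	size_t data;      /* offset of the first packet byte */
	size_t meta_end;  /* offset where the metadata area ends */
	size_t meta_len;  /* 0 means "no metadata" */
};

/* Place the packet at 'headroom' and store metadata right before it. */
static void toy_init(struct toy_skb *skb, size_t headroom, const uint8_t *meta)
{
	assert(headroom >= TOY_META_SIZE && headroom < TOY_BUF_SIZE);
	memset(skb->buf, 0, sizeof(skb->buf));
	skb->data = headroom;
	skb->meta_end = headroom;
	skb->meta_len = TOY_META_SIZE;
	memcpy(skb->buf + skb->meta_end - TOY_META_SIZE, meta, TOY_META_SIZE);
}

/* Modeled on skb_metadata_clear(): forget the metadata, keep the bytes. */
static void toy_metadata_clear(struct toy_skb *skb)
{
	skb->meta_len = 0;
}

/* Modeled on skb_push() plus writing an outer header of hdr_len bytes. */
static void toy_encap(struct toy_skb *skb, size_t hdr_len)
{
	assert(hdr_len <= skb->data);
	skb->data -= hdr_len;
	memset(skb->buf + skb->data, 0xEE, hdr_len);
}

/* True if the skb advertises metadata whose bytes no longer match 'want'. */
static bool toy_meta_corrupted(const struct toy_skb *skb, const uint8_t *want)
{
	if (skb->meta_len == 0)
		return false;
	return memcmp(skb->buf + skb->meta_end - skb->meta_len, want,
		      skb->meta_len) != 0;
}
```

In this model, toy_init() followed by toy_encap() with a 40-byte header
leaves toy_meta_corrupted() true, while calling toy_metadata_clear() before
toy_encap() leaves it false because nothing is advertised any more.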

That is what happens. As for how to fix it: a received packet that
carries metadata can reach an encap through any of the three LWT
redirect modes:

  LWTUNNEL_STATE_INPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       lwtunnel_input

  LWTUNNEL_STATE_OUTPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             lwtunnel_output

  LWTUNNEL_STATE_XMIT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             ip6_output
               ip6_finish_output
                 ip6_finish_output2
                   lwtunnel_xmit

Every encap funnels through the three LWT dispatch helpers, so drop the
metadata there, right before handing the skb to the encap op. This
single chokepoint covers all encap types and all three redirect modes:

  - lwtunnel_input():  seg6, rpl, ila, seg6_local
  - lwtunnel_output(): ioam6
  - lwtunnel_xmit():   mpls, LWT BPF xmit
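
The chokepoint idea can be sketched in userspace C as follows. This is an
illustration only: the toy_* types, the ops table and toy_lwtunnel_xmit()
are hypothetical and merely mirror the shape of the dispatch helpers; the
actual change is the diff below. Whichever encap op is registered, the
dispatcher drops the metadata before the op runs, so no op needs its own
fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct toy_skb {
	unsigned int meta_len;   /* bytes of metadata in the headroom */
	unsigned int meta_seen;  /* what the encap op observed (demo only) */
};

struct toy_encap_ops {
	int (*xmit)(struct toy_skb *skb);
};

enum { TOY_ENCAP_MPLS, TOY_ENCAP_BPF, TOY_ENCAP_MAX };

/* Demo op: records how much metadata was still visible when it ran. */
static int toy_record_xmit(struct toy_skb *skb)
{
	skb->meta_seen = skb->meta_len;
	return 0;
}

static const struct toy_encap_ops toy_mpls_ops = { .xmit = toy_record_xmit };
static const struct toy_encap_ops toy_bpf_ops  = { .xmit = toy_record_xmit };

static const struct toy_encap_ops *toy_encaps[TOY_ENCAP_MAX] = {
	[TOY_ENCAP_MPLS] = &toy_mpls_ops,
	[TOY_ENCAP_BPF]  = &toy_bpf_ops,
};

/* Single dispatch point: clear metadata, then hand over to the encap op. */
static int toy_lwtunnel_xmit(struct toy_skb *skb, int type)
{
	const struct toy_encap_ops *ops;

	assert(type >= 0 && type < TOY_ENCAP_MAX);
	ops = toy_encaps[type];
	if (!ops || !ops->xmit)
		return -1;

	skb->meta_len = 0;	/* the op is about to push outer headers */
	return ops->xmit(skb);
}
```

With an skb whose meta_len starts at 32, both TOY_ENCAP_MPLS and
TOY_ENCAP_BPF observe meta_seen == 0 in this model.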

Alternatively, we could clear the metadata right after the TC ingress hook.
That would require a compromise, however. Metadata would become
inaccessible from TC egress (in setups where it actually reaches the
hook intact, that is, without any L2 tunnels on the path).

Fixes: 8989d328dfe7 ("net: Helper to move packet data and metadata after skb_push/pull")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/lwtunnel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index f9d76d85d04f..b01a395d9a96 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -350,6 +350,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->output)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->output(net, sk, skb);
 		dev_xmit_recursion_dec();
@@ -404,6 +406,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->xmit)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->xmit(skb);
 		dev_xmit_recursion_dec();
@@ -455,6 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
 	if (likely(ops && ops->input)) {
+		/* Encap pushes outer headers over the metadata; drop it. */
+		skb_metadata_clear(skb);
 		dev_xmit_recursion_inc();
 		ret = ops->input(skb);
 		dev_xmit_recursion_dec();

-- 
2.43.0

^ permalink raw reply related

* [PATCH net v3 0/2] Drop skb metadata before LWT encapsulation
From: Jakub Sitnicki @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Martin KaFai Lau
  Cc: netdev, bpf, kernel-team

See description for patch 1.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Changes in v3:
- Clear metadata for non-BPF LWT encaps as well (Sashiko)
- Add selftests for LWT encap + XDP metadata
- Link to v2: https://lore.kernel.org/r/20260514-bpf-lwt-drop-skb-metadata-v2-1-458664edc2b5@cloudflare.com

Changes in v2:
- Clear metadata in bpf_xmit to allow access from tc(x) egress (Daniel)
- Add WARNING snippet to the description
- Link to v1: https://lore.kernel.org/r/20260428-wip-skb-local-storage-from-scratch-v1-1-8f7ca9b378ce@cloudflare.com

---
Jakub Sitnicki (2):
      net: lwtunnel: Drop skb metadata before LWT encapsulation
      selftests/bpf: Add LWT encap tests for skb metadata

 net/core/lwtunnel.c                                |   6 +
 tools/testing/selftests/bpf/config                 |   3 +
 .../bpf/prog_tests/xdp_context_test_run.c          | 175 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  | 123 +++++++++------
 4 files changed, 255 insertions(+), 52 deletions(-)


^ permalink raw reply

* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Ivan Vecera @ 2026-06-19 17:07 UTC (permalink / raw)
  To: Kubalewski, Arkadiusz, Jiri Pirko, Vadim Fedorenko,
	Jakub Kicinski
  Cc: netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
	Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
	Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
	linux-kernel@vger.kernel.org
In-Reply-To: <CH3PR11MB8749910F17977B951A8B12CA9BE42@CH3PR11MB8749.namprd11.prod.outlook.com>

On 6/17/26 1:59 PM, Kubalewski, Arkadiusz wrote:
>> From: Ivan Vecera <ivecera@redhat.com>
>> Sent: Monday, June 15, 2026 2:00 PM
>>
>> On 6/11/26 2:09 PM, Jiri Pirko wrote:
>>> Wed, Jun 10, 2026 at 05:45:46PM +0200, ivecera@redhat.com wrote:
>>>> On 6/10/26 3:04 PM, Kubalewski, Arkadiusz wrote:
>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>> Sent: Tuesday, June 9, 2026 4:59 PM
>>>>>>
>>>>>> On 6/9/26 4:00 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>> Sent: Tuesday, June 9, 2026 10:51 AM
>>>>>>>>
>>>>>>>> Mon, Jun 08, 2026 at 07:03:46PM +0200,
>>>>>>>> arkadiusz.kubalewski@intel.com
>>>>>>>> wrote:
>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>> Sent: Monday, June 8, 2026 5:48 PM
>>>>>>>>>>
>>>>>>>>>> On 6/8/26 4:43 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>>>> Sent: Sunday, May 31, 2026 9:44 PM ...
>>>>>>>>>>>>            -
>>>>>>>>>>>>              name: gnss
>>>>>>>>>>>>              doc: GNSS recovered clock
>>>>>>>>>>>> +      -
>>>>>>>>>>>> +        name: int-nco
>>>>>>>>>>>> +        doc: |
>>>>>>>>>>>> +          Device internal numerically controlled oscillator.
>>>>>>>>>>>> +          When connected as a DPLL input, the DPLL enters NCO mode
>>>>>>>>>>>> +          where the output frequency is adjusted by the host via
>>>>>>>>>>>> +          the PTP clock interface.
>>>>>>>>>>>
>>>>>>>>>>> Hi Ivan!
>>>>>>>>>>>
>>>>>>>>>>> How would you control this in case of automatic mode dpll?
>>>>>>>>>>> Automatic mode DPLL shall be controlled on HW level, such pin
>>>>>>>>>>> breaks that rule and requires some driver magic to show it is
>>>>>>>>>>> higher priority than the rest of the pins?
>>>>>>>>>>
>>>>>>>>>> The NCO pin can be connected only in manual mode. In other words,
>>>>>>>>>> a DPLL in automatic mode cannot select the NCO pin (switch to NCO
>>>>>>>>>> mode) on its own.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being picky on DPLL_MODE for enabling feature is not something we
>>>>>>>>> can allow if it is not related to HW limitation, is it?
>>>>>>>>> Could you please elaborate why it is not possible for AUTOMATIC
>>>>>>>>> mode?
>>>>>>>>
>>>>>>>> In automatic mode, the pin selection logic is defined upon prio. I
>>>>>>>> can imagine that if NCO pin has the highest prio of the available
>>>>>>>> ones, it gets picked. I would be aligned 100% with automatic mode
>>>>>>>> behaviour.
>>>>>>>> Is there a real usecase for it?
>>>>>>>>
>>>>>>>> [..]
>>>>>>>
>>>>>>> This is not true. AUTOMATIC mode is a HW solution; the SW driver ONLY
>>>>>>> configures priorities on the inputs, it does not manage the active
>>>>>>> inputs. This breaks that behavior: the SW driver would have to
>>>>>>> manually override the AUTOMATIC mode to be fed from such an NCO pin,
>>>>>>> as it doesn't exist on its priority list, so HW cannot pick or use it.
>>>>>>
>>>>>> Correct, AUTO mode is hardware feature and it should not be emulated
>>>>>> by a driver. If the hardware does not support it then the switching
>>>>>> between input references should be done by userspace (by monitoring
>>>>>> ffo, phase_offset, operstate).
>>>>>>
>>>>>
>>>>> Yes, exactly, so for AUTOMATIC mode HW it will not be possible to
>>>>> create such pin, which means that NCO pin would serve only a MANUAL
>>>>> mode implementation.
>>>>> Basically this is something we shall not allow to happen. DPLL API
>>>>> should be designed to cover the case where AUTO mode is able to
>>>>> implement all features consistently.
>>>>
>>>> If you don't like the proposal from Jiri (NCO switch driven by NCO pin
>>>> priority -> highest==enter_nco else leave_nco) then it could be
>>>> possible to handle the switching by allowing the state 'connected' in
>>>> AUTO mode for the NCO pin type. Then the implementation will be the
>>>> same for both selection modes.
>>>>
>>>> Only difference would be that a user does not need to switch the device
>>>> from the AUTO to MANUAL mode.
>>>>
>>>>>>> The real use case is that any DPLL can switch the mode to this one
>>>>>>> instead of implementing MANUAL mode just to use the feature with a
>>>>>>> 'virtual' pin.
>>>>>>
>>>>>> I don't expect this... but it is up to a driver. I don't plan such
>>>>>> functionality in zl3073x as the NCO pin does not expose prio_get()
>>>>>> and prio_set() callbacks - so it is clear that this pin cannot be
>>>>>> part of the automatic selection.
>>>>>>
>>>>>> Ivan
>>>>>
>>>>> There is a difference between particular HW and API capabilities; with
>>>>> the proposed API we would disallow the possibility of such an
>>>>> implementation for existing HW variants.
>>>>>
>>>>> DPLL NCO MODE would allow that, but as pointed out here by Ivan and by
>>>>> Jiri in the other email, it would also require extra implementation
>>>>> for some configuration - device level phase/ffo handling.
>>>>>
>>>>> To summarize it all, I don't have a simple solution for it.
>>>>>
>>>>> First thing that comes to my mind is to combine both approaches.
>>>>> Make it possible for AUTOMATIC mode to also set "CONNECTED" state
>>>>> on certain kind of "OVERRIDE" pins, where it could be determined by
>>>>> the type of PIN and embed that logic into the DPLL subsystem.
>>>>
>>>> The possible states for particular pins are now handled at the driver
>>>> level, so the driver decides if the requested state is correct or not.
>>>> So it could be easy to implement this.
>>>>
>>>> For auto mode allowed states:
>>>> - input references: selectable / disconnected
>>>> - nco pin: connected / disconnected
>>>>
>>>>> Basically, if a driver registers such an NCO pin it would always be
>>>>> selected manually, and in such a case all the other pins go to the
>>>>> disconnected state while the DPLL mode is also "OVERRIDE" or
>>>>> something like it.
>>>>
>>>> I would leave this decision on the driver level... Imagine the
>>>> potential HW that would allow switching to NCO mode if there is no
>>>> valid input reference.
>>>>
>>>> Example:
>>>>
>>>> REF0 (prio 0) -> +------+ -> OUT0
>>>> REF1 (prio 1) -> | DPLL | -> ...
>>>> NCO  (prio 2) -> +------+ -> OUTn
>>>>
>>>> Such HW would prefer REF0 or REF1 and lock to one of them if they are
>>>> qualified. But if they are NOT, then it switches to NCO mode.
> 
> Now you said yourself "NCO mode" ... I agree that it would be a mode in
> that case. Where instead of running on regular/built in XO dpll would run
> on NCO and user could select it, and this would be addition to regular
> behavior.
> 
> I also agree that the pin approach might be better/easier to use, assuming
> frequency offset for all the outputs given dpll drives, it makes more sense
> to have it configurable on input side.

+1

>>>>
>>>> In this situation the relevant driver would allow to configure priority
>>>> and state 'selectable' for this NCO pin.
>>>>
>>>>> Perhaps the pin type could include OVERRIDE in its name to make it
>>>>> less confusing; it would also need some extra documentation.
>>>>>
>>>>> Thoughts?
>>>> I think _INT_ is ok. In the case of TYPE_INT_OSCILLATOR it is also
>>>> obvious that it is not a standard input reference.
>>>>
>>>> Jiri, Vadim, Arek, thoughts?
>>>
>>> I agree with you, the driver should have the flexibility to implement
>>> this according to his/hw's needs/capabilities. If it implements prio
>>> selection in AUTO mode, let it have it. If it implements manual NCO pin
>>> selection in AUTO mode using connected/disconnected override, let it
>>> have it.
> 
> I don't know 'current' HW that is capable of using AUTO mode as a part of
> HW-based priority source selection and use such NCO input..
> But as already explained above, this is special mode of regular XO, which
> allows DPLL's output frequency offset configuration.

Let's keep this available for potential future HW. I can imagine a
situation where a user will prefer an automatic switch to NCO mode
if there is no qualified input reference - automatic switch means
that the HW will support this (not emulated by the driver).

>>>
>>> Moreover, I actually like the "override" capability for pins in AUTO
>>> mode in general. It may be handy for other usecases as well.
>>>
>> Arek? Vadim?
>>
>> Thanks,
>> Ivan
> 
> Agree, 'override' capability of a pin would be the way to go for this and
> other similar further cases.
> 
> I believe a single approach on this would be best, I mean if AUTO mode
> needs a capability, to switch from regular behavior to 'OVERRIDE', and
> 'OVERRIDE' is only pin capability that allows such behavior for AUTO
> mode, then similar approach should be used on MANUAL mode, to make
> userspace know that such pin is always available to set "CONNECTED"
> and make the userspace implementation consistent on enabling it no matter
> if AUTO or MANUAL mode dpll.

Proposal:
1) new pin capability
    - name: state-connected-override
    - doc: pin state can be changed to connected in any DPLL mode

2) new NCO pin type to switch the DPLL to NCO mode when connected

3) automatic-only DPLL
    - should expose NCO pin with state-connected-override capability

4) manual-only DPLL
   - does not need to expose NCO pin with state-connected-override cap

5) dual-mode DPLL (supporting mode switching)
   - if it exposes NCO pin with the override cap then it has to support
     switching to NCO mode directly from AUTO mode
   - if it does not expose NCO pin with the override cap then a user MUST
     switch the DPLL mode from AUTO to MANUAL to be able to make NCO
     pin connected to the DPLL
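
Items 3-5 above can be read as one state check: may userspace set the NCO
pin to "connected" without first changing the DPLL mode? A minimal sketch
in plain user-space C, assuming hypothetical names (sketch_mode,
sketch_nco_pin, sketch_nco_connect_allowed) that are not part of the kernel
DPLL API; it only restates the proposal and is not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; these are not the kernel DPLL enums. */
enum sketch_mode { SKETCH_MODE_MANUAL, SKETCH_MODE_AUTOMATIC };

struct sketch_nco_pin {
	bool exposed;                  /* the device registers an NCO pin at all */
	bool state_connected_override; /* proposed capability from item 1 */
};

/*
 * Returns true if the NCO pin may be set to "connected" while the DPLL is
 * in the given mode. In MANUAL mode any exposed NCO pin qualifies; in
 * AUTOMATIC mode the proposed override capability is required, otherwise
 * the user has to switch the DPLL to MANUAL first.
 */
static bool sketch_nco_connect_allowed(enum sketch_mode mode,
				       const struct sketch_nco_pin *pin)
{
	assert(pin);

	if (!pin->exposed)
		return false;
	if (mode == SKETCH_MODE_MANUAL)
		return true;
	return pin->state_connected_override;
}
```

Under these assumptions an automatic-only DPLL (item 3) is usable only with
the capability set, while a dual-mode DPLL without it (second bullet of
item 5) gets a rejection in AUTO mode and has to switch to MANUAL first.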

Vadim, Jiri, Arek - thoughts?

Thanks,
Ivan


^ permalink raw reply

* Re: [PATCH net 0/6] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-19 16:42 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

On 6/18/26 6:22 PM, Fernando Fernandez Mancera wrote:
> While working on a different IPv6 patch series I have spotted multiple
> minor bugs around sysctl error handling and notifications. In general,
> they are not serious issues.
> 
> In addition, there is one more issue in forwarding sysctl as it does not
> check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
> of this series and I am aiming it at the net-next tree once it re-opens.
> 
> Fernando Fernandez Mancera (6):
>    ipv6: fix error handling in disable_ipv6 sysctl
>    ipv6: fix error handling in ignore_routes_with_linkdown sysctl
>    ipv6: fix error handling in forwarding sysctl
>    ipv6: fix error handling in disable_policy sysctl
>    ipv6: reset value and position for proxy_ndp sysctl restart
>    ipv6: fix missing notification for ignore_routes_with_linkdown
> 
>   net/ipv6/addrconf.c | 35 +++++++++++++++++++++++++++--------
>   1 file changed, 27 insertions(+), 8 deletions(-)
> 

So I have checked the feedback from sashiko.dev [1]. In general, it is
valid, but I have doubts about this one. While the race is real, I am not
so sure we actually care about this scenario. In any case, this is a
pattern found throughout kernel code and it would require fixes everywhere
(IPv4/IPv6/other places (?)).

The easiest thing to do is what Nicolas suggested: taking the lock
before parsing the value, i.e. before calling proc_dointvec().

I also wonder if I should fix them in individual commits or together
in the same commit, as they have different Fixes tags. Or maybe we can aim
this last part at net-next. Thoughts?

Please, see the feedback attached below.

Thanks!

[1] https://sashiko.dev/#/patchset/20260618162225.4588-1-fmancera%40suse.de

 > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
 > index 8ff015975e279..1cfb223476bd6 100644
 > --- a/net/ipv6/addrconf.c
 > +++ b/net/ipv6/addrconf.c
[ ... ]
 > @@ -6493,8 +6494,12 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 >  	if (write && old != new) {
 >  		struct net *net = ctl->extra2;
 >
 > -		if (!rtnl_net_trylock(net))
 > +		if (!rtnl_net_trylock(net)) {
 > +			/* Restore the original values before restarting */
 > +			*valp = old;
 > +			*ppos = pos;
 >  			return restart_syscall();
 > +		}
 >

Does this rollback introduce a race condition that can silently discard
concurrent writes?
The sysctl core does not serialize concurrent writes to the same sysctl
parameter, and proc_dointvec() modifies the global sysctl memory pointed to
by ctl->data directly.
If Thread A and Thread B concurrently write to proxy_ndp, and Thread A fails
to acquire the RTNL lock, setting *valp = old will unconditionally reset the
global memory back to what Thread A observed before its write. This seems
like it would clobber Thread B's concurrent update.
If Thread B successfully acquires the RTNL lock, it will evaluate its local
old != new stack variables as true and emit a RTM_NEWNETCONF netlink
notification. However, the notification will read the corrupted *valp
memory, leaving the emitted event and internal sysctl state out of sync.
Could this be avoided by duplicating the ctl_table, pointing the data field
to a local stack variable, calling proc_dointvec() on the local variable,
and only updating the global *valp after successfully acquiring the lock,
similar to how it is handled in addrconf_sysctl_forward()?
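
The "parse into a local, commit only under the lock" idea from the feedback
above can be sketched in plain user-space C. Everything here is a stand-in
and an assumption: sketch_shared_val plays the role of the value behind
ctl->data, the lock_acquired flag stands for the result of
rtnl_net_trylock(), and strtol() replaces proc_dointvec(). It is not the
addrconf code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the shared sysctl value (what ctl->data would point at). */
static int sketch_shared_val;

/* Hypothetical result codes, loosely modelled on "restart the syscall". */
enum sketch_result { SKETCH_OK, SKETCH_RESTART, SKETCH_EINVAL };

/*
 * Parse the input into a local variable first and write the shared value
 * only once the lock is held. If the lock is not available, the shared
 * value was never touched, so there is nothing to roll back and no
 * concurrent writer's update can be clobbered by a restore.
 * Range checking of the parsed value is omitted in this sketch.
 */
static enum sketch_result sketch_write(const char *buf, bool lock_acquired)
{
	char *end;
	long parsed = strtol(buf, &end, 10);

	if (end == buf || *end != '\0')
		return SKETCH_EINVAL;
	if (!lock_acquired)
		return SKETCH_RESTART;

	sketch_shared_val = (int)parsed;
	return SKETCH_OK;
}
```

Compared with restoring *valp = old on the failure path, this keeps the
window in which the shared value is modified entirely inside the locked
region.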

^ permalink raw reply

* Re: [PATCH] net: add sock_open() for unified socket creation
From: Al Viro @ 2026-06-19 16:34 UTC (permalink / raw)
  To: Alex Goltsev; +Cc: davem, netdev, linux-kernel
In-Reply-To: <CAEKmD4JfM5GWSiRMUn6NK+kKFeyXA8i3A9gthDz3hVKFcR1YDA@mail.gmail.com>

On Fri, Jun 19, 2026 at 01:35:56PM +0300, Alex Goltsev wrote:
> > What's the point (and why not make it inline, while we are at it)?
> 
> > Are there really callers that would pass a non-constant value as the last argument,
> > and if so, what are they doing next?
> 
> 
> As for `inline`: in this case, it would have no practical significance.
> The compiler already treats a simple inline function as a regular
> symbol within the `EXPORT_SYMBOL` context, whereas a static inline
> function (the standard kernel template for helper functions) would
> completely break the export to the LKM.

How so?  All three underlying primitives are exported, so static inline
in whatever include/*/*.h you put it in would work just fine.

> As for the last argument, yes, today it is usually a constant,
> but that’s not the point. The purpose of the enumeration is to provide
> a unified, explicit control interface. It’s important that if, in the
> future, someone adds a new type of socket creation, existing calling
> programs won’t panic or throw a compilation error, but will smoothly
> fall back to the default case and return -EINVAL, which is a safe
> failure mode.

Collapsing several functions together is worthless unless the combination
can be _used_ other than a (questionable) syntax sugar.  kmalloc() can;
something that would only result in trading multiple identifiers for
functions for multiple identifiers for "which function to call" is not
an improvement.

^ permalink raw reply

* RE: Ethtool : PRBS feature
From: Das, Shubham @ 2026-06-19 16:26 UTC (permalink / raw)
  To: Alexander H Duyck, Andrew Lunn, lee@trager.us
  Cc: netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
	Chintalapalle, Balaji, Lindberg, Magnus,
	niklas.damberg@ericsson.com
In-Reply-To: <06d8c98da24e80d148ede4e933bb621c5515a7a2.camel@gmail.com>

> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?

In our case PRBS functionality is implemented in the PHY firmware at the PCS (TX/RX) + PMA (FEC Error Injection) layer.


Andrew, Alexander, Lee,

The host driver does not directly access any registers but asks the PHY FW to manage PRBS on its behalf.
Because of this, the implementation does not naturally fit the traditional PHYLIB model, where Linux PHY drivers directly manage PHY registers.
The functionality is closer to a firmware-managed service exposed through the PCIe driver, so we thought the right place would be to extend ethtool.

We come from the Ethernet PHY field, and I believe that attempting to generalize PRBS for generic PHYs to accommodate all bus types might distract us.
The existing ethtool user application interface will give a quick start for Ethernet PHY PRBS management.
When we need other buses or have another model implementation, we can abstract the commonalities into a framework.

Should we proceed with implementing "ethtool --phy-test"?


> -----Original Message-----
> From: Alexander H Duyck <alexander.duyck@gmail.com>
> Sent: 16 June 2026 21:45
> To: Das, Shubham <shubham.das@intel.com>; Andrew Lunn <andrew@lunn.ch>
> Cc: netdev@vger.kernel.org; mkubecek@suse.cz; D H, Siddaraju
> <siddaraju.dh@intel.com>; Chintalapalle, Balaji <balaji.chintalapalle@intel.com>
> Subject: Re: Ethtool : PRBS feature
> 
> On Tue, 2026-06-16 at 12:14 +0000, Das, Shubham wrote:
> > Hi Andrew,
> >
> > Thanks for the feedback.
> >
> > Yes, for multi-lane ports we can accept the lane number as an argument like:
> >
> > ethtool --phy-test eth1 lane 0 tx-prbs prbs7
> > ethtool --phy-test eth2 lane 0 rx-prbs prbs7
> >
> > We referred to "Lee Trager's" "Open-Source Tooling for PHY Management and
> > Testing" session:
> > https://netdevconf.info/0x19/sessions/talk/open-source-tooling-for-phy-management-and-testing.html?.
> > We have been trying to reach "Lee Trager" to seek more input, latest update on
> > the approach and understand if there is a parallel effort in active so we can
> > collaborate.
> > If you can, please help me connect with "Lee Trager" and others who expressed
> > interest in Ethernet PRBS. We are happy to align and start implementation.
> >
> 
> You aren't going to have much luck if you are trying to reach out via his Meta
> address as he has moved onto Nvidia so he is no longer working on the fbnic
> driver.
> 
> As far as the work done most of it was internal and making use of debugfs. I don't
> believe any of the work for fbnic began to approach the suggested methods for
> upstreaming the feature as Lee had been pulled into other efforts.
> 
> > About standardizing across other buses like PCIe and USB, I had a quick
> > discussion with our internal designers, but I didn't observe any interest
> > in such SW-level config knobs.
> > Looks like Ethernet has clear interest and we are joining that Ethernet
> > PRBS community too.
> 
> I think it largely depends on what your implementation looks like. The point being
> made was that many of the SerDes PHYs out there are capable of use in multiple
> applications. So instead of being a networking device you would be looking at a
> SerDes PHY such as those in "/drivers/phy/".
> 
> Also do you know what layer in the PHY you are injecting this PRBS at?
> I would be curious if this is PCS or at the PMD level?
> 
> If you are referring to the PCS level then yes, it would make sense to have it in the
> networking subsystem as the PCS at this point is more a netdev specific set of
> drivers, see "/drivers/net/pcs/".
> 
> In the case of the PMD that is where things get a bit more interesting.
> There is an IEEE c45 register definition that includes PRBS testing registers,
> however in the case of our implementation the PMD doesn't follow that
> specification and follows more the "/drivers/phy/" model.
> 
> > Ethernet PRBS configuration and diagnostics support is well established
> > and already widely used in existing Ethernet SERDES deployments.
> > We think Ethernet is the most natural starting point within netdev, as
> > it aligns with current driver practice and existing validation workflows.
> 
> The problem is many of these parts used as an Ethernet Serdes PMD are really a
> multiuse part. So for example in the case of the hardware in FBNIC we use the
> same part on the Ethernet PHY as we do for the PCIe Gen5 PHY.
> 
> The complication in our case is that both are buried behind our FW due to the fact
> that both are shared between slices. However for testing purposes and such we
> could look at disabling the odd slices to essentially unshare the hardware if you
> need another platform to test something like this with.

^ permalink raw reply

* RE: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Selvamani Rajagopal @ 2026-06-19 16:05 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
	Shuah Khan, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linux-doc@vger.kernel.org, Jerry Ray
In-Reply-To: <ajVKfBKPuNk9zN7b@monoceros>


Thanks for your feedback. I will take care of all three comments.

> -----Original Message-----
> Subject: Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500
> MAC-PHY
> 
> On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> > +static const struct of_device_id s2500_of_match[] = {
> > +	{ .compatible = "onnn,s2500" },
> > +	{}
> 
> s/{}/{ }/
> 
> > +};
> > +
> > +static const struct spi_device_id s2500_ids[] = {
> > +	{ "s2500" },
> > +	{}
> > +};
> 
> Please make this:
> 
> static const struct spi_device_id s2500_ids[] = {
> 	{ .name = "s2500" },
> 	{ }
> };
> 
> > +MODULE_DEVICE_TABLE(spi, s2500_ids);
> > +
> > +static struct spi_driver s2500_driver = {
> > +	.driver = {
> > +		.name	= DRV_NAME,
> > +		.of_match_table = s2500_of_match,
> > +	},
> > +	.probe		= s2500_probe,
> > +	.remove		= s2500_remove,
> > +	.id_table	= s2500_ids,
> 
> Tastes are different, but the idea to align = is usually screwed by
> follow up patches. Here it's broken from the start. If you ask me: Use a
> single space before each =.
> 
> > +};
> > +
> > +module_spi_driver(s2500_driver);
> 
> Usually there is no empty line between the driver struct and the macro
> registering it.
> 
>
> Best regards
> Uwe


^ permalink raw reply

* Wireguard head of line blocking when CPUs saturate
From: Toke Høiland-Jørgensen @ 2026-06-19 15:56 UTC (permalink / raw)
  To: wireguard; +Cc: netdev

Hey everyone

I'm running Wireguard on my main gateway, which is a not-super-high
powered ARM box with eight cores (based on the NXP LS1088A SoC). The box
does, however, also have eight hardware queues for its networking, which
means regular network traffic can be spread nicely across the cores.

However, the per-core performance is limited, making it pretty trivial
to saturate a single core by just running a fat TCP flow through it. And
when this happens, Wireguard traffic just... stalls. I.e., no traffic
gets through the Wireguard interface until the (unrelated) flow
saturating one of the cores subsides.

I suspect what happens is that Wireguard spreads out traffic to all
cores for encryption, but has to wait for the respective CPUs to finish
encrypting the packets in order before they can actually be transmitted.
And because one CPU is now suddenly saturated in softirq context, the
Wireguard work queue never gets a chance to run on that CPU, stalling TX
progress for the Wireguard device entirely.
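
The suspected mechanism can be illustrated with a toy model in plain C. It
assumes, as hypothesised above, that packets are handed round-robin to the
CPUs for encryption and must be transmitted in their original order; the
names sketch_tx_ready and sketch_encrypt_all_but are made up for this
example, and nothing here is taken from the WireGuard code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * done[i] is nonzero once packet i has been encrypted. Because packets
 * must leave in order, only the leading run of finished packets can be
 * transmitted. Returns the length of that run.
 */
static size_t sketch_tx_ready(const int *done, size_t n)
{
	size_t i = 0;

	assert(done);
	while (i < n && done[i])
		i++;
	return i;
}

/*
 * Packet i is owned by CPU (i % ncpus). Mark every packet as encrypted
 * except the ones owned by stalled_cpu, whose worker never gets to run.
 */
static void sketch_encrypt_all_but(int *done, size_t n, size_t ncpus,
				   size_t stalled_cpu)
{
	for (size_t i = 0; i < n; i++)
		done[i] = (i % ncpus) != stalled_cpu;
}
```

In this model, with 16 packets on 8 CPUs and CPU 3 stalled, 14 packets are
encrypted but sketch_tx_ready() returns 3: everything behind the first
packet owned by the stalled CPU waits, which would match the observed
full stall rather than a 1/8 throughput drop.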

I'm sending this message to (a) see if anyone else is seeing the same
kind of stalling, and (b) to get input on whether the explanation
outlined above seems plausible. And, in the case of affirmative answers
to both (a) and (b), to hopefully start a discussion on what to do about
this :)

-Toke

^ permalink raw reply
