Netdev List


* Re: [PATCH v2 0/6] selftests: net: multithread + rss_multiqueue support for iou-zcrx
From: Jakub Kicinski @ 2026-04-18 16:49 UTC (permalink / raw)
  To: Juanlu Herrero; +Cc: dw, netdev
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

On Fri, 17 Apr 2026 09:49:46 -0700 Juanlu Herrero wrote:
> Add multithread support to the iou-zcrx selftest, plus a new
> rss_multiqueue Python variant that exercises multi-queue zero-copy
> receive on a single listening socket with NAPI-ID-based dispatch.

## Form letter - net-next-closed

We have already submitted our pull request with net-next material for v7.1,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.

Please repost when net-next reopens after Apr 27th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: defer
pv-bot: closed


* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Marek Vasut @ 2026-04-18 16:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <20260416104818._EDbo9hA@linutronix.de>

On 4/16/26 12:48 PM, Sebastian Andrzej Siewior wrote:
> On 2026-04-16 11:26:00 [+0200], Marek Vasut wrote:
>>> memory allocation. Therefore I am saying this backtrace is from an older
>>> kernel.
>>
>> I actually did update the backtrace in V3 with the one from next 20260413
>> that contained b44596ffe1b4 ("ARM: Allow to enable RT") from
>> stable-rt/v6.12-rt-rebase branch [1] .
>>
>> I think I misunderstood the usage of "softirq is raised" vs. "softirq is
>> invoked" above . Is it possible that there was an already raised softirq
>> before the threaded IRQ handler was invoked, and __netdev_alloc_skb() is
>> what invoked that softirq ?
> 
> It is not impossible. Something needs to netif_wake_queue() and
> ks8851_irq() must only report IRQ_RXI (not IRQ_TXI). Then it can happen.
> But usually the driver "stops" the queue if it can't process any new
> packets and resumes it once a packet has been sent so it has room again.
This driver's .start_xmit is very simple: if there is space in the 6 kiB
TX FIFO, the packet is written into it; otherwise .start_xmit returns
NETDEV_TX_BUSY. There does not seem to be any
netif_{start,stop,wake}_queue() call in the .start_xmit path.
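
The behaviour described above can be sketched in user space as follows. This is only an illustration, not the ks8851 code: `struct tx_fifo`, `sketch_start_xmit()` and the `SKETCH_*` names are hypothetical, and only the 6 kiB FIFO size and the "busy" return come from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's NETDEV_TX_OK / NETDEV_TX_BUSY. */
enum sketch_tx { SKETCH_TX_OK = 0, SKETCH_TX_BUSY = 1 };

#define SKETCH_FIFO_SIZE (6 * 1024)	/* 6 kiB TX FIFO, as described above */

struct tx_fifo {
	size_t used;	/* bytes currently queued in the FIFO */
};

/*
 * Queue the packet if it fits, otherwise report "busy".
 * Note that nothing here stops or wakes the transmit queue.
 */
static enum sketch_tx sketch_start_xmit(struct tx_fifo *fifo, size_t pkt_len)
{
	assert(fifo != NULL && fifo->used <= SKETCH_FIFO_SIZE);

	if (pkt_len > SKETCH_FIFO_SIZE - fifo->used)
		return SKETCH_TX_BUSY;

	fifo->used += pkt_len;
	return SKETCH_TX_OK;
}
```

With no queue stop/wake calls in this path, the caller simply sees "busy" until the FIFO drains.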


* [PATCH net] seg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode
From: Andrea Mayer @ 2026-04-18 16:28 UTC (permalink / raw)
  To: davem, dsahern, edumazet, kuba, pabeni, horms
  Cc: anton.makarov11235, stefano.salsano, netdev, linux-kernel,
	Andrea Mayer, stable

When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the
condition in seg6_build_state() that excludes L2 encap modes from
setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for
the new mode.
As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output()
on the output path, where the packet is silently dropped because
skb_mac_header_was_set() fails on L3 packets.

Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP.

Fixes: 13f0296be8ec ("seg6: add support for SRv6 H.L2Encaps.Red behavior")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 net/ipv6/seg6_iptunnel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 97b50d9b1365..9b64343ebad6 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -746,7 +746,8 @@ static int seg6_build_state(struct net *net, struct nlattr *nla,
 	newts->type = LWTUNNEL_ENCAP_SEG6;
 	newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
 
-	if (tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP)
+	if (tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP &&
+	    tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP_RED)
 		newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
 
 	newts->headroom = seg6_lwt_headroom(tuninfo);
-- 
2.20.1
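
The intent of the changed condition can be sketched as a small stand-alone predicate. The enum values and flag constants below are hypothetical stand-ins (the real definitions live in the kernel headers); the sketch only shows that both L2 modes are now excluded from output redirection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SEG6_IPTUN_MODE_* values. */
enum sketch_mode {
	MODE_INLINE,
	MODE_ENCAP,
	MODE_L2ENCAP,
	MODE_ENCAP_RED,
	MODE_L2ENCAP_RED,
};

#define SKETCH_INPUT_REDIRECT	0x1u	/* stand-in for LWTUNNEL_STATE_INPUT_REDIRECT */
#define SKETCH_OUTPUT_REDIRECT	0x2u	/* stand-in for LWTUNNEL_STATE_OUTPUT_REDIRECT */

static bool sketch_is_l2_mode(enum sketch_mode mode)
{
	return mode == MODE_L2ENCAP || mode == MODE_L2ENCAP_RED;
}

/* Input redirect is always set; output redirect only for non-L2 modes. */
static unsigned int sketch_build_flags(enum sketch_mode mode)
{
	unsigned int flags = SKETCH_INPUT_REDIRECT;

	assert(mode <= MODE_L2ENCAP_RED);

	if (!sketch_is_l2_mode(mode))
		flags |= SKETCH_OUTPUT_REDIRECT;

	return flags;
}
```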



* [PATCH nf] netfilter: xt_TCPMSS: check skb_dst before path-MTU clamping
From: Weiming Shi @ 2026-04-18 16:30 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Phil Sutter, Simon Horman, netfilter-devel, coreteam, netdev,
	Xiang Mei, Weiming Shi

When TCPMSS with CLAMP_PMTU is used via nft_compat in a non-base
chain, par->hook_mask is set to 0, bypassing the checkentry hook
validation. The target can then run at PRE_ROUTING where skb_dst is
NULL, causing a null-ptr-deref in tcpmss_mangle_packet():

 KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
 RIP: 0010:tcpmss_mangle_packet (include/net/dst.h:219 net/netfilter/xt_TCPMSS.c:105)
  tcpmss_tg4 (net/netfilter/xt_TCPMSS.c:202)
  nft_target_eval_xt (net/netfilter/nft_compat.c:87)
  nft_do_chain (net/netfilter/nf_tables_core.c:287)
  nf_hook_slow (net/netfilter/core.c:623)

Check skb_dst() for NULL before calling dst_mtu().

Fixes: 493618a92c6a ("netfilter: nft_compat: fix hook validation for non-base chains")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
 net/netfilter/xt_TCPMSS.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 116a885adb3c..79b5e475e23e 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -102,7 +102,12 @@ tcpmss_mangle_packet(struct sk_buff *skb,
 	if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
 		struct net *net = xt_net(par);
 		unsigned int in_mtu = tcpmss_reverse_mtu(net, skb, family);
-		unsigned int min_mtu = min(dst_mtu(skb_dst(skb)), in_mtu);
+		unsigned int min_mtu;
+
+		if (!skb_dst(skb))
+			return -1;
+
+		min_mtu = min(dst_mtu(skb_dst(skb)), in_mtu);
 
 		if (min_mtu <= minlen) {
 			net_err_ratelimited("unknown or invalid path-MTU (%u)\n",
-- 
2.43.0
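
The guard added by the patch can be illustrated with a user-space sketch. `struct sketch_dst` and `sketch_min_mtu()` are hypothetical names; the sketch only mirrors the idea of bailing out with -1 before the route's MTU is read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a route (dst) entry. */
struct sketch_dst {
	unsigned int mtu;
};

/*
 * Compute min(route MTU, in_mtu) into *out.
 * Returns -1 when no route is attached, instead of dereferencing NULL.
 */
static int sketch_min_mtu(const struct sketch_dst *dst, unsigned int in_mtu,
			  unsigned int *out)
{
	assert(out != NULL);	/* caller bug, not a packet-dependent condition */

	if (!dst)
		return -1;

	*out = dst->mtu < in_mtu ? dst->mtu : in_mtu;
	return 0;
}
```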



* Re: [PATCH v5 net] nfc: hci: fix out-of-bounds read in HCP header parsing
From: Simon Horman @ 2026-04-18 16:30 UTC (permalink / raw)
  To: Ashutosh Desai
  Cc: netdev, kuba, edumazet, davem, pabeni, stable, linux-kernel
In-Reply-To: <20260416051522.4154698-1-ashutoshdesai993@gmail.com>

On Thu, Apr 16, 2026 at 05:15:22AM +0000, Ashutosh Desai wrote:
> nfc_hci_recv_from_llc() and nci_hci_data_received_cb() cast skb->data
> to struct hcp_packet and read the message header byte without checking
> that enough data is present in the linear sk_buff area. A malicious NFC
> peer can send a 1-byte HCP frame that passes through the SHDLC layer
> and reaches these functions, causing an out-of-bounds heap read.
> 
> Fix this by adding pskb_may_pull() before each cast to ensure the full
> 2-byte HCP header is pulled into the linear area before it is accessed.
> 
> Fixes: 8b8d2e08bf0d ("NFC: HCI support")
> Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
> ---
> V4 -> V5: fix whitespace damage
> V3 -> V4: add Fixes tags
> V2 -> V3: drop redundant checks from nfc_hci_msg_rx_work/nci_hci_msg_rx_work;
>           remove incorrect Suggested-by tag
> V1 -> V2: use pskb_may_pull() instead of skb->len check
> 
> v4: https://lore.kernel.org/netdev/177614425081.3600288.2536320552978506086@gmail.com/
> v3: https://lore.kernel.org/netdev/20260413024329.3293075-1-ashutoshdesai993@gmail.com/
> v2: https://lore.kernel.org/netdev/20260409150825.2217133-1-ashutoshdesai993@gmail.com/
> v1: https://lore.kernel.org/netdev/20260408223113.2009304-1-ashutoshdesai993@gmail.com/
> 
>  net/nfc/hci/core.c | 5 +++++
>  net/nfc/nci/hci.c  | 5 +++++
>  2 files changed, 10 insertions(+)

Reviewed-by: Simon Horman <horms@kernel.org>

Review of this patch at Sashiko.dev flags a number of related problems in
this code. I believe none of them were introduced by this patch, and that
they can all be treated as areas for possible follow-up.
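
For readers unfamiliar with the pattern in the quoted commit message, here is a user-space sketch of "check the length before reading the header". It models only the length check: a flat buffer has no paged area, so the linearisation that pskb_may_pull() also performs is not shown. `sketch_read_msg_header()` and `SKETCH_HCP_HEADER_LEN` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HCP_HEADER_LEN 2	/* 2-byte HCP header, per the commit message */

/*
 * Read the message header byte (second byte of the frame).
 * Returns -1 for frames too short to contain the full header.
 */
static int sketch_read_msg_header(const uint8_t *data, size_t len,
				  uint8_t *msg_hdr)
{
	assert(data != NULL && msg_hdr != NULL);

	if (len < SKETCH_HCP_HEADER_LEN)
		return -1;	/* a 1-byte frame is rejected here */

	*msg_hdr = data[1];
	return 0;
}
```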



* Re: [PATCH net 2/3] octeontx2-af: npc: cn20k: Drop debugfs_create_file() error checks in init
From: Simon Horman @ 2026-04-18 16:20 UTC (permalink / raw)
  To: Ratheesh Kannoth
  Cc: netdev, linux-kernel, sgoutham, davem, edumazet, kuba, pabeni,
	andrew+netdev, dan.carpenter, Dan Carpenter
In-Reply-To: <20260416035352.333808-3-rkannoth@marvell.com>

On Thu, Apr 16, 2026 at 09:23:51AM +0530, Ratheesh Kannoth wrote:
> debugfs is not intended to be checked for allocation failures the way
> other kernel APIs are: callers should not fail probe or subsystem init
> because a debugfs node could not be created, including when debugfs is
> disabled in Kconfig.  Replacing NULL checks with IS_ERR() checks is
> similarly wrong for optional debugfs.
> 
> Remove dentry checks and -EFAULT returns from npc_cn20k_debugfs_init().
> https://staticthinking.wordpress.com/2023/07/24/debugfs-functions-are-not-supposed-to-be-checked/
> 
> CC: Dan Carpenter <error27@gmail.com>
> Link: https://lore.kernel.org/netdev/adjNGPWKMOk3KgWL@stanley.mountain/
> Fixes: 528530dff56b ("octeontx2-af: npc: cn20k: add debugfs support")
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>

Reviewed-by: Simon Horman <horms@kernel.org>



* Re: [PATCH net 3/3] octeontx2-af: npc: cn20k: Return error when defrag rollback free fails
From: Simon Horman @ 2026-04-18 16:18 UTC (permalink / raw)
  To: rkannoth
  Cc: 'Simon Horman', netdev, linux-kernel, sgoutham, davem,
	edumazet, kuba, pabeni, andrew+netdev, dan.carpenter, error27
In-Reply-To: <20260416035352.333808-4-rkannoth@marvell.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
octeontx2-af: npc: cn20k: Return error when defrag rollback free fails

This commit attempts to fix a bug where an error code wasn't returned
when a rollback failure occurred during partial allocation in
npc_defrag_alloc_free_slots(). It sets rc to -EFAULT if the
__npc_subbank_free() call fails.

Link: https://lore.kernel.org/netdev/adjNJEpILRZATB2N@stanley.mountain/

> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
> index 2fcd0ee2b1e10..df192729ac1d4 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
> @@ -3541,6 +3541,7 @@ static int npc_defrag_alloc_free_slots(struct rvu *rvu,
>  			dev_err(rvu->dev,
>  				"%s: Error to free mcam idx=%u\n",
>  				__func__, save[i]);
> +			rc = -EFAULT;
>  			break;
>  		}
>  	}

While this ensures an error is returned if the rollback fails, does the
function still incorrectly return success when the rollback succeeds?

Earlier in the function, if the allocation cannot fulfill the requested slots,
the code jumps to the rollback loop:

	if (alloc_cnt1 + alloc_cnt2 != cnt) {
		...
		goto fail_free_alloc;
	}

Inside the fail_free_alloc loop, the rc variable is overwritten by the
subbank lookup:

fail_free_alloc:
	for (i = 0; i < alloc_cnt1 + alloc_cnt2; i++) {
		rc = npc_mcam_idx_2_subbank_idx(rvu, save[i], &sb, &sb_off);

If the subbank lookup and the subsequent __npc_subbank_free() succeed,
rc will be 0.

Since the loop finishes normally without hitting the newly added rc = -EFAULT
path, does the function end up returning 0 at the end of the fail_free_alloc
block?

Could this cause the caller to incorrectly assume that all slots were
successfully allocated, leading to operations on the freed or uninitialized
slots in the save array? Using a separate variable for the loop operations
or explicitly setting an error code like -ENOSPC before the goto might help
preserve the failure state.
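
The suggestion in the last paragraph can be sketched as below. This is a simplified user-space model, not the driver code: `sketch_alloc_slots()` and `sketch_free_slot()` are hypothetical, and the "allocation" is reduced to a counter. The point is that the failure code is set before the rollback loop and the loop uses its own variable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-slot free; returns 0 on success. */
static int sketch_free_slot(int idx)
{
	return idx < 0 ? -EFAULT : 0;
}

/* Try to allocate cnt slots when only `available` exist. */
static int sketch_alloc_slots(int cnt, int available, int *save)
{
	int got = available < cnt ? available : cnt;
	int err, i;

	assert(save != NULL && cnt >= 0 && available >= 0);

	for (i = 0; i < got; i++)
		save[i] = i;	/* pretend slot i was allocated */

	if (got == cnt)
		return 0;

	err = -ENOSPC;		/* record the failure before rolling back */
	for (i = 0; i < got; i++) {
		/* separate variable: a successful free must not clear err */
		int free_rc = sketch_free_slot(save[i]);

		if (free_rc) {
			err = free_rc;
			break;
		}
	}
	return err;
}
```

With this shape, a clean rollback still returns -ENOSPC to the caller rather than 0.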


* Re: [PATCH net 1/3] octeontx2-af: npc: cn20k: Handle npc_mcam_idx_2_key_type() failures
From: Simon Horman @ 2026-04-18 16:17 UTC (permalink / raw)
  To: rkannoth
  Cc: 'Simon Horman', netdev, linux-kernel, sgoutham, davem,
	edumazet, kuba, pabeni, andrew+netdev, dan.carpenter, error27
In-Reply-To: <20260416035352.333808-2-rkannoth@marvell.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
octeontx2-af: npc: cn20k: Handle npc_mcam_idx_2_key_type() failures

This commit adds error checking to npc_mcam_idx_2_key_type() calls.
It returns early from several void functions to prevent programming
or interpreting MCAM state with an invalid key type.

Link: https://lore.kernel.org/netdev/adiQJvuKlEhq2ILx@stanley.mountain/

> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
> index 7291fdb89b03f..2fcd0ee2b1e10 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c

[ ... ]

> @@ -1052,10 +1054,12 @@ void npc_cn20k_config_mcam_entry(struct rvu *rvu, int blkaddr, int index,
>  	int kw = 0;
>  	u8 kw_type;
>  
> +	if (npc_mcam_idx_2_key_type(rvu, index, &kw_type))
> +		return;
> +
>  	/* Disable before mcam entry update */
>  	npc_cn20k_enable_mcam_entry(rvu, blkaddr, index, false);
>  

Does returning early from a void function here cause state desynchronization?

For example, in mailbox handlers like
rvu_mbox_handler_npc_cn20k_mcam_write_entry(), the caller assumes success
and returns 0, but the hardware programming would be silently aborted.

[ ... ]

> @@ -1132,8 +1136,13 @@ void npc_cn20k_copy_mcam_entry(struct rvu *rvu, int blkaddr, u16 src, u16 dest)
>  
>  	dbank = npc_get_bank(mcam, dest);
>  	sbank = npc_get_bank(mcam, src);
> -	npc_mcam_idx_2_key_type(rvu, src, &src_kwtype);
> -	npc_mcam_idx_2_key_type(rvu, dest, &dest_kwtype);
> +
> +	if (npc_mcam_idx_2_key_type(rvu, src, &src_kwtype))
> +		return;
> +

If this copy is skipped, could callers like npc_cn20k_defrag() unconditionally
enable the target entry with stale hardware state?

[ ... ]

> @@ -1188,7 +1197,8 @@ void npc_cn20k_read_mcam_entry(struct rvu *rvu, int blkaddr, u16 index,
>  	int kw = 0, bank;
>  	u8 kw_type;
>  
> -	npc_mcam_idx_2_key_type(rvu, index, &kw_type);
> +	if (npc_mcam_idx_2_key_type(rvu, index, &kw_type))
> +		return;

Could this leave caller variables uninitialized?

If a function like npc_update_dmac_value() passes an uninitialized u8 intf
parameter on the stack, and this function returns early, the caller might
branch on garbage data when later calling is_npc_intf_tx(intf).

Would it make sense to refactor these functions to return an int so that
failures can be propagated and gracefully handled by the callers?
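
A minimal sketch of the refactor being asked about, in user space and with hypothetical names (`sketch_idx_to_key_type()`, `sketch_read_entry()`, and the index range are all made up for illustration): the reader returns an int, so the caller can tell that its output variable was never written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: fails for out-of-range indices. */
static int sketch_idx_to_key_type(int index, uint8_t *kw_type)
{
	if (index < 0 || index >= 16)
		return -EINVAL;

	*kw_type = (uint8_t)(index & 1);
	return 0;
}

/* Returns 0 and fills *intf, or a negative errno leaving *intf untouched. */
static int sketch_read_entry(int index, uint8_t *intf)
{
	uint8_t kw_type;
	int rc;

	assert(intf != NULL);

	rc = sketch_idx_to_key_type(index, &kw_type);
	if (rc)
		return rc;	/* propagate instead of returning silently */

	*intf = kw_type;
	return 0;
}
```

A caller would then check the return value before branching on `intf`.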


* Re: [PATCH net v2] vxlan: fix NULL vn6_sock dereference in vxlan_igmp_join() and vxlan_igmp_leave()
From: Weiming Shi @ 2026-04-18 16:08 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, netdev, Xiang Mei
In-Reply-To: <20260418155843.GA808294@shredder>

On 26-04-18 18:58, Ido Schimmel wrote:
> On Sat, Apr 18, 2026 at 04:41:12AM -0700, Weiming Shi wrote:
> > vxlan_sock_add() tolerates IPv6 socket creation failure with
> > -EAFNOSUPPORT (e.g. ipv6.disable=1), leaving vn6_sock as NULL while
> > successfully creating vn4_sock. vxlan_igmp_join() and
> > vxlan_igmp_leave() then crash when they dereference the NULL vn6_sock
> > for VNI filter entries with IPv6 multicast groups:
> > 
> >  Oops: general protection fault, probably for non-canonical address
> >       0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
> >  KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
> >  RIP: 0010:vxlan_igmp_join (drivers/net/vxlan/vxlan_multicast.c:40)
> >  Call Trace:
> >   vxlan_multicast_join (drivers/net/vxlan/vxlan_multicast.c:195)
> >   vxlan_open (drivers/net/vxlan/vxlan_core.c:2965)
> >   __dev_open (net/core/dev.c:1704)
> >   __dev_change_flags (net/core/dev.c:9781)
> >   do_setlink.isra.0 (net/core/rtnetlink.c:3180)
> >   rtnl_newlink (net/core/rtnetlink.c:4238)
> >   rtnetlink_rcv_msg (net/core/rtnetlink.c:6921)
> > 
> > Skip the IPv6 multicast join/leave when vn6_sock is NULL, consistent
> > with how vxlan_sock_add() tolerates missing IPv6 support.
> > 
> > Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
> > Reported-by: Xiang Mei <xmei5@asu.edu>
> > Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> 
> AFAICT, this is the same patch as:
> 
> https://lore.kernel.org/netdev/20260323095544.3311285-4-bestswngs@gmail.com/
> 
> If you disagree with the feedback, then please comment there instead of
> reposting the patch.
> 
Apologies for the duplicate posting - I should have followed up on the
original.

Thanks,
Weiming Shi
> > ---
> > v2:
> >   - drop sock4 NULL checks 
> > 
> >  drivers/net/vxlan/vxlan_multicast.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/net/vxlan/vxlan_multicast.c b/drivers/net/vxlan/vxlan_multicast.c
> > index a7f2d67dc61b..e6aa5ab1c939 100644
> > --- a/drivers/net/vxlan/vxlan_multicast.c
> > +++ b/drivers/net/vxlan/vxlan_multicast.c
> > @@ -37,6 +37,9 @@ int vxlan_igmp_join(struct vxlan_dev *vxlan, union vxlan_addr *rip,
> >  	} else {
> >  		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
> >  
> > +		if (!sock6)
> > +			return 0;
> > +
> >  		sk = sock6->sock->sk;
> >  		lock_sock(sk);
> >  		ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex,
> 
> This line changed in commit 29ae61b2fe7e ("drivers: net: drop ipv6_stub
> usage and use direct function calls")
> 
> > @@ -71,6 +74,9 @@ int vxlan_igmp_leave(struct vxlan_dev *vxlan, union vxlan_addr *rip,
> >  	} else {
> >  		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
> >  
> > +		if (!sock6)
> > +			return 0;
> > +
> >  		sk = sock6->sock->sk;
> >  		lock_sock(sk);
> >  		ret = ipv6_stub->ipv6_sock_mc_drop(sk, ifindex,
> > -- 
> > 2.43.0
> > 


* Re: [PATCH net v2] vxlan: fix NULL vn6_sock dereference in vxlan_igmp_join() and vxlan_igmp_leave()
From: Ido Schimmel @ 2026-04-18 15:58 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, netdev, Xiang Mei
In-Reply-To: <20260418114110.2602784-3-bestswngs@gmail.com>

On Sat, Apr 18, 2026 at 04:41:12AM -0700, Weiming Shi wrote:
> vxlan_sock_add() tolerates IPv6 socket creation failure with
> -EAFNOSUPPORT (e.g. ipv6.disable=1), leaving vn6_sock as NULL while
> successfully creating vn4_sock. vxlan_igmp_join() and
> vxlan_igmp_leave() then crash when they dereference the NULL vn6_sock
> for VNI filter entries with IPv6 multicast groups:
> 
>  Oops: general protection fault, probably for non-canonical address
>       0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
>  RIP: 0010:vxlan_igmp_join (drivers/net/vxlan/vxlan_multicast.c:40)
>  Call Trace:
>   vxlan_multicast_join (drivers/net/vxlan/vxlan_multicast.c:195)
>   vxlan_open (drivers/net/vxlan/vxlan_core.c:2965)
>   __dev_open (net/core/dev.c:1704)
>   __dev_change_flags (net/core/dev.c:9781)
>   do_setlink.isra.0 (net/core/rtnetlink.c:3180)
>   rtnl_newlink (net/core/rtnetlink.c:4238)
>   rtnetlink_rcv_msg (net/core/rtnetlink.c:6921)
> 
> Skip the IPv6 multicast join/leave when vn6_sock is NULL, consistent
> with how vxlan_sock_add() tolerates missing IPv6 support.
> 
> Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>

AFAICT, this is the same patch as:

https://lore.kernel.org/netdev/20260323095544.3311285-4-bestswngs@gmail.com/

If you disagree with the feedback, then please comment there instead of
reposting the patch.

> ---
> v2:
>   - drop sock4 NULL checks 
> 
>  drivers/net/vxlan/vxlan_multicast.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/vxlan/vxlan_multicast.c b/drivers/net/vxlan/vxlan_multicast.c
> index a7f2d67dc61b..e6aa5ab1c939 100644
> --- a/drivers/net/vxlan/vxlan_multicast.c
> +++ b/drivers/net/vxlan/vxlan_multicast.c
> @@ -37,6 +37,9 @@ int vxlan_igmp_join(struct vxlan_dev *vxlan, union vxlan_addr *rip,
>  	} else {
>  		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
>  
> +		if (!sock6)
> +			return 0;
> +
>  		sk = sock6->sock->sk;
>  		lock_sock(sk);
>  		ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex,

This line changed in commit 29ae61b2fe7e ("drivers: net: drop ipv6_stub
usage and use direct function calls")

> @@ -71,6 +74,9 @@ int vxlan_igmp_leave(struct vxlan_dev *vxlan, union vxlan_addr *rip,
>  	} else {
>  		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
>  
> +		if (!sock6)
> +			return 0;
> +
>  		sk = sock6->sock->sk;
>  		lock_sock(sk);
>  		ret = ipv6_stub->ipv6_sock_mc_drop(sk, ifindex,
> -- 
> 2.43.0
> 


* Re: [PATCH net] ipv6: fix possible UAF in icmpv6_rcv()
From: Ido Schimmel @ 2026-04-18 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	David Ahern, netdev, eric.dumazet
In-Reply-To: <20260416103505.2380753-1-edumazet@google.com>

On Thu, Apr 16, 2026 at 10:35:05AM +0000, Eric Dumazet wrote:
> Caching saddr and daddr before pskb_pull() is problematic
> since skb->head can change.
> 
> Remove these temporary variables:
> 
> - We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr
>   when net_dbg_ratelimited() is called in the slow path.
> 
> - Avoid potential future misuse after pskb_pull() call.
> 
> Fixes: 4b3418fba0fe ("ipv6: icmp: include addresses in debug messages")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
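
The hazard named in the quoted commit message (a pointer cached before a call that may move the buffer) can be modelled in user space. This sketch uses realloc() as a stand-in for the reallocation; `struct sketch_buf`, `sketch_hdr()` and `sketch_grow()` are hypothetical and do not correspond to any sk_buff API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sketch_buf {
	unsigned char *head;	/* may move whenever the buffer is grown */
	size_t len;
	size_t hdr_off;		/* offset of the header within head */
};

/* Re-derive the header pointer from head on every use. */
static unsigned char *sketch_hdr(const struct sketch_buf *b)
{
	return b->head + b->hdr_off;
}

/* Grow the buffer; like the pull helpers, this can change head. */
static int sketch_grow(struct sketch_buf *b, size_t new_len)
{
	unsigned char *p;

	assert(new_len >= b->len);

	p = realloc(b->head, new_len);
	if (!p)
		return -1;

	memset(p + b->len, 0, new_len - b->len);
	b->head = p;
	b->len = new_len;
	return 0;
}
```

Any pointer obtained from `sketch_hdr()` before `sketch_grow()` must be treated as stale afterwards, which is why the sketch recomputes it at each use instead of caching it.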


* Re: [PATCH] net: hamachi: fix divide by zero in hamachi_init_one
From: Andrew Lunn @ 2026-04-18 15:34 UTC (permalink / raw)
  To: Mingyu Wang
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, tglx, mingo, netdev,
	linux-kernel
In-Reply-To: <20260418121804.149171-1-25181214217@stu.xidian.edu.cn>

On Sat, Apr 18, 2026 at 08:18:04PM +0800, Mingyu Wang wrote:
> During the hardware initialization phase in hamachi_init_one(), the driver
> reads the PCIClkMeas register to calculate the PCI bus frequency.
> 
> The current code attempts to prevent a divide-by-zero error using a ternary
> operator: `i ? 2000/(i&0x7f) : 0`. However, this check is flawed. The highest
> bit of `i` (0x80) acts as a ready flag. If unreliable hardware or a malicious
> virtual device returns a value where the ready bit is set but the lower 7 bits
> are zero (e.g., 0x80), the condition `i` evaluates to true, but `(i & 0x7f)`
> evaluates to 0. This results in a fatal divide-by-zero exception.
> 
> This bug was discovered during an automated virtual device fuzzing campaign
> testing the hardware-software trust boundary. When the hardware returns 0x80,
> it bypassed the readiness while-loop but triggered the divide error. In our
> tests, this panic interrupted the module loading process, further triggering
> a KASAN slab-out-of-bounds in the module error path, and ultimately leading
> to a multi-core soft lockup and RCU stall.

Isn't that a good result of somebody trying to use emulated hardware
with bad behaviour? The machine grinds to a halt? So it is not
exploitable.

What happens with your patch in place? How are you reporting the
hardware is attacking the machine, and the hardware should not be
trusted?

	Andrew
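
For reference, the flaw described in the quoted commit message can be reproduced with plain arithmetic. The sketch below is user-space C with hypothetical names; it assumes the register is read as a single byte, with the top bit as the ready flag and the lower 7 bits as the divisor, as the description states.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CLK_COUNT_MASK 0x7fu	/* lower 7 bits: measurement count */

/* True when the original guard "i ? 2000/(i&0x7f) : 0" would divide by zero. */
static bool sketch_old_guard_divides_by_zero(unsigned int reg)
{
	return reg != 0 && (reg & SKETCH_CLK_COUNT_MASK) == 0;
}

/* Test the divisor itself rather than the whole register value. */
static unsigned int sketch_pci_freq(unsigned int reg)
{
	unsigned int count = reg & SKETCH_CLK_COUNT_MASK;

	assert(reg <= 0xffu);	/* assumption: the register is one byte wide */

	return count ? 2000 / count : 0;
}
```

Note that this only shows how to avoid the division; it does not answer the question above of how the driver should react to hardware that returns such a value.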


* Re: [PATCH] net: ipv4: igmp: add sysctl option to ignore inbound llm_reports
From: Ido Schimmel @ 2026-04-18 15:29 UTC (permalink / raw)
  To: Steffen Trumtrar
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern, netdev,
	linux-doc, linux-kernel
In-Reply-To: <20260415-v7-0-topic-igmp-llm-drop-v1-1-1367bfbb898e@pengutronix.de>

On Wed, Apr 15, 2026 at 12:26:13PM +0200, Steffen Trumtrar wrote:
> Add a new sysctl option 'igmp_link_local_mcast_reports_drop' that allows
> dropping inbound IGMP reports for link-local multicast groups in the
> 224.0.0.X range. This can be used to prevent the local system from
> processing IGMP reports for link local multicast groups and therefore
> let the kernel still send the own outbound IGMP reports.

OK, but what is the motivation to keep sending IGMP reports for
link-local multicast groups when the host already received such reports
from other hosts on the network? Why are link-local groups special in
this case?

AFAICT, igmp_heard_report() implements report suppression according to
RFC 2236 and it doesn't mention special behavior for link-local groups:
"If the host receives another host's Report (version 1 or 2) while it
has a timer running, it stops its timer for the specified group and does
not send a Report, in order to suppress duplicate Reports."

Also, I'm not convinced we need a new sysctl (that we will need to keep
forever) for this. It should be possible to drop such packets using tc
(tc-u32 / tc-bpf) or netfilter.

[...]

> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6921d8594b849..2da4cd6ac7202 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -2306,6 +2306,18 @@ igmp_link_local_mcast_reports - BOOLEAN
>  
>  	Default TRUE
>  
> +igmp_link_local_mcast_reports_drop - BOOLEAN
> +	Drop inbound IGMP reports for link local multicast groups in
> +	the 224.0.0.X range. When enabled, IGMP membership reports for
> +	link local multicast addresses are silently dropped without
> +	processing.
> +	When the kernel gets inbound IGMP reports it stops sending own
> +	IGMP reports. With allowing to drop and process the inbound reports,
> +	the kernel will not stop sending the own reports, even when IGMP
> +	reports from other hosts are seen on the network.
> +
> +	Default FALSE

[...]

> diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
> index a674fb44ec25b..3a4932e4108bd 100644
> --- a/net/ipv4/igmp.c
> +++ b/net/ipv4/igmp.c
> @@ -931,6 +931,8 @@ static bool igmp_heard_report(struct in_device *in_dev, __be32 group)
>  	if (ipv4_is_local_multicast(group) &&
>  	    !READ_ONCE(net->ipv4.sysctl_igmp_llm_reports))
>  		return false;
> +	if (READ_ONCE(net->ipv4.sysctl_igmp_llm_reports_drop))
> +		return true;
>  
>  	rcu_read_lock();
>  	for_each_pmc_rcu(in_dev, im) {

The documentation says that this sysctl is specifically about link-local
groups, but it drops reports from all groups...
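
To make the mismatch concrete, here is a user-space sketch of what a check limited to link-local groups might look like, next to the RFC 2236 suppression it bypasses. All names are hypothetical, addresses are in host byte order for simplicity, and this is not a proposal for the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for 224.0.0.0/24 (addr in host byte order in this sketch). */
static bool sketch_is_local_multicast(uint32_t addr)
{
	return (addr & 0xffffff00u) == 0xe0000000u;
}

/*
 * Returns true when the inbound report is dropped.
 * Otherwise the report is "heard" and the pending timer is stopped,
 * which suppresses the host's own report.
 */
static bool sketch_heard_report(uint32_t group, bool llm_reports_drop,
				bool *timer_running)
{
	assert(timer_running != NULL);

	if (llm_reports_drop && sketch_is_local_multicast(group))
		return true;	/* dropped: own report will still be sent */

	*timer_running = false;	/* report suppression */
	return false;
}
```

In this form a report for a group outside 224.0.0.X is still processed even when the knob is enabled, unlike the hunk quoted above.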


* Re: pre-boot plugged SFP autoneg advertisement
From: Andrew Lunn @ 2026-04-18 15:25 UTC (permalink / raw)
  To: markus.stockhausen
  Cc: linux, hkallweit1, netdev, 'Jonas Jelonek', jan
In-Reply-To: <007c01dccf15$9b4622c0$d1d26840$@gmx.de>

On Sat, Apr 18, 2026 at 11:27:40AM +0200, markus.stockhausen@gmx.de wrote:
> Hi,
> 
> I'm currently analyzing an issue where a pre-boot-plugged SFP module 
> comes up with autoneg=no advertisement during boot. After an
> unplug/replug autoneg=yes advertisement is chosen. 
> 
> The following addition in phylink_start() just before the call to
> phylink_mac_initial_config() mitigates this.
> 
> +  /* If an SFP module was already present before phylink_start() was
> +   * called, phylink_sfp_set_config() was unable to call
> +   * phylink_mac_initial_config() as phylink was not yet started.
> +   * Ensure the SFP capabilities are reflected in advertising.
> +   */
> +  if (pl->sfp_bus && !linkmode_empty(pl->sfp_support))
> +    linkmode_copy(pl->link_config.advertising, pl->sfp_support);

Let me see if I have the call chain correct. This is net-next/main
from today.

phylink_sfp_connect_phy() ->
  phylink_sfp_config_phy

        if (changed && !test_bit(PHYLINK_DISABLE_STOPPED,
                                 &pl->phylink_disable_state))
                phylink_mac_initial_config(pl, false);

You are saying PHYLINK_DISABLE_STOPPED is set, so
phylink_mac_initial_config() is not called.

What I don't see is how phylink_mac_initial_config() does the
linkmode_copy() you are adding.

	Andrew


* Re: [PATCH net v2] slip: reject VJ receive packets on instances with no rstate array
From: Simon Horman @ 2026-04-18 15:19 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Xiang Mei
In-Reply-To: <aeOXJoeq6VkCnqAH@SLSGDTSWING002>

On Sat, Apr 18, 2026 at 10:37:26PM +0800, Weiming Shi wrote:
> On 26-04-18 13:39, Simon Horman wrote:
> > On Thu, Apr 16, 2026 at 04:41:31AM +0800, Weiming Shi wrote:

...

> > I do note that Sashiko flags some other problems in this code.
> > I do not think that needs to delay progress of this patch.
> > But you may wish to look into them as follow-up work.
> 
> Thanks for your review. 
> 
> I've already sent two follow-up patches for the decode()/pull16() 
> bounds-checking issues:
> 
>     [PATCH net] slip: fix slab-out-of-bounds write in slhc_uncompress()
>     https://lore.kernel.org/netdev/20260415213359.335657-2-bestswngs@gmail.com/
> 
>     [PATCH net] slip: bound decode() reads against the compressed packet length
>     https://lore.kernel.org/netdev/20260416100147.531855-5-bestswngs@gmail.com/

Great, thanks!


* Re: [PATCH net-next v4 5/5] selftests: net: bridge: add MRC and QQIC field encoding tests
From: Ido Schimmel @ 2026-04-18 14:49 UTC (permalink / raw)
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <CAE2MWkmvdAVMBvJ9xKgEzjJZ010=oY_ZoG==FBjHEisHEMrS8Q@mail.gmail.com>

On Fri, Apr 17, 2026 at 11:27:06AM +0530, Ujjal Roy wrote:
> On Mon, Apr 13, 2026 at 2:18 PM Ido Schimmel <idosch@nvidia.com> wrote:
> >
> > See some comments below, but note that net-next is closed:
> >
> > https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
> >
> > So you can either wait with v5 until it is open again or post it as RFC
> > so that we can at least review (but not merge) it while net-next is
> > closed.
> 
> Let me clarify the requested changes inline, so that v5 is ready when
> net-next reopens. You can ask me to send it as RFC v5 if you have
> doubts about the inline answers.

I checked the proposed changes and they look fine to me.

Thanks


* Re: [PATCH net v2] slip: reject VJ receive packets on instances with no rstate array
From: Weiming Shi @ 2026-04-18 14:37 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Xiang Mei
In-Reply-To: <20260418123929.GE280379@horms.kernel.org>

On 26-04-18 13:39, Simon Horman wrote:
> On Thu, Apr 16, 2026 at 04:41:31AM +0800, Weiming Shi wrote:
> > slhc_init() accepts rslots == 0 as a valid configuration, with the
> > documented meaning of 'no receive compression'. In that case the
> > allocation loop in slhc_init() is skipped, so comp->rstate stays
> > NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
> > slcompress).
> > 
> > The receive helpers do not defend against that configuration.
> > slhc_uncompress() dereferences comp->rstate[x] when the VJ header
> > carries an explicit connection ID, and slhc_remember() later assigns
> > cs = &comp->rstate[...] after only comparing the packet's slot number
> > to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
> > range check, and the code dereferences a NULL rstate.
> > 
> > The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
> > stores its argument in a signed int, and (val >> 16) uses arithmetic
> > shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
> > is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
> > /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
> > is reachable from an unprivileged user namespace. Once the malformed
> > VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
> > frame that selects slot 0 crashes the kernel in softirq context:
> > 
> >  Oops: general protection fault, probably for non-canonical
> >        address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
> >  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> >  RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
> >  Call Trace:
> >   <TASK>
> >   ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
> >   ppp_input (drivers/net/ppp/ppp_generic.c:2359)
> >   ppp_async_process (drivers/net/ppp/ppp_async.c:492)
> >   tasklet_action_common (kernel/softirq.c:926)
> >   handle_softirqs (kernel/softirq.c:623)
> >   run_ksoftirqd (kernel/softirq.c:1055)
> >   smpboot_thread_fn (kernel/smpboot.c:160)
> >   kthread (kernel/kthread.c:436)
> >   ret_from_fork (arch/x86/kernel/process.c:164)
> >   </TASK>
> > 
> > Reject the receive side on such instances instead of touching rstate.
> > slhc_uncompress() falls through to its existing 'bad' label, which
> > bumps sls_i_error and enters the toss state. slhc_remember() mirrors
> > that with an explicit sls_i_error increment followed by slhc_toss();
> > the sls_i_runt counter is not used here because a missing rstate is
> > an internal configuration state, not a runt packet.
> > 
> > The transmit path is unaffected: the only in-tree caller that picks
> > rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
> > slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
> > and slhc_compress() continues to work.
> > 
> > Fixes: b5451d783ade ("slip: Move the SLIP drivers")
> 
> AI review points out that the cited commit moves code but doesn't
> add this bug.
> 
> It seems to me that this bug has existed since the beginning of git
> history. If so, the Fixes tag should be:
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> 
> > Reported-by: Xiang Mei <xmei5@asu.edu>
> > Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> > ---
> > v2:
> > - slhc_remember(): use sls_i_error instead of sls_i_runt for the
> >   missing-rstate case; it is a configuration error, not a runt packet
> >   (Simon).
> > - slhc_uncompress(): goto bad instead of returning 0, so the instance
> >   also enters SLF_TOSS on the first rejected frame.
> 
> Otherwise this looks good to me:
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> 
> I do note that Sashiko flags some other problems in this code.
> I do not think that needs to delay progress of this patch.
> But you may wish to look into them as follow-up work.

Thanks for your review. 

I've already sent two follow-up patches for the decode()/pull16() 
bounds-checking issues:

    [PATCH net] slip: fix slab-out-of-bounds write in slhc_uncompress()
    https://lore.kernel.org/netdev/20260415213359.335657-2-bestswngs@gmail.com/

    [PATCH net] slip: bound decode() reads against the compressed packet length
    https://lore.kernel.org/netdev/20260416100147.531855-5-bestswngs@gmail.com/

Best regards,
Weiming Shi


^ permalink raw reply
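
The PPPIOCSMAXCID arithmetic described in the quoted commit message above can be reproduced in user space. The following is a minimal sketch, not the kernel code: `toy_split_maxcid()`, `toy_slot_allowed()` and `struct toy_slots` are hypothetical names introduced for illustration, and the sketch assumes that `>>` on a negative int sign-extends (implementation-defined in ISO C, but the behaviour of GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: the slot counts that would reach slhc_init(). */
struct toy_slots {
	int rslots;
	int tslots;
};

/*
 * Split a PPPIOCSMAXCID-style argument as the commit message describes:
 * the high 16 bits select the receive slots, the low 16 bits the
 * transmit slots. Assumption: val is a signed int and >> sign-extends.
 */
static struct toy_slots toy_split_maxcid(int val)
{
	struct toy_slots s;
	int val2 = val >> 16;	/* -1 for 0xffff0000 with an arithmetic shift */

	s.rslots = val2 + 1;	/* becomes 0: "no receive compression" */
	s.tslots = (val & 0xffff) + 1;
	return s;
}

/*
 * Guarded slot check in the spirit of the fix: a range check against
 * rslot_limit alone lets slot 0 through when the limit is 0, so the
 * missing rstate array has to be rejected explicitly.
 */
static int toy_slot_allowed(const void *rstate, int rslot_limit, int slot)
{
	assert(rslot_limit >= 0);
	if (rstate == NULL)
		return 0;
	return slot >= 0 && slot <= rslot_limit;
}
```

With the argument 0xffff0000 the sketch yields zero receive slots and one transmit slot, matching the slhc_init(0, 1) call mentioned in the commit message.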

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <75d98880-afcd-43f9-8bd5-b874fa5690f5@gmail.com>

On 4/18/26 15:46, Justin Iurman wrote:
> On 4/18/26 15:15, Eric Dumazet wrote:
>> On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman 
>> <justin.iurman@gmail.com> wrote:
>>>
>>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>>> Hi Justin,
>>>>
>>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>>> [...]
>>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>>>            .extra1        = SYSCTL_ZERO,
>>>>>>            .extra2        = &flowlabel_reflect_max,
>>>>>>        },
>>>>>> +    {
>>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>>> +        .maxlen        = sizeof(int),
>>>>>> +        .mode        = 0644,
>>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>>> +        .extra1        = SYSCTL_ONE,
>>>>>> +    },
>>>>>>        {
>>>>>>            .procname    = "max_dst_opts_number",
>>>>>>            .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>>
>>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>>
>>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want
>>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
>>>>> about time) this series [1] to enforce ordering and occurrences of
>>>>> Extension Headers, which is based on an IETF draft [2] (FYI,
>>>>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
>>>>> occurrences in this code path too, instead of relying on a sysctl.
>>>>> Let's keep both code paths consistent.
>>>
>>> Hi Daniel,
>>>
>>>> Hm, that series [1] should probably go to net instead of net-next, 
>>>> but atm
>>>
>>> +1, would make sense.
>>>
>>>> hasn't moved since a month. I'd still think max_ext_hdrs_number 
>>>> would be
>>>> useful given it has less complexity also for stable, but I guess 
>>>> ultimately
>>>> up to maintainers..
>>>
>>> In the short term, I agree. What worries me is that we end up with a
>>> redundant, or even useless, sysctl once the other series is applied,
>>> which will only increase user confusion.
>>
>> Given the amount of bugs in this code, a sysctl is safe and quite 
>> reasonable.
>>
>> No one will object when it is eventually removed (or has no action)
>>
>> For the record, I approve Daniel's patch.
> 
> Fair enough. If there is consensus on this patch, then let me just 
> suggest two changes:
> 
> - make it clear in the sysctl description that it mainly applies to TX 
> (as opposed to the other series [1] discussed earlier that applies to RX)

Sorry, I meant it does not apply to core RX (ip6_rcv()), which is what 
series [1] does.

> - set the default to 8 (which should be the max value) instead of 32, as 
> per RFC8200, Sec. 4.1


^ permalink raw reply

* [PATCH net v2] net/rds: zero per-item info buffer before handing it to visitors
From: Michael Bommarito @ 2026-04-18 14:10 UTC (permalink / raw)
  To: Allison Henderson, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Sharath Srinivasan, Simon Horman, netdev, linux-rdma, rds-devel,
	linux-kernel, stable
In-Reply-To: <20260417141916.494761-1-michael.bommarito@gmail.com>

rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
caller-allocated on-stack u64 buffer to a per-connection visitor and
then copy the full item_len bytes back to user space via
rds_info_copy() regardless of how much of the buffer the visitor
actually wrote.

rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
write a subset of their output struct when the underlying
rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
and the two GIDs via explicit memsets). Several u32 fields
(max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
cache_allocs) and the 2-byte alignment hole between sl and
cache_allocs remain as whatever stack contents preceded the visitor
call and are then memcpy_to_user()'d out to user space.

struct rds_info_rdma_connection and struct rds6_info_rdma_connection
are the only rds_info_* structs in include/uapi/linux/rds.h that are
not marked __attribute__((packed)), so they have a real alignment
hole. The other info visitors (rds_conn_info_visitor,
rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
their packed output struct today and are not known to be vulnerable,
but a future visitor that adds a conditional write-path would have
the same bug.

Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
any netdev is sufficient), sendto()'s any peer on the same subnet
(fails cleanly but installs an rds_connection in the global hash in
RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
bytes of stack garbage including kernel text/data pointers:

    0..7   0a 63 00 01 0a 63 00 02     src=10.99.0.1 dst=10.99.0.2
    8..39  00 ...                      gids (memset-zeroed)
    40..47 e0 92 a3 81 ff ff ff ff     kernel pointer (max_send_wr)
    48..55 7f 37 b5 81 ff ff ff ff     kernel pointer (rdma_mr_max)
    56..59 01 00 08 00                 rdma_mr_size (garbage)
    60..61 00 00                       tos, sl
    62..63 00 00                       alignment padding
    64..67 18 00 00 00                 cache_allocs (garbage)

Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
and rds_walk_conn_path_info() before invoking the visitor. This
covers the IPv4/IPv6 IB visitors and hardens all current and future
visitors against the same class of bug.

No functional change for visitors that fully populate their output.

Changes in v2:
- retarget at the net tree (subject prefix "[PATCH net v2]",
  net/rds: prefix in the title)
- add Cc: stable@vger.kernel.org
- pick up Reviewed-by tags from Sharath Srinivasan and
  Allison Henderson

Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Assisted-by: Claude:claude-opus-4-7
---
 net/rds/connection.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 412441aaa298..c10b7ed06c49 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -701,6 +701,13 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 	     i++, head++) {
 		hlist_for_each_entry_rcu(conn, head, c_hash_node) {

+			/* Zero the per-item buffer before handing it to the
+			 * visitor so any field the visitor does not write -
+			 * including implicit alignment padding - cannot leak
+			 * stack contents to user space via rds_info_copy().
+			 */
+			memset(buffer, 0, item_len);
+
 			/* XXX no c_lock usage.. */
 			if (!visitor(conn, buffer))
 				continue;
@@ -750,6 +757,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 			 */
 			cp = conn->c_path;

+			/* Zero the per-item buffer for the same reason as
+			 * rds_for_each_conn_info(): any byte the visitor
+			 * does not write (including alignment padding) must
+			 * not leak stack contents via rds_info_copy().
+			 */
+			memset(buffer, 0, item_len);
+
 			/* XXX no cp_lock usage.. */
 			if (!visitor(cp, buffer))
 				continue;
-- 
2.53.0

^ permalink raw reply related
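
The class of bug fixed by the patch above (a visitor that only partially fills a caller-provided buffer, which is then copied out in full) can be illustrated in user space. This is a minimal sketch under stated assumptions: `struct toy_info`, `toy_visitor()` and `toy_stale_bytes()` are hypothetical stand-ins rather than the RDS structures or functions, the 0xAA fill simulates stale stack contents, and the 2-byte hole assumes an ABI that aligns `uint32_t` to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_STALE 0xAA	/* pattern standing in for old stack contents */

/* Hypothetical, unpacked output struct with an alignment hole after sl. */
struct toy_info {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint32_t max_send_wr;
	uint8_t  tos;
	uint8_t  sl;
	/* typically 2 bytes of implicit padding here */
	uint32_t cache_allocs;
};

/* Fills every field only when the connection is up. */
static void toy_visitor(int conn_up, struct toy_info *info)
{
	info->src_addr = 0x0a630001;
	info->dst_addr = 0x0a630002;
	info->tos = 0;
	info->sl = 0;
	if (conn_up) {
		info->max_send_wr = 128;
		info->cache_allocs = 24;
	}
}

/*
 * Runs the visitor on a "dirty" item and returns how many bytes of the
 * copied-out item still hold the stale pattern. zero_first selects
 * whether the item is cleared before the visitor runs, as in the fix.
 */
static size_t toy_stale_bytes(int conn_up, int zero_first)
{
	struct toy_info item;
	unsigned char out[sizeof(item)];
	size_t i, stale = 0;

	memset(&item, TOY_STALE, sizeof(item));
	if (zero_first)
		memset(&item, 0, sizeof(item));
	toy_visitor(conn_up, &item);
	memcpy(out, &item, sizeof(item));	/* stands in for the copy to user space */

	for (i = 0; i < sizeof(out); i++)
		if (out[i] == TOY_STALE)
			stale++;
	return stale;
}
```

Without the up-front memset, the two unwritten u32 fields (and usually the padding) carry the stale pattern out; with it, nothing does, regardless of which fields the visitor chose to write.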

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 13:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <CANn89i+Y0jctj8=tCHFP5jDSJBAWR=RvNfagammc-WqU6EdPRw@mail.gmail.com>

On 4/18/26 15:15, Eric Dumazet wrote:
> On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman <justin.iurman@gmail.com> wrote:
>>
>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>> Hi Justin,
>>>
>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>> [...]
>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>>            .extra1        = SYSCTL_ZERO,
>>>>>            .extra2        = &flowlabel_reflect_max,
>>>>>        },
>>>>> +    {
>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>> +        .maxlen        = sizeof(int),
>>>>> +        .mode        = 0644,
>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>> +        .extra1        = SYSCTL_ONE,
>>>>> +    },
>>>>>        {
>>>>>            .procname    = "max_dst_opts_number",
>>>>>            .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>
>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>
>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want
>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
>>>> about time) this series [1] to enforce ordering and occurrences of
>>>> Extension Headers, which is based on an IETF draft [2] (FYI,
>>>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
>>>> occurrences in this code path too, instead of relying on a sysctl.
>>>> Let's keep both code paths consistent.
>>
>> Hi Daniel,
>>
>>> Hm, that series [1] should probably go to net instead of net-next, but atm
>>
>> +1, would make sense.
>>
>>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>>> useful given it has less complexity also for stable, but I guess ultimately
>>> up to maintainers..
>>
>> In the short term, I agree. What worries me is that we end up with a
>> redundant, or even useless, sysctl once the other series is applied,
>> which will only increase user confusion.
> 
> Given the amount of bugs in this code, a sysctl is safe and quite reasonable.
> 
> No one will object when it is eventually removed (or has no action)
> 
> For the record, I approve Daniel's patch.

Fair enough. If there is consensus on this patch, then let me just 
suggest two changes:

- make it clear in the sysctl description that it mainly applies to TX 
(as opposed to the other series [1] discussed earlier that applies to RX)

- set the default to 8 (which should be the max value) instead of 32, as 
per RFC8200, Sec. 4.1

^ permalink raw reply
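
The kind of limit under discussion in this thread (a cap on how many extension headers are walked) can be sketched in user space. This is an illustrative sketch only, not the patch's implementation: `toy_count_ext_hdrs()` and `toy_is_ext_hdr()` are hypothetical names, the buffer is assumed to start right after the fixed IPv6 header, and only Hop-by-Hop, Routing, Fragment and Destination Options headers are recognized (AH and others are deliberately left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers for the extension headers handled here. */
#define TOY_NEXTHDR_HOP		0
#define TOY_NEXTHDR_ROUTING	43
#define TOY_NEXTHDR_FRAGMENT	44
#define TOY_NEXTHDR_DEST	60

static int toy_is_ext_hdr(uint8_t nh)
{
	return nh == TOY_NEXTHDR_HOP || nh == TOY_NEXTHDR_ROUTING ||
	       nh == TOY_NEXTHDR_FRAGMENT || nh == TOY_NEXTHDR_DEST;
}

/*
 * Walk the extension header chain starting at pkt, whose first header
 * has type first_nh. Returns the number of extension headers seen, or
 * -1 if the chain is truncated or longer than max_ext_hdrs.
 */
static int toy_count_ext_hdrs(const uint8_t *pkt, size_t len,
			      uint8_t first_nh, int max_ext_hdrs)
{
	size_t off = 0;
	uint8_t nh = first_nh;
	int count = 0;

	assert(pkt != NULL);
	while (toy_is_ext_hdr(nh)) {
		size_t hlen;

		if (count >= max_ext_hdrs)
			return -1;	/* limit reached: stop parsing */
		if (len - off < 8)
			return -1;	/* every header is at least 8 bytes */

		/* Fragment header is fixed size; others use 8-octet units. */
		hlen = (nh == TOY_NEXTHDR_FRAGMENT) ?
			8 : ((size_t)pkt[off + 1] + 1) * 8;
		if (hlen > len - off)
			return -1;

		nh = pkt[off];		/* next header field */
		off += hlen;
		count++;
	}
	return count;
}
```

A caller would pass the configured limit as `max_ext_hdrs` (for example the value 8 suggested above) and drop the packet on a negative return.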

* Re: [PATCH net 1/2] tcp: call sk_data_ready() after listener migration
From: 上勾拳 @ 2026-04-18 13:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	shuah, tamird, linux-kernel, linux-kselftest, stable
In-Reply-To: <CANn89iJOfDB+5oORjWPbP7Z1SyqUhMzVR8u8i+8P8MPDgg_EGA@mail.gmail.com>

Thanks Eric, you're right.

After inet_csk_reqsk_queue_add() succeeds, the ref acquired in
reuseport_migrate_sock() is effectively transferred to
nreq->rsk_listener. Another CPU can then dequeue nreq (via
accept() or listener shutdown), hit reqsk_put(), and drop that
listener ref.

Since listeners are SOCK_RCU_FREE, the post-queue_add()
dereferences of nsk should be under rcu_read_lock()/
rcu_read_unlock(), which also covers the existing sock_net(nsk)
access in that path.

I also checked reqsk_timer_handler(): reqsk_queue_migrated()
there is only accounting, and once nreq becomes visible via
inet_ehash_insert(), the handler no longer appears to
dereference nsk.

I'll fold this into v2.


On Sat, Apr 18, 2026 at 14:02, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Apr 17, 2026 at 9:17 PM Zhenzhong Wu <jt26wzz@gmail.com> wrote:
> >
> > When inet_csk_listen_stop() migrates an established child socket from
> > a closing listener to another socket in the same SO_REUSEPORT group,
> > the target listener gets a new accept-queue entry via
> > inet_csk_reqsk_queue_add(), but that path never notifies the target
> > listener's waiters.
> >
> > As a result, a nonblocking accept() still succeeds because it checks
> > the accept queue directly, but waiters that sleep for listener
> > readiness can remain asleep until another connection generates a
> > wakeup. This affects poll()/epoll_wait()-based waiters, and can also
> > leave a blocking accept() asleep after migration even though the
> > child is already in the target listener's accept queue.
> >
> > This was observed in a local test where listener A completed the
> > handshake, queued the child, and was closed before userspace called
> > accept(). The child was migrated to listener B, but listener B never
> > received a wakeup for the migrated accept-queue entry.
> >
> > Call READ_ONCE(nsk->sk_data_ready)(nsk) after a successful migration
> > in inet_csk_listen_stop().
> >
> > The reqsk_timer_handler() path does not need the same change:
> > half-open requests only become readable to userspace when the final
> > ACK completes the handshake, and tcp_child_process() already wakes
> > the listener in that case.
> >
> > Fixes: 54b92e841937 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 4ac3ae1bc..da1ce082f 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -1483,6 +1483,7 @@ void inet_csk_listen_stop(struct sock *sk)
> >                                         __NET_INC_STATS(sock_net(nsk),
> >                                                         LINUX_MIB_TCPMIGRATEREQSUCCESS);
> >                                         reqsk_migrate_reset(req);
> > +                                       READ_ONCE(nsk->sk_data_ready)(nsk);
>
> I think this is adding a potential UAF (Use After Free).
> @nsk might have been freed already by another thread/cpu.
> Note the existing code already has similar issues.
>
> Untested patch:
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 4ac3ae1bc1afc3a39f2790e39b4dda877dc3272b..287b6e01c4f71bfec3dd2a708f316224d9eb4a64 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1479,6 +1479,7 @@ void inet_csk_listen_stop(struct sock *sk)
>                         if (nreq) {
>                                 refcount_set(&nreq->rsk_refcnt, 1);
>
> +                               rcu_read_lock();
>                                 if (inet_csk_reqsk_queue_add(nsk, nreq, child)) {
>                                         __NET_INC_STATS(sock_net(nsk),
>                                                         LINUX_MIB_TCPMIGRATEREQSUCCESS);
> @@ -1489,7 +1490,7 @@ void inet_csk_listen_stop(struct sock *sk)
>                                         reqsk_migrate_reset(nreq);
>                                         __reqsk_free(nreq);
>                                 }
> -
> +                               rcu_read_unlock();
>                                 /* inet_csk_reqsk_queue_add() has already
>                                  * called inet_child_forget() on failure case.
>                                  */

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 13:18 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <d4c72730-2d74-4efe-8ede-50e3fe9658c8@iogearbox.net>

On 4/18/26 14:59, Daniel Borkmann wrote:
> On 4/18/26 2:50 PM, Justin Iurman wrote:
>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>> [...]
>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>>           .extra1        = SYSCTL_ZERO,
>>>>>           .extra2        = &flowlabel_reflect_max,
>>>>>       },
>>>>> +    {
>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>> +        .maxlen        = sizeof(int),
>>>>> +        .mode        = 0644,
>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>> +        .extra1        = SYSCTL_ONE,
>>>>> +    },
>>>>>       {
>>>>>           .procname    = "max_dst_opts_number",
>>>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>
>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>
>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want 
>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but 
>>>> it's about time) this series [1] to enforce ordering and occurrences 
>>>> of Extension Headers, which is based on an IETF draft [2] (FYI, 
>>>> draft-ietf-6man-eh-limits is dead). I think we should enforce 
>>>> ordering and occurrences in this code path too, instead of relying 
>>>> on a sysctl. Let's keep both code paths consistent.
>>
>> Hi Daniel,
>>
>>> Hm, that series [1] should probably go to net instead of net-next, 
>>> but atm
>>
>> +1, would make sense.
>>
>>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>>> useful given it has less complexity also for stable, but I guess 
>>> ultimately
>>> up to maintainers..
>>
>> In the short term, I agree. What worries me is that we end up with a 
>> redundant, or even useless, sysctl once the other series is applied, 
>> which will only increase user confusion.
> I'm thinking even if that series lands, and there is still odd hw out there
> where the enforcement of ordering is not in place, and users might be 
> forced
> to disable net.ipv6.enforce_ext_hdr_order, then the limit would still apply
> and protect them.

Agree. OTOH, IPv6 packets with out-of-order (or more than allowed) 
Extension Headers look suspicious and should probably be dropped by 
hosts anyway.

> Cheers,
> Daniel


^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Eric Dumazet @ 2026-04-18 13:15 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <b57f31a2-456e-4727-839a-bc2f0fb07855@gmail.com>

On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman <justin.iurman@gmail.com> wrote:
>
> On 4/18/26 14:26, Daniel Borkmann wrote:
> > Hi Justin,
> >
> > On 4/18/26 1:45 PM, Justin Iurman wrote:
> >> On 4/17/26 19:18, Daniel Borkmann wrote:
> > [...]
> >>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> >>> index d2cd33e2698d..93f865545a7c 100644
> >>> --- a/net/ipv6/sysctl_net_ipv6.c
> >>> +++ b/net/ipv6/sysctl_net_ipv6.c
> >>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
> >>>           .extra1        = SYSCTL_ZERO,
> >>>           .extra2        = &flowlabel_reflect_max,
> >>>       },
> >>> +    {
> >>> +        .procname    = "max_ext_hdrs_number",
> >>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
> >>> +        .maxlen        = sizeof(int),
> >>> +        .mode        = 0644,
> >>> +        .proc_handler    = proc_dointvec_minmax,
> >>> +        .extra1        = SYSCTL_ONE,
> >>> +    },
> >>>       {
> >>>           .procname    = "max_dst_opts_number",
> >>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
> >>
> >> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
> >>
> >> +1000 on the need, but NAK on the way it is done. IMO, we don't want
> >> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
> >> about time) this series [1] to enforce ordering and occurrences of
> >> Extension Headers, which is based on an IETF draft [2] (FYI,
> >> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
> >> occurrences in this code path too, instead of relying on a sysctl.
> >> Let's keep both code paths consistent.
>
> Hi Daniel,
>
> > Hm, that series [1] should probably go to net instead of net-next, but atm
>
> +1, would make sense.
>
> > hasn't moved since a month. I'd still think max_ext_hdrs_number would be
> > useful given it has less complexity also for stable, but I guess ultimately
> > up to maintainers..
>
> In the short term, I agree. What worries me is that we end up with a
> redundant, or even useless, sysctl once the other series is applied,
> which will only increase user confusion.

Given the amount of bugs in this code, a sysctl is safe and quite reasonable.

No one will object when it is eventually removed (or has no action)

For the record, I approve Daniel's patch.

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-18 12:59 UTC (permalink / raw)
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <b57f31a2-456e-4727-839a-bc2f0fb07855@gmail.com>

On 4/18/26 2:50 PM, Justin Iurman wrote:
> On 4/18/26 14:26, Daniel Borkmann wrote:
>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>> [...]
>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>> index d2cd33e2698d..93f865545a7c 100644
>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>           .extra1        = SYSCTL_ZERO,
>>>>           .extra2        = &flowlabel_reflect_max,
>>>>       },
>>>> +    {
>>>> +        .procname    = "max_ext_hdrs_number",
>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>> +        .maxlen        = sizeof(int),
>>>> +        .mode        = 0644,
>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>> +        .extra1        = SYSCTL_ONE,
>>>> +    },
>>>>       {
>>>>           .procname    = "max_dst_opts_number",
>>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>
>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>
>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want yet-another-sysctl for that. Instead, we have (well, not yet, but it's about time) this series [1] to enforce ordering and occurrences of Extension Headers, which is based on an IETF draft [2] (FYI, draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and occurrences in this code path too, instead of relying on a sysctl. Let's keep both code paths consistent.
> 
> Hi Daniel,
> 
>> Hm, that series [1] should probably go to net instead of net-next, but atm
> 
> +1, would make sense.
> 
>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>> useful given it has less complexity also for stable, but I guess ultimately
>> up to maintainers..
> 
> In the short term, I agree. What worries me is that we end up with a redundant, or even useless, sysctl once the other series is applied, which will only increase user confusion.
I'm thinking even if that series lands, and there is still odd hw out there
where the enforcement of ordering is not in place, and users might be forced
to disable net.ipv6.enforce_ext_hdr_order, then the limit would still apply
and protect them.

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 12:50 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <ae053593-907e-4891-90fb-03b4c5d8f5e1@iogearbox.net>

On 4/18/26 14:26, Daniel Borkmann wrote:
> Hi Justin,
> 
> On 4/18/26 1:45 PM, Justin Iurman wrote:
>> On 4/17/26 19:18, Daniel Borkmann wrote:
> [...]
>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>> index d2cd33e2698d..93f865545a7c 100644
>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>           .extra1        = SYSCTL_ZERO,
>>>           .extra2        = &flowlabel_reflect_max,
>>>       },
>>> +    {
>>> +        .procname    = "max_ext_hdrs_number",
>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>> +        .maxlen        = sizeof(int),
>>> +        .mode        = 0644,
>>> +        .proc_handler    = proc_dointvec_minmax,
>>> +        .extra1        = SYSCTL_ONE,
>>> +    },
>>>       {
>>>           .procname    = "max_dst_opts_number",
>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>
>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>
>> +1000 on the need, but NAK on the way it is done. IMO, we don't want 
>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's 
>> about time) this series [1] to enforce ordering and occurrences of 
>> Extension Headers, which is based on an IETF draft [2] (FYI, 
>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and 
>> occurrences in this code path too, instead of relying on a sysctl. 
>> Let's keep both code paths consistent.

Hi Daniel,

> Hm, that series [1] should probably go to net instead of net-next, but atm

+1, would make sense.

> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
> useful given it has less complexity also for stable, but I guess ultimately
> up to maintainers..

In the short term, I agree. What worries me is that we end up with a 
redundant, or even useless, sysctl once the other series is applied, 
which will only increase user confusion.

Cheers,
Justin

> Thanks,
> Daniel
> 
>>   [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
>>   [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/
> 


^ permalink raw reply
