Netdev List
* [RFC PATCH net] mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient
From: Li Xiasong @ 2026-04-18 10:00 UTC
  To: Matthieu Baerts, Mat Martineau, Geliang Tang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, mptcp, linux-kernel, yuehaibing, zhangchangzhong,
	weiyongjun1

When TCP option space is insufficient (e.g., IPv6 with tcp_timestamps
enabled), the original code jumped to out_unlock without clearing the
addr_signal flag. This caused mptcp_pm_add_timer to keep rescheduling
indefinitely without sending ADD_ADDR, preventing the endpoint list from
being traversed.

In a pure ACK scenario (indicated by drop_other_suboptions=true), if
the option space is insufficient to carry the ADD_ADDR suboption, it
is appropriate to drop this address signal to allow the timer handler
to move on to other addresses.

Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Signed-off-by: Li Xiasong <lixiasong1@huawei.com>
---

Seeking feedback on:

When announcing addresses to the peer, MPTCP sends a pure ACK packet
to carry MPTCP options (ADD_ADDR). In this scenario, if the option space
is insufficient for ADD_ADDR, clearing addr_signal would:

  - Prevent the timer from retrying infinitely
  - Allow the timer to continue traversing and processing other addresses
  - Not block other subflow creation or address announcement operations

Is there any scenario where we should retry later instead of clearing
the address signal/echo flag? Note that if a pure ACK doesn't have
enough space for the ADD_ADDR suboption, subsequent packets won't either.
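
To make the proposed behaviour concrete, here is a minimal userspace
sketch of the decision, not the kernel code. struct demo_pm,
demo_add_addr_signal() and the DEMO_* bit positions are hypothetical
stand-ins; only the shape of the logic follows the diff below: on a pure
ACK that cannot fit the suboption the pending bit is cleared, otherwise
it is left set for a later packet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, standing in for MPTCP_ADD_ADDR_SIGNAL/ECHO. */
enum { DEMO_ADD_ADDR_SIGNAL, DEMO_ADD_ADDR_ECHO };

/* Hypothetical stand-in for the path-manager state. */
struct demo_pm {
	uint8_t addr_signal;
};

/*
 * Returns true when the suboption fits and should be written.
 * On a pure ACK that cannot fit it, the pending bit is cleared anyway,
 * so a retransmit timer would not keep retrying the same address.
 */
static bool demo_add_addr_signal(struct demo_pm *pm, bool echo,
				 unsigned int remaining, unsigned int needed,
				 bool pure_ack)
{
	unsigned int bit = echo ? DEMO_ADD_ADDR_ECHO : DEMO_ADD_ADDR_SIGNAL;
	uint8_t cleared = (uint8_t)(pm->addr_signal & ~(1u << bit));

	if (remaining < needed) {
		if (pure_ack)
			pm->addr_signal = cleared;	/* give up on this one */
		return false;
	}

	pm->addr_signal = cleared;			/* consumed normally */
	return true;
}
```

With remaining=10 and needed=20, the bit stays set when pure_ack is
false and is cleared when pure_ack is true; in both cases nothing is
sent.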

---
 net/mptcp/pm.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 57a456690406..1d49779c6a1f 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -881,19 +881,18 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
 	}
 
 	*echo = mptcp_pm_should_add_signal_echo(msk);
+	add_addr = msk->pm.addr_signal &
+		~(*echo ? BIT(MPTCP_ADD_ADDR_ECHO) : BIT(MPTCP_ADD_ADDR_SIGNAL));
 	port = !!(*echo ? msk->pm.remote.port : msk->pm.local.port);
-
 	family = *echo ? msk->pm.remote.family : msk->pm.local.family;
-	if (remaining < mptcp_add_addr_len(family, *echo, port))
-		goto out_unlock;
 
-	if (*echo) {
-		*addr = msk->pm.remote;
-		add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_ECHO);
-	} else {
-		*addr = msk->pm.local;
-		add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_SIGNAL);
+	if (remaining < mptcp_add_addr_len(family, *echo, port)) {
+		if (*drop_other_suboptions)
+			WRITE_ONCE(msk->pm.addr_signal, add_addr);
+		goto out_unlock;
 	}
+
+	*addr = *echo ? msk->pm.remote : msk->pm.local;
 	WRITE_ONCE(msk->pm.addr_signal, add_addr);
 	ret = true;
 
-- 
2.34.1



* Re: [PATCH 1/4 nf] netfilter: nft_exthdr: skip SCTP chunk evaluation for non-first fragments
From: Fernando Fernandez Mancera @ 2026-04-18  9:51 UTC
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netdev, coreteam, fw, phil
In-Reply-To: <aeM3gmXM43beA3ot@chamomile>

On 4/18/26 9:49 AM, Pablo Neira Ayuso wrote:
> Hi Fernando,
> 
> On Fri, Apr 17, 2026 at 08:34:30PM +0200, Fernando Fernandez Mancera wrote:
>> The SCTP chunk matching logic in nft_exthdr relies on SCTP common header
>> being present at the transport header offset. For fragmented packets at
>> IP level, only the first fragment would match this condition.
>>
>> The nft_exthdr could be used in a PREROUTING chain with a priority lower
>> than -400. This would bypass defragmentation. In addition, it can be used
>> in stateless environments so it should work in an environment where
>> defragmentation is not being performed at all.
> 
> Yes, and stateless filtering is still a valid configuration, i.e.,
> nf_conntrack is not loaded.
> 
>> Add a check for pkt->fragoff to ensure exthdr SCTP only evaluates
>> unfragmented packets or the first fragment in the stream.
> 
> I would suggest squashing the three small patches that check for
> pkt->fragoff into one patch. The three expressions have already been
> around for a while, and since it is basically the same logical change
> in each, backporting the combined patch should be easy.
> 

Hi Pablo,

Thanks for the review! I am not sure about squashing them, as they all
have different blamed commits. I find accurate Fixes tags quite useful
when handling backports, and I guess others do too (also for stable
kernels). Is that convincing?

Anyway, it is not a big deal; if there is a strong preference, I will squash them.

Thanks,
Fernando.

> Thanks!
> 
>> Fixes: 133dc203d77d ("netfilter: nft_exthdr: Support SCTP chunks")
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>> ---
>>   net/netfilter/nft_exthdr.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
>> index 7eedf4e3ae9c..8eb708bb8cff 100644
>> --- a/net/netfilter/nft_exthdr.c
>> +++ b/net/netfilter/nft_exthdr.c
>> @@ -376,7 +376,7 @@ static void nft_exthdr_sctp_eval(const struct nft_expr *expr,
>>   	const struct sctp_chunkhdr *sch;
>>   	struct sctp_chunkhdr _sch;
>>   
>> -	if (pkt->tprot != IPPROTO_SCTP)
>> +	if (pkt->tprot != IPPROTO_SCTP || pkt->fragoff)
>>   		goto err;
>>   
>>   	do {
>> -- 
>> 2.53.0
>>



* Re: [PATCH net] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-18 10:59 UTC
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260417220358.693101-1-daniel@iogearbox.net>

On 4/18/26 00:03, Daniel Borkmann wrote:
> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
> and applied them in ip6_parse_tlv(), the generic TLV walker
> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
> 
> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
> loop is bounded only by optlen, which can be up to 2048 bytes.
> Stuffing the Destination Options header with 2046 Pad1 (type=0)
> entries advances the scanner a single byte at a time, yielding
> ~2000 TLV iterations per extension header.
> 
> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
> the semantics from 47d3d7ac656a.
> 
> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   net/ipv6/ip6_tunnel.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 907c6a2af331..0ab76f93c136 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   				break;
>   		}
>   		if (nexthdr == NEXTHDR_DEST) {
> +			int tlv_max = READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt);
> +			int tlv_cnt = 0;
>   			u16 i = 2;
>   
>   			while (1) {
>   				struct ipv6_tlv_tnl_enc_lim *tel;
>   
> +				if (unlikely(tlv_cnt++ >= tlv_max))
> +					break;
> +
>   				/* No more room for encapsulation limit */
>   				if (i + sizeof(*tel) > optlen)
>   					break;

Good point on reusing max_dst_opts_cnt in ip6_tnl_parse_tlv_enc_lim(), 
but this patch is not ready yet.

We need to be careful: max_dst_opts_cnt can be negative. If this is the 
case, ip6_tnl_parse_tlv_enc_lim() would probably return 0, which is not 
what we want here. From the doc:

max_dst_opts_number - INTEGER
         Maximum number of non-padding TLVs allowed in a Destination
         options extension header. If this value is less than zero
         then unknown options are disallowed and the number of known
         TLVs allowed is the absolute value of this number.

         Default: 8

Since ip6_tnl_parse_tlv_enc_lim() does not check for specific option 
types (e.g., Pad1, PadN, you-name-it) and does not differentiate known 
from unknown options during parsing, I would simply use the absolute 
value of max_dst_opts_cnt by default.

Also, I wouldn't use unlikely() because it could harm us more than it 
helps in this specific context (consistent with ip6_parse_tlv()).
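
To illustrate what I mean, here is a rough userspace sketch of a
bounded scanner, not a drop-in for ip6_tnl_parse_tlv_enc_lim(). The
demo_* names are made up for the example, and the buffer layout
(two header bytes, then TLVs) is an assumption. The option type value 4
is the tunnel encapsulation limit from RFC 2473. Every iteration, Pad1
included, counts against abs(max_cnt):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_TLV_PAD1		0	/* one-byte padding option */
#define DEMO_TLV_ENCAP_LIMIT	4	/* tunnel encapsulation limit, RFC 2473 */

/*
 * Returns the offset of the encapsulation limit TLV inside hdr, or 0 if
 * it was not found within abs(max_cnt) iterations. Offset 0 is never a
 * valid hit because scanning starts at offset 2.
 */
static size_t demo_find_encap_limit(const uint8_t *hdr, size_t optlen,
				    int max_cnt)
{
	int tlv_max = abs(max_cnt);	/* negative values use the magnitude */
	int tlv_cnt = 0;
	size_t i = 2;			/* skip next-header and length bytes */

	while (i + 3 <= optlen) {	/* room for type, length, one value byte */
		if (tlv_cnt++ >= tlv_max)
			break;
		if (hdr[i] == DEMO_TLV_ENCAP_LIMIT && hdr[i + 1] == 1)
			return i;
		if (hdr[i] == DEMO_TLV_PAD1)
			i++;				/* Pad1 has no length byte */
		else
			i += (size_t)hdr[i + 1] + 2;	/* type + length + value */
	}
	return 0;
}
```

With three Pad1 bytes in front of the option, a limit of 8 or -8 finds
it, while a limit of 3 or -3 gives up first.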


* [PATCH net] net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()
From: Bingquan Chen @ 2026-04-18 11:20 UTC
  To: Willem de Bruijn, Greg KH
  Cc: Stephen Hemminger, security, David S . Miller, Jakub Kicinski,
	Eric Dumazet, netdev, Bingquan Chen
In-Reply-To: <2026041858-estimator-shower-0f16@gregkh>

In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
directly into the mmap'd TX ring buffer shared with userspace. The
kernel validates the header via __packet_snd_vnet_parse() but then
re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
userspace thread can modify the vnet_hdr fields between validation
and use, bypassing all safety checks.

The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
to a stack-local variable. All other vnet_hdr consumers in the kernel
(tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
path is the only caller of virtio_net_hdr_to_skb() that reads directly
from user-controlled shared memory.

Fix this by copying vnet_hdr from the mmap'd ring buffer to a
stack-local variable before validation and use, consistent with the
approach used in packet_snd() and all other callers.
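
The snapshot-then-validate pattern can be sketched in plain userspace C
as follows. This is only an illustration under assumptions: struct
demo_hdr, demo_hdr_check() and demo_snapshot_hdr() are hypothetical and
do not mirror struct virtio_net_hdr or __packet_snd_vnet_parse(). The
point is that the shared memory is read exactly once, and both the
check and the later use see the same bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header that lives in memory another thread can write. */
struct demo_hdr {
	uint16_t hdr_len;
	uint16_t total_len;
};

/* Hypothetical sanity check: 0 when the lengths are consistent. */
static int demo_hdr_check(const struct demo_hdr *h, size_t frame_len)
{
	return (h->hdr_len <= h->total_len && h->total_len <= frame_len) ? 0 : -1;
}

/*
 * Copy the header out of the shared area first, then validate the copy.
 * The caller must only use *out afterwards, never the shared area.
 */
static int demo_snapshot_hdr(const void *shared, size_t frame_len,
			     struct demo_hdr *out)
{
	struct demo_hdr snap;

	memcpy(&snap, shared, sizeof(snap));	/* single read of shared data */
	if (demo_hdr_check(&snap, frame_len))
		return -1;
	*out = snap;
	return 0;
}
```

A later write to the shared area no longer affects the validated copy
held in *out.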

Fixes: 1d036d25e560 ("packet: tpacket_snd gso and checksum offload")
Signed-off-by: Bingquan Chen <patzilla007@gmail.com>
---
 net/packet/af_packet.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4b043241fd56..8e6f3a734ba0 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2718,7 +2718,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 {
 	struct sk_buff *skb = NULL;
 	struct net_device *dev;
-	struct virtio_net_hdr *vnet_hdr = NULL;
+	struct virtio_net_hdr vnet_hdr;
+	bool has_vnet_hdr = false;
 	struct sockcm_cookie sockc;
 	__be16 proto;
 	int err, reserve = 0;
@@ -2819,16 +2820,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		hlen = LL_RESERVED_SPACE(dev);
 		tlen = dev->needed_tailroom;
 		if (vnet_hdr_sz) {
-			vnet_hdr = data;
 			data += vnet_hdr_sz;
 			tp_len -= vnet_hdr_sz;
-			if (tp_len < 0 ||
-			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
+			if (tp_len < 0) {
+				tp_len = -EINVAL;
+				goto tpacket_error;
+			}
+			memcpy(&vnet_hdr, data - vnet_hdr_sz, sizeof(vnet_hdr));
+			if (__packet_snd_vnet_parse(&vnet_hdr, tp_len)) {
 				tp_len = -EINVAL;
 				goto tpacket_error;
 			}
 			copylen = __virtio16_to_cpu(vio_le(),
-						    vnet_hdr->hdr_len);
+						    vnet_hdr.hdr_len);
+			has_vnet_hdr = true;
 		}
 		copylen = max_t(int, copylen, dev->hard_header_len);
 		skb = sock_alloc_send_skb(&po->sk,
@@ -2865,12 +2870,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
-		if (vnet_hdr_sz) {
-			if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) {
+		if (has_vnet_hdr) {
+			if (virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le())) {
 				tp_len = -EINVAL;
 				goto tpacket_error;
 			}
-			virtio_net_hdr_set_proto(skb, vnet_hdr);
+			virtio_net_hdr_set_proto(skb, &vnet_hdr);
 		}
 
 		skb->destructor = tpacket_destruct_skb;
-- 
2.53.0



* Re: [RFC PATCH v4 01/19] landlock: Support socket access-control
From: Mikhail Ivanov @ 2026-04-18 11:29 UTC
  To: Günther Noack
  Cc: mic, gnoack, willemdebruijn.kernel, matthieu,
	linux-security-module, netdev, netfilter-devel, yusongping,
	artem.kuzin, konstantin.meskhidze
In-Reply-To: <af464773-b01b-f3a4-474d-0efb2cfae142@huawei-partners.com>

On 11/22/2025 2:13 PM, Mikhail Ivanov wrote:
> On 11/22/2025 1:49 PM, Günther Noack wrote:
>> On Tue, Nov 18, 2025 at 09:46:21PM +0800, Mikhail Ivanov wrote:
>>> +/**
>>> + * struct landlock_socket_attr - Socket protocol definition
>>> + *
>>> + * Argument of sys_landlock_add_rule().
>>> + */
>>> +struct landlock_socket_attr {
>>> +    /**
>>> +     * @allowed_access: Bitmask of allowed access for a socket protocol
>>> +     * (cf. `Socket flags`_).
>>> +     */
>>> +    __u64 allowed_access;
>>> +    /**
>>> +     * @family: Protocol family used for communication
>>> +     * (cf. include/linux/socket.h).
>>> +     */
>>> +    __s32 family;
>>> +    /**
>>> +     * @type: Socket type (cf. include/linux/net.h)
>>> +     */
>>> +    __s32 type;
>>> +    /**
>>> +     * @protocol: Communication protocol specific to protocol family set in
>>> +     * @family field.
>>
>> This is specific to both the @family and the @type, not just the @family.
>>
>> From socket(2):
>>
>>    Normally only a single protocol exists to support a particular
>>    socket type within a given protocol family.
>>
>> For instance, in your commit message above the protocol in the example
>> is IPPROTO_TCP, which would imply the type SOCK_STREAM, but not work
>> with SOCK_DGRAM.
> 
> You're right.
> 

I re-read the socket(2) semantics, and this part is about how the kernel
maps (family, type, 0) to the default protocol of the given family and
type. E.g., (AF_INET, SOCK_STREAM, 0) is mapped to (AF_INET, SOCK_STREAM,
IPPROTO_TCP). I would like to clarify in the landlock_socket_attr.protocol
field doc that such a mapping takes place.

There should be a list of protocols defined per protocol family. From
socket(2):
	The domain argument specifies a communication domain.
	...
	The protocol number to use is specific to the “communication
	domain” in which communication is to take place.

Such a mapping allows strange socket rules to be defined when @type is
set to -1. For example:
	struct landlock_socket_attr attr = {
		.family = AF_INET,
		.type = -1,
		.protocol = 0,
	};

This definition corresponds to (AF_INET, SOCK_STREAM, 0->IPPROTO_TCP)
and to (AF_INET, SOCK_DGRAM, 0->IPPROTO_UDP).

I don't see this as a bad thing as long as there is proper documentation
for landlock_socket_attr.


* [PATCH net v2] vxlan: fix NULL vn6_sock dereference in vxlan_igmp_join() and vxlan_igmp_leave()
From: Weiming Shi @ 2026-04-18 11:41 UTC
  To: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Roopa Prabhu, netdev, Xiang Mei, Weiming Shi

vxlan_sock_add() tolerates IPv6 socket creation failure with
-EAFNOSUPPORT (e.g. ipv6.disable=1), leaving vn6_sock as NULL while
successfully creating vn4_sock. vxlan_igmp_join() and
vxlan_igmp_leave() then crash when they dereference the NULL vn6_sock
for VNI filter entries with IPv6 multicast groups:

 Oops: general protection fault, probably for non-canonical address
      0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 RIP: 0010:vxlan_igmp_join (drivers/net/vxlan/vxlan_multicast.c:40)
 Call Trace:
  vxlan_multicast_join (drivers/net/vxlan/vxlan_multicast.c:195)
  vxlan_open (drivers/net/vxlan/vxlan_core.c:2965)
  __dev_open (net/core/dev.c:1704)
  __dev_change_flags (net/core/dev.c:9781)
  do_setlink.isra.0 (net/core/rtnetlink.c:3180)
  rtnl_newlink (net/core/rtnetlink.c:4238)
  rtnetlink_rcv_msg (net/core/rtnetlink.c:6921)

Skip the IPv6 multicast join/leave when vn6_sock is NULL, consistent
with how vxlan_sock_add() tolerates missing IPv6 support.

Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
v2:
  - drop sock4 NULL checks 

 drivers/net/vxlan/vxlan_multicast.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_multicast.c b/drivers/net/vxlan/vxlan_multicast.c
index a7f2d67dc61b..e6aa5ab1c939 100644
--- a/drivers/net/vxlan/vxlan_multicast.c
+++ b/drivers/net/vxlan/vxlan_multicast.c
@@ -37,6 +37,9 @@ int vxlan_igmp_join(struct vxlan_dev *vxlan, union vxlan_addr *rip,
 	} else {
 		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
 
+		if (!sock6)
+			return 0;
+
 		sk = sock6->sock->sk;
 		lock_sock(sk);
 		ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex,
@@ -71,6 +74,9 @@ int vxlan_igmp_leave(struct vxlan_dev *vxlan, union vxlan_addr *rip,
 	} else {
 		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
 
+		if (!sock6)
+			return 0;
+
 		sk = sock6->sock->sk;
 		lock_sock(sk);
 		ret = ipv6_stub->ipv6_sock_mc_drop(sk, ifindex,
-- 
2.43.0



* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 11:45 UTC
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260417171831.687053-1-daniel@iogearbox.net>

On 4/17/26 19:18, Daniel Borkmann wrote:
> ipv6_{skip_exthdr,find_hdr}() and ip6_tnl_parse_tlv_enc_lim() iterate
> over IPv6 extension headers until they find a non-extension-header
> protocol or run out of packet data. The loops have no iteration counter,
> relying solely on the packet length to bound them. For a crafted packet
> with 8-byte extension headers filling a 64KB jumbogram, this means a
> worst case of up to ~8k iterations with a skb_header_pointer call each.
> ipv6_skip_exthdr(), for example, is used where it parses the inner
> quoted packet inside an incoming ICMPv6 error:
> 
>    - icmpv6_rcv
>      - checksum validation
>      - case ICMPV6_DEST_UNREACH
>        - icmpv6_notify
>          - pskb_may_pull()       <- pull inner IPv6 header
>          - ipv6_skip_exthdr()    <- iterates here
>          - pskb_may_pull()
>          - ipprot->err_handler() <- sk lookup (matching sk not required)
> 
> The per-iteration cost of ipv6_skip_exthdr itself is generally light,
> but skb_header_pointer becomes more costly on reassembled packets: the
> first ~1KB of the inner packet are in the skb's linear area, but the
> remaining ~63KB are in the frag_list where skb_copy_bits is needed to
> read data.
> 
> Add a configurable limit via a new sysctl net.ipv6.max_ext_hdrs_number
> (default 32, minimum 1). All three extension header walking functions
> are bound by this limit. The sysctl is in line with commit 47d3d7ac656a
> ("ipv6: Implement limits on Hop-by-Hop and Destination options"). The
> init_net is used since plumbing a struct net * through all helpers
> would touch a lot of callsites.
> 
> There's an ongoing IETF draft-ietf-6man-eh-limits-18 that states that
> 8 extension headers before the transport header is the baseline which
> routers MUST handle; section 7 details also why limits are needed.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   Documentation/networking/ip-sysctl.rst |  7 +++++++
>   include/net/ipv6.h                     |  2 ++
>   include/net/netns/ipv6.h               |  1 +
>   net/ipv6/af_inet6.c                    |  1 +
>   net/ipv6/exthdrs_core.c                | 11 +++++++++++
>   net/ipv6/ip6_tunnel.c                  |  5 +++++
>   net/ipv6/sysctl_net_ipv6.c             |  8 ++++++++
>   7 files changed, 35 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6921d8594b84..4559a956bbd9 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -2503,6 +2503,13 @@ max_hbh_length - INTEGER
>   
>   	Default: INT_MAX (unlimited)
>   
> +max_ext_hdrs_number - INTEGER
> +	Maximum number of IPv6 extension headers allowed in a packet.
> +	Limits how many extension headers will be traversed. The value
> +	is read from the initial netns.
> +
> +	Default: 32
> +
>   skip_notify_on_dev_down - BOOLEAN
>   	Controls whether an RTM_DELROUTE message is generated for routes
>   	removed when a device is taken down or deleted. IPv4 does not
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 53c5056508be..d7f0d55e6918 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -90,6 +90,8 @@ struct ip_tunnel_info;
>   #define IP6_DEFAULT_MAX_DST_OPTS_LEN	 INT_MAX /* No limit */
>   #define IP6_DEFAULT_MAX_HBH_OPTS_LEN	 INT_MAX /* No limit */
>   
> +#define IP6_DEFAULT_MAX_EXT_HDRS_CNT	 32
> +
>   /*
>    *	Addr type
>    *	
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 34bdb1308e8f..5be4dd1c9ae8 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -54,6 +54,7 @@ struct netns_sysctl_ipv6 {
>   	int max_hbh_opts_cnt;
>   	int max_dst_opts_len;
>   	int max_hbh_opts_len;
> +	int max_ext_hdrs_cnt;
>   	int seg6_flowlabel;
>   	u32 ioam6_id;
>   	u64 ioam6_id_wide;
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 4cbd45b68088..ed7fe6e4a6bd 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -965,6 +965,7 @@ static int __net_init inet6_net_init(struct net *net)
>   	net->ipv6.sysctl.flowlabel_state_ranges = 0;
>   	net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT;
>   	net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
> +	net->ipv6.sysctl.max_ext_hdrs_cnt = IP6_DEFAULT_MAX_EXT_HDRS_CNT;
>   	net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
>   	net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
>   	net->ipv6.sysctl.fib_notify_on_flag_change = 0;
> diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
> index 49e31e4ae7b7..917307877cbb 100644
> --- a/net/ipv6/exthdrs_core.c
> +++ b/net/ipv6/exthdrs_core.c
> @@ -4,6 +4,8 @@
>    * not configured or static.
>    */
>   #include <linux/export.h>
> +
> +#include <net/net_namespace.h>
>   #include <net/ipv6.h>
>   
>   /*
> @@ -72,7 +74,9 @@ EXPORT_SYMBOL(ipv6_ext_hdr);
>   int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
>   		     __be16 *frag_offp)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	u8 nexthdr = *nexthdrp;
> +	int exthdr_cnt = 0;
>   
>   	*frag_offp = 0;
>   
> @@ -80,6 +84,8 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
>   		struct ipv6_opt_hdr _hdr, *hp;
>   		int hdrlen;
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			return -1;
>   		if (nexthdr == NEXTHDR_NONE)
>   			return -1;
>   		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
> @@ -188,8 +194,10 @@ EXPORT_SYMBOL_GPL(ipv6_find_tlv);
>   int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>   		  int target, unsigned short *fragoff, int *flags)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
>   	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
> +	int exthdr_cnt = 0;
>   	bool found;
>   
>   	if (fragoff)
> @@ -216,6 +224,9 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>   			return -ENOENT;
>   		}
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			return -EBADMSG;
> +
>   		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
>   		if (!hp)
>   			return -EBADMSG;
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 0b53488a9229..78e849e167ca 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -396,15 +396,20 @@ ip6_tnl_dev_uninit(struct net_device *dev)
>   
>   __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
>   	unsigned int nhoff = raw - skb->data;
>   	unsigned int off = nhoff + sizeof(*ipv6h);
>   	u8 nexthdr = ipv6h->nexthdr;
> +	int exthdr_cnt = 0;
>   
>   	while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
>   		struct ipv6_opt_hdr *hdr;
>   		u16 optlen;
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			break;
> +
>   		if (!pskb_may_pull(skb, off + sizeof(*hdr)))
>   			break;
>   
> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index d2cd33e2698d..93f865545a7c 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>   		.extra1		= SYSCTL_ZERO,
>   		.extra2		= &flowlabel_reflect_max,
>   	},
> +	{
> +		.procname	= "max_ext_hdrs_number",
> +		.data		= &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ONE,
> +	},
>   	{
>   		.procname	= "max_dst_opts_number",
>   		.data		= &init_net.ipv6.sysctl.max_dst_opts_cnt,

NACKed-by: Justin Iurman <justin.iurman@gmail.com>

+1000 on the need, but NAK on the way it is done. IMO, we don't want 
yet-another-sysctl for that. Instead, we have (well, not yet, but it's 
about time) this series [1] to enforce ordering and occurrences of 
Extension Headers, which is based on an IETF draft [2] (FYI, 
draft-ietf-6man-eh-limits is dead). I think we should enforce ordering 
and occurrences in this code path too, instead of relying on a sysctl. 
Let's keep both code paths consistent.

  [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
  [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/


* [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Daniel Borkmann @ 2026-04-18 12:15 UTC
  To: kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	justin.iurman, netdev

Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
and applied them in ip6_parse_tlv(), the generic TLV walker
invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().

ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
loop is bounded only by optlen, which can be up to 2048 bytes.
Stuffing the Destination Options header with 2046 Pad1 (type=0)
entries advances the scanner a single byte at a time, yielding
~2000 TLV iterations per extension header.

Reuse max_dst_opts_cnt to bound the TLV iterations, matching
the semantics from 47d3d7ac656a.

Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 v1->v2:
   - Remove unlikely (Justin)
   - Use abs() given max_dst_opts_cnt's negative meaning (Justin)

 net/ipv6/ip6_tunnel.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 907c6a2af331..0f50b7fcb24e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
 				break;
 		}
 		if (nexthdr == NEXTHDR_DEST) {
+			int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
+			int tlv_cnt = 0;
 			u16 i = 2;
 
 			while (1) {
 				struct ipv6_tlv_tnl_enc_lim *tel;
 
+				if (tlv_cnt++ >= tlv_max)
+					break;
+
 				/* No more room for encapsulation limit */
 				if (i + sizeof(*tel) > optlen)
 					break;
-- 
2.43.0



* Re: [PATCH net] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Daniel Borkmann @ 2026-04-18 12:17 UTC
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <acee197f-1821-4304-8759-a02ac1d5c808@gmail.com>

Hi Justin,

On 4/18/26 12:59 PM, Justin Iurman wrote:
[...]
> Good point on reusing max_dst_opts_cnt in ip6_tnl_parse_tlv_enc_lim(), but this patch is not ready yet.
> 
> We need to be careful: max_dst_opts_cnt can be negative. If this is the case, ip6_tnl_parse_tlv_enc_lim() would probably return 0, which is not what we want here. From the doc:
> 
> max_dst_opts_number - INTEGER
>          Maximum number of non-padding TLVs allowed in a Destination
>          options extension header. If this value is less than zero
>          then unknown options are disallowed and the number of known
>          TLVs allowed is the absolute value of this number.
> 
>          Default: 8
> 
> Since ip6_tnl_parse_tlv_enc_lim() does not check for specific option types (e.g., Pad1, PadN, you-name-it) and does not differentiate known from unknown options during parsing, I would simply use the absolute value of max_dst_opts_cnt by default.
> 
> Also, I wouldn't use unlikely() because it could harm us more than it helps in this specific context (consistent with ip6_parse_tlv()).

Thanks for the review. The suggestions make sense, and I've incorporated them in a v2.

Cheers,
Daniel


* [PATCH] net: hamachi: fix divide by zero in hamachi_init_one
From: Mingyu Wang @ 2026-04-18 12:18 UTC
  To: andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: tglx, mingo, netdev, linux-kernel, Mingyu Wang

During the hardware initialization phase in hamachi_init_one(), the driver
reads the PCIClkMeas register to calculate the PCI bus frequency.

The current code attempts to prevent a divide-by-zero error using a ternary
operator: `i ? 2000/(i&0x7f) : 0`. However, this check is flawed. The highest
bit of `i` (0x80) acts as a ready flag. If unreliable hardware or a malicious
virtual device returns a value where the ready bit is set but the lower 7 bits
are zero (e.g., 0x80), the condition `i` evaluates to true, but `(i & 0x7f)`
evaluates to 0. This results in a fatal divide-by-zero exception.

This bug was discovered during an automated virtual device fuzzing campaign
testing the hardware-software trust boundary. When the device returned 0x80,
the value passed the readiness while-loop but triggered the divide error. In
our tests, this panic interrupted the module loading process, which in turn
triggered a KASAN slab-out-of-bounds report in the module error path and
ultimately led to a multi-core soft lockup and RCU stall.

Fix this by explicitly checking the divisor `(i & 0x7f)` instead of the
entire register value `i` before performing the division.
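
For reference, the guarded computation can be sketched as a standalone
helper. pci_clk_mhz() and DEMO_CLK_COUNT_MASK are hypothetical names
used only for this illustration; the driver keeps the expression inline
in the printk() arguments. The bit layout (bit 7 ready flag, low 7 bits
count) is taken from the description above.

```c
#include <stdint.h>

/* Low 7 bits hold the measured count; bit 7 (0x80) is the ready flag. */
#define DEMO_CLK_COUNT_MASK 0x7f

/*
 * Hypothetical helper: estimate the PCI bus frequency in MHz, or return
 * 0 when the count field is zero, whatever the state of the ready bit.
 */
static int pci_clk_mhz(uint8_t reg)
{
	unsigned int count = reg & DEMO_CLK_COUNT_MASK;

	/* Guard on the actual divisor, not on the whole register value. */
	return count ? (int)(2000 / count) : 0;
}
```

A register value of 0x80 (ready bit set, count zero) now yields 0
instead of dividing by zero, and 0x80 | 60 yields 33.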

Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
---
 drivers/net/ethernet/packetengines/hamachi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/packetengines/hamachi.c b/drivers/net/ethernet/packetengines/hamachi.c
index b0de7e9f12a5..1d7206dd18fd 100644
--- a/drivers/net/ethernet/packetengines/hamachi.c
+++ b/drivers/net/ethernet/packetengines/hamachi.c
@@ -748,7 +748,7 @@ static int hamachi_init_one(struct pci_dev *pdev,
 	printk(KERN_INFO "%s:  %d-bit %d Mhz PCI bus (%d), Virtual Jumpers "
 		   "%2.2x, LPA %4.4x.\n",
 		   dev->name, readw(ioaddr + MiscStatus) & 1 ? 64 : 32,
-		   i ? 2000/(i&0x7f) : 0, i&0x7f, (int)readb(ioaddr + VirtualJumpers),
+		   (i & 0x7f) ? 2000 / (i & 0x7f) : 0, i & 0x7f, (int)readb(ioaddr + VirtualJumpers),
 		   readw(ioaddr + ANLinkPartnerAbility));
 
 	if (chip_tbl[hmp->chip_id].flags & CanHaveMII) {
-- 
2.34.1



* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-18 12:26 UTC
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <60b47924-dae4-4a10-b977-75b92e1094c0@gmail.com>

Hi Justin,

On 4/18/26 1:45 PM, Justin Iurman wrote:
> On 4/17/26 19:18, Daniel Borkmann wrote:
[...]
>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>> index d2cd33e2698d..93f865545a7c 100644
>> --- a/net/ipv6/sysctl_net_ipv6.c
>> +++ b/net/ipv6/sysctl_net_ipv6.c
>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>           .extra1        = SYSCTL_ZERO,
>>           .extra2        = &flowlabel_reflect_max,
>>       },
>> +    {
>> +        .procname    = "max_ext_hdrs_number",
>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>> +        .maxlen        = sizeof(int),
>> +        .mode        = 0644,
>> +        .proc_handler    = proc_dointvec_minmax,
>> +        .extra1        = SYSCTL_ONE,
>> +    },
>>       {
>>           .procname    = "max_dst_opts_number",
>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
> 
> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
> 
> +1000 on the need, but NAK on the way it is done. IMO, we don't want yet-another-sysctl for that. Instead, we have (well, not yet, but it's about time) this series [1] to enforce ordering and occurrences of Extension Headers, which is based on an IETF draft [2] (FYI, draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and occurrences in this code path too, instead of relying on a sysctl. Let's keep both code paths consistent.

Hm, that series [1] should probably go to net instead of net-next, but atm
it hasn't moved in a month. I'd still think max_ext_hdrs_number would be
useful given that it is less complex, also for stable, but I guess that is
ultimately up to the maintainers.

Thanks,
Daniel

>   [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
>   [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/


^ permalink raw reply

* Re: [PATCH iwl-net v3 5/6] ixgbe: fix ITR value overflow in adaptive interrupt throttling
From: Simon Horman @ 2026-04-18 12:26 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260415142841.3222399-6-aleksandr.loktionov@intel.com>

On Wed, Apr 15, 2026 at 04:28:40PM +0200, Aleksandr Loktionov wrote:
> ixgbe_update_itr() packs a mode flag (IXGBE_ITR_ADAPTIVE_LATENCY,
> bit 7) and a usecs delay (bits [6:0]) into an unsigned int, then
> stores the combined value in ring_container->itr which is declared as
> u8.  Values above 0xFF wrap on truncation, corrupting both the delay
> and the mode flag on the next readback.
> 
> Keep the mode bit (IXGBE_ITR_ADAPTIVE_LATENCY) and the usec delay as
> separate operands in the final store expression.  Clamp only the usecs
> portion to [IXGBE_ITR_ADAPTIVE_MIN_USECS, IXGBE_ITR_ADAPTIVE_MAX_USECS]
> using clamp_val() so that:
>  - overflow cannot bleed into the mode bit (bit 7),
>  - the delay cannot exceed 126 us (IXGBE_ITR_ADAPTIVE_MAX_USECS),
>  - the delay cannot drop below 10 us (IXGBE_ITR_ADAPTIVE_MIN_USECS).
> 
> Fixes: b4ded8327fea ("ixgbe: Update adaptive ITR algorithm")
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v2 -> v3:
>  - Use clamp_val() instead of min_t() to also guard the lower bound
>    (IXGBE_ITR_ADAPTIVE_MIN_USECS); keep mode and delay as separate
>    operands until final store; use IXGBE_ITR_ADAPTIVE_MAX_USECS (126)
>    as upper bound instead of IXGBE_ITR_ADAPTIVE_LATENCY - 1 (127)
>    (Simon Horman).

FTR: I think the code would be easier to reason about if
mode and delay were kept separate during the earlier calculation
of itr. But I also think that can be handled as a follow-up,
as this patch does improve things.

Reviewed-by: Simon Horman <horms@kernel.org>
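
As an illustration of the truncation described in the quoted commit message, here is a small userspace sketch. The constant values (bit 7 for the mode flag, 10 and 126 us for the bounds) come from the text above; the macro and function names are hypothetical stand-ins, and the clamp is open-coded rather than using the kernel's clamp_val().

```c
#include <assert.h>
#include <stdint.h>

#define ITR_ADAPTIVE_MIN_USECS	10
#define ITR_ADAPTIVE_MAX_USECS	126
#define ITR_ADAPTIVE_LATENCY	0x80	/* mode flag, bit 7 */

/* Naive store: the combined value is truncated when narrowed to 8 bits. */
uint8_t itr_store_truncating(unsigned int mode, unsigned int usecs)
{
	return (uint8_t)(mode | usecs);
}

/* Clamp only the delay, then OR in the mode bit as a separate operand. */
uint8_t itr_store_clamped(unsigned int mode, unsigned int usecs)
{
	if (usecs < ITR_ADAPTIVE_MIN_USECS)
		usecs = ITR_ADAPTIVE_MIN_USECS;
	if (usecs > ITR_ADAPTIVE_MAX_USECS)
		usecs = ITR_ADAPTIVE_MAX_USECS;

	return (uint8_t)((mode & ITR_ADAPTIVE_LATENCY) | usecs);
}

void check_itr_store(void)
{
	/* 384 = 0x180: truncation sets bit 7 although no mode was requested. */
	assert(itr_store_truncating(0, 384) == 0x80);
	/* Clamped: delay saturates at 126 and the mode bit stays clear. */
	assert(itr_store_clamped(0, 384) == 126);
	/* The lower bound applies too, with the mode bit preserved. */
	assert(itr_store_clamped(ITR_ADAPTIVE_LATENCY, 3) == 0x8a);
}
```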

^ permalink raw reply

* Re: [PATCH iwl-net v3 3/6] ixgbe: call ixgbe_setup_fc() before fc_enable() after NVM update
From: Simon Horman @ 2026-04-18 12:28 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260415142841.3222399-4-aleksandr.loktionov@intel.com>

On Wed, Apr 15, 2026 at 04:28:38PM +0200, Aleksandr Loktionov wrote:
> During an NVM update the PHY reset clears the Technology Ability Field
> (IEEE 802.3 clause 37 register 7.10) back to hardware defaults.  When
> the driver subsequently calls only hw->mac.ops.fc_enable() the SRRCTL
> register is recalculated from stale autonegotiated capability bits,
> which the MDD (Malicious Driver Detect) logic treats as an invalid
> change and halts traffic on the PF.
> 
> Fix by calling ixgbe_setup_fc() immediately before fc_enable() in
> ixgbe_watchdog_update_link() so that flow-control autoneg and the PHY
> registers are re-programmed in the correct order after any reset.
> 
> Skip setup_fc() on backplane links: on 82599 backplane interfaces
> setup_fc() resolves to prot_autoc_write() ->
> ixgbe_reset_pipeline_82599() which toggles IXGBE_AUTOC_AN_RESTART.
> Calling it unconditionally on link-up creates an infinite link-flap
> loop because each AN-restart triggers another link-up event.  Guard
> with a get_media_type() check and skip setup_fc() when the media type
> is ixgbe_media_type_backplane; fc_enable() is still called.
> 
> Also handle the failure path: if setup_fc() returns an error its output
> is invalid and calling fc_enable() on the unchanged hardware state would
> repeat the exact MDD-triggering condition the fix is meant to prevent.
> Skip fc_enable() in that case while still calling
> ixgbe_set_rx_drop_en() which configures the independent RX-drop
> behaviour.
> 
> Fixes: 93c52dd0033b ("ixgbe: Merge watchdog functionality into service task")
> Suggested-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v2 -> v3:
>  - Skip setup_fc() for ixgbe_media_type_backplane: unconditional call on
>    82599 backplane links triggers prot_autoc_write() ->
>    ixgbe_reset_pipeline_82599() -> IXGBE_AUTOC_AN_RESTART, causing an
>    infinite link-flap loop (Simon Horman).

Reviewed-by: Simon Horman <horms@kernel.org>

(Unsurprisingly) Sashiko has a number of things to say about this patchset.
But I believe they can all be analysed as part of follow-up work: no need
to block progress of this patchset IMHO.

^ permalink raw reply

* Re: [PATCH net v2] slip: reject VJ receive packets on instances with no rstate array
From: Simon Horman @ 2026-04-18 12:39 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Xiang Mei
In-Reply-To: <20260415204130.258866-2-bestswngs@gmail.com>

On Thu, Apr 16, 2026 at 04:41:31AM +0800, Weiming Shi wrote:
> slhc_init() accepts rslots == 0 as a valid configuration, with the
> documented meaning of 'no receive compression'. In that case the
> allocation loop in slhc_init() is skipped, so comp->rstate stays
> NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
> slcompress).
> 
> The receive helpers do not defend against that configuration.
> slhc_uncompress() dereferences comp->rstate[x] when the VJ header
> carries an explicit connection ID, and slhc_remember() later assigns
> cs = &comp->rstate[...] after only comparing the packet's slot number
> to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
> range check, and the code dereferences a NULL rstate.
> 
> The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
> stores its argument in a signed int, and (val >> 16) uses arithmetic
> shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
> is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
> /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
> is reachable from an unprivileged user namespace. Once the malformed
> VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
> frame that selects slot 0 crashes the kernel in softirq context:
> 
>  Oops: general protection fault, probably for non-canonical
>        address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>  RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
>  Call Trace:
>   <TASK>
>   ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
>   ppp_input (drivers/net/ppp/ppp_generic.c:2359)
>   ppp_async_process (drivers/net/ppp/ppp_async.c:492)
>   tasklet_action_common (kernel/softirq.c:926)
>   handle_softirqs (kernel/softirq.c:623)
>   run_ksoftirqd (kernel/softirq.c:1055)
>   smpboot_thread_fn (kernel/smpboot.c:160)
>   kthread (kernel/kthread.c:436)
>   ret_from_fork (arch/x86/kernel/process.c:164)
>   </TASK>
> 
> Reject the receive side on such instances instead of touching rstate.
> slhc_uncompress() falls through to its existing 'bad' label, which
> bumps sls_i_error and enters the toss state. slhc_remember() mirrors
> that with an explicit sls_i_error increment followed by slhc_toss();
> the sls_i_runt counter is not used here because a missing rstate is
> an internal configuration state, not a runt packet.
> 
> The transmit path is unaffected: the only in-tree caller that picks
> rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
> slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
> and slhc_compress() continues to work.
> 
> Fixes: b5451d783ade ("slip: Move the SLIP drivers")

AI review points out that the cited commit moves code but doesn't
add this bug.

It seems to me that this bug has existed since the beginning of git
history. If so, the Fixes tag should be:

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
> v2:
> - slhc_remember(): use sls_i_error instead of sls_i_runt for the
>   missing-rstate case; it is a configuration error, not a runt packet
>   (Simon).
> - slhc_uncompress(): goto bad instead of returning 0, so the instance
>   also enters SLF_TOSS on the first rejected frame.

Otherwise this looks good to me:

Reviewed-by: Simon Horman <horms@kernel.org>


I do note that Sashiko flags some other problems in this code.
I do not think that needs to delay progress of this patch.
But you may wish to look into them as follow-up work.
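
To make the arithmetic in the quoted commit message concrete, here is a userspace sketch of how an argument of 0xffff0000 can end up as slhc_init(0, 1), and why a range check alone lets slot 0 through. It paraphrases the description above and is not the kernel code: split_maxcid(), struct vj_state and the two slot helpers are hypothetical names, and the sketch assumes a 32-bit int and a compiler that shifts negative values arithmetically (as GCC and Clang do).

```c
#include <assert.h>
#include <stddef.h>

/* Split a PPPIOCSMAXCID-style argument into receive and transmit slots. */
void split_maxcid(int val, int *rslots, int *tslots)
{
	int val2 = 15;

	if ((val >> 16) != 0) {
		val2 = val >> 16;	/* arithmetic shift keeps the sign */
		val &= 0xffff;
	}
	*rslots = val2 + 1;
	*tslots = val + 1;
}

/* Minimal stand-in for the receive-side state discussed above. */
struct vj_state {
	void *rstate;		/* NULL when rslots == 0 */
	int rslot_limit;	/* stays 0 in that case */
};

/* Range check only: slot 0 passes even without an rstate array. */
int slot_in_range(const struct vj_state *c, int slot)
{
	return slot >= 0 && slot <= c->rslot_limit;
}

/* Guarded check: also require that the rstate array exists. */
int slot_usable(const struct vj_state *c, int slot)
{
	return c->rstate != NULL && slot_in_range(c, slot);
}

void check_vj_slots(void)
{
	struct vj_state c = { NULL, 0 };
	int rslots, tslots;

	/* -65536 is the bit pattern 0xffff0000 in a 32-bit int. */
	split_maxcid(-65536, &rslots, &tslots);
	assert(rslots == 0 && tslots == 1);

	assert(slot_in_range(&c, 0));
	assert(!slot_usable(&c, 0));
}
```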

^ permalink raw reply

* Re: [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-18 12:40 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260418121538.706095-1-daniel@iogearbox.net>

On 4/18/26 14:15, Daniel Borkmann wrote:
> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
> and applied them in ip6_parse_tlv(), the generic TLV walker
> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
> 
> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
> loop is bounded only by optlen, which can be up to 2048 bytes.
> Stuffing the Destination Options header with 2046 Pad1 (type=0)
> entries advances the scanner a single byte at a time, yielding
> ~2000 TLV iterations per extension header.
> 
> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
> the semantics from 47d3d7ac656a.
> 
> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   v1->v2:
>     - Remove unlikely (Justin)
>     - Use abs() given max_dst_opts_cnt's negative meaning (Justin)
> 
>   net/ipv6/ip6_tunnel.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 907c6a2af331..0f50b7fcb24e 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   				break;
>   		}
>   		if (nexthdr == NEXTHDR_DEST) {
> +			int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
> +			int tlv_cnt = 0;
>   			u16 i = 2;
>   
>   			while (1) {
>   				struct ipv6_tlv_tnl_enc_lim *tel;
>   
> +				if (tlv_cnt++ >= tlv_max)
> +					break;
> +
>   				/* No more room for encapsulation limit */
>   				if (i + sizeof(*tel) > optlen)
>   					break;

Thanks for v2, Daniel.

I'm still wondering: should we align the above parsing behavior with the 
one in ip6_parse_tlv() to keep things consistent? That is: don't 
increment tlv_cnt for Pad1/PadN, make sure we don't exceed 8 bytes per 
padding (consecutive Pad1's, or a PadN), and we could also check that a 
PadN payload is only made of zeroes. Open question...

Otherwise, LGTM:
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
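
For illustration, here is a simplified userspace sketch of the iteration bound discussed above. It is not the kernel scanner: walk_tlvs() is a hypothetical helper that only counts loop iterations, and the limit of 8 in the check is chosen for the example. Like the patch as posted, it counts Pad1 entries toward the limit, which is the behaviour the open question above is about.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define TLV_PAD1	0

/* Walk opts[0..len) and return how many TLV iterations were performed. */
size_t walk_tlvs(const uint8_t *opts, size_t len, int tlv_max)
{
	size_t i = 0, iterations = 0;
	int tlv_cnt = 0;

	while (i < len) {
		if (tlv_cnt++ >= tlv_max)
			break;
		iterations++;

		if (opts[i] == TLV_PAD1) {
			i += 1;			/* Pad1 is a single byte */
		} else {
			if (i + 2 > len)
				break;
			i += 2 + opts[i + 1];	/* type, length, payload */
		}
	}
	return iterations;
}

void check_walk_tlvs(void)
{
	static const uint8_t pad1[2046] = { 0 };	/* 2046 Pad1 options */

	/* Effectively unbounded: one iteration per byte. */
	assert(walk_tlvs(pad1, sizeof(pad1), INT_MAX) == 2046);
	/* Bounded: the walk stops after tlv_max iterations. */
	assert(walk_tlvs(pad1, sizeof(pad1), 8) == 8);
}
```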

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 12:50 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <ae053593-907e-4891-90fb-03b4c5d8f5e1@iogearbox.net>

On 4/18/26 14:26, Daniel Borkmann wrote:
> Hi Justin,
> 
> On 4/18/26 1:45 PM, Justin Iurman wrote:
>> On 4/17/26 19:18, Daniel Borkmann wrote:
> [...]
>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>> index d2cd33e2698d..93f865545a7c 100644
>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>           .extra1        = SYSCTL_ZERO,
>>>           .extra2        = &flowlabel_reflect_max,
>>>       },
>>> +    {
>>> +        .procname    = "max_ext_hdrs_number",
>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>> +        .maxlen        = sizeof(int),
>>> +        .mode        = 0644,
>>> +        .proc_handler    = proc_dointvec_minmax,
>>> +        .extra1        = SYSCTL_ONE,
>>> +    },
>>>       {
>>>           .procname    = "max_dst_opts_number",
>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>
>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>
>> +1000 on the need, but NAK on the way it is done. IMO, we don't want 
>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's 
>> about time) this series [1] to enforce ordering and occurrences of 
>> Extension Headers, which is based on an IETF draft [2] (FYI,
>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
>> occurrences in this code path too, instead of relying on a sysctl. 
>> Let's keep both code paths consistent.

Hi Daniel,

> Hm, that series [1] should probably go to net instead of net-next, but atm

+1, would make sense.

> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
> useful given it has less complexity also for stable, but I guess ultimately
> up to maintainers..

In the short term, I agree. What worries me is that we end up with a 
redundant, or even useless, sysctl once the other series is applied, 
which will only increase user confusion.

Cheers,
Justin

> Thanks,
> Daniel
> 
>>   [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
>>   [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/
> 


^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-18 12:59 UTC (permalink / raw)
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <b57f31a2-456e-4727-839a-bc2f0fb07855@gmail.com>

On 4/18/26 2:50 PM, Justin Iurman wrote:
> On 4/18/26 14:26, Daniel Borkmann wrote:
>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>> [...]
>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>> index d2cd33e2698d..93f865545a7c 100644
>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>           .extra1        = SYSCTL_ZERO,
>>>>           .extra2        = &flowlabel_reflect_max,
>>>>       },
>>>> +    {
>>>> +        .procname    = "max_ext_hdrs_number",
>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>> +        .maxlen        = sizeof(int),
>>>> +        .mode        = 0644,
>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>> +        .extra1        = SYSCTL_ONE,
>>>> +    },
>>>>       {
>>>>           .procname    = "max_dst_opts_number",
>>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>
>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>
>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want yet-another-sysctl for that. Instead, we have (well, not yet, but it's about time) this series [1] to enforce ordering and occurrences of Extension Headers, which is based on an IETF draft [2] (FYI, draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and occurrences in this code path too, instead of relying on a sysctl. Let's keep both code paths consistent.
> 
> Hi Daniel,
> 
>> Hm, that series [1] should probably go to net instead of net-next, but atm
> 
> +1, would make sense.
> 
>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>> useful given it has less complexity also for stable, but I guess ultimately
>> up to maintainers..
> 
> In the short term, I agree. What worries me is that we end up with a redundant, or even useless, sysctl once the other series is applied, which will only increase user confusion.

I'm thinking that even if that series lands, there may still be odd hw out
there where the enforcement of ordering is not in place, and users might be
forced to disable net.ipv6.enforce_ext_hdr_order; the limit would then still
apply and protect them.

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Eric Dumazet @ 2026-04-18 13:15 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <b57f31a2-456e-4727-839a-bc2f0fb07855@gmail.com>

On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman <justin.iurman@gmail.com> wrote:
>
> On 4/18/26 14:26, Daniel Borkmann wrote:
> > Hi Justin,
> >
> > On 4/18/26 1:45 PM, Justin Iurman wrote:
> >> On 4/17/26 19:18, Daniel Borkmann wrote:
> > [...]
> >>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> >>> index d2cd33e2698d..93f865545a7c 100644
> >>> --- a/net/ipv6/sysctl_net_ipv6.c
> >>> +++ b/net/ipv6/sysctl_net_ipv6.c
> >>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
> >>>           .extra1        = SYSCTL_ZERO,
> >>>           .extra2        = &flowlabel_reflect_max,
> >>>       },
> >>> +    {
> >>> +        .procname    = "max_ext_hdrs_number",
> >>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
> >>> +        .maxlen        = sizeof(int),
> >>> +        .mode        = 0644,
> >>> +        .proc_handler    = proc_dointvec_minmax,
> >>> +        .extra1        = SYSCTL_ONE,
> >>> +    },
> >>>       {
> >>>           .procname    = "max_dst_opts_number",
> >>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
> >>
> >> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
> >>
> >> +1000 on the need, but NAK on the way it is done. IMO, we don't want
> >> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
> >> about time) this series [1] to enforce ordering and occurrences of
> >> Extension Headers, which is based on an IETF draft [2] (FYI,
> >> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
> >> occurrences in this code path too, instead of relying on a sysctl.
> >> Let's keep both code paths consistent.
>
> Hi Daniel,
>
> > Hm, that series [1] should probably go to net instead of net-next, but atm
>
> +1, would make sense.
>
> > hasn't moved since a month. I'd still think max_ext_hdrs_number would be
> > useful given it has less complexity also for stable, but I guess ultimately
> > up to maintainers..
>
> In the short term, I agree. What worries me is that we end up with a
> redundant, or even useless, sysctl once the other series is applied,
> which will only increase user confusion.

Given the number of bugs in this code, a sysctl is safe and quite reasonable.

No one will object when it is eventually removed (or becomes a no-op).

For the record, I approve Daniel's patch.

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 13:18 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <d4c72730-2d74-4efe-8ede-50e3fe9658c8@iogearbox.net>

On 4/18/26 14:59, Daniel Borkmann wrote:
> On 4/18/26 2:50 PM, Justin Iurman wrote:
>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>> [...]
>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>>           .extra1        = SYSCTL_ZERO,
>>>>>           .extra2        = &flowlabel_reflect_max,
>>>>>       },
>>>>> +    {
>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>> +        .maxlen        = sizeof(int),
>>>>> +        .mode        = 0644,
>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>> +        .extra1        = SYSCTL_ONE,
>>>>> +    },
>>>>>       {
>>>>>           .procname    = "max_dst_opts_number",
>>>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>
>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>
>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want 
>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but 
>>>> it's about time) this series [1] to enforce ordering and occurrences 
>>>> of Extension Headers, which is based on an IETF draft [2] (FYI, 
>>>> draft-ietf-6man-eh-limits is dead). I think we should enforce
>>>> ordering and occurrences in this code path too, instead of relying 
>>>> on a sysctl. Let's keep both code paths consistent.
>>
>> Hi Daniel,
>>
>>> Hm, that series [1] should probably go to net instead of net-next, 
>>> but atm
>>
>> +1, would make sense.
>>
>>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>>> useful given it has less complexity also for stable, but I guess 
>>> ultimately
>>> up to maintainers..
>>
>> In the short term, I agree. What worries me is that we end up with a 
>> redundant, or even useless, sysctl once the other series is applied, 
>> which will only increase user confusion.
> I'm thinking even if that series lands, and there is still odd hw out there
> where the enforcement of ordering is not in place, and users might be 
> forced
> to disable net.ipv6.enforce_ext_hdr_order, then the limit would still apply
> and protect them.

Agree. OTOH, IPv6 packets with out-of-order (or more than allowed) 
Extension Headers look suspicious and should probably be dropped by 
hosts anyway.

> Cheers,
> Daniel


^ permalink raw reply

* Re: [PATCH net 1/2] tcp: call sk_data_ready() after listener migration
From: 上勾拳 @ 2026-04-18 13:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	shuah, tamird, linux-kernel, linux-kselftest, stable
In-Reply-To: <CANn89iJOfDB+5oORjWPbP7Z1SyqUhMzVR8u8i+8P8MPDgg_EGA@mail.gmail.com>

Thanks Eric, you're right.

After inet_csk_reqsk_queue_add() succeeds, the ref acquired in
reuseport_migrate_sock() is effectively transferred to
nreq->rsk_listener. Another CPU can then dequeue nreq (via
accept() or listener shutdown), hit reqsk_put(), and drop that
listener ref.

Since listeners are SOCK_RCU_FREE, the post-queue_add()
dereferences of nsk should be under rcu_read_lock()/
rcu_read_unlock(), which also covers the existing sock_net(nsk)
access in that path.

I also checked reqsk_timer_handler(): reqsk_queue_migrated()
there is only accounting, and once nreq becomes visible via
inet_ehash_insert(), the handler no longer appears to
dereference nsk.

I'll fold this into v2.


On Sat, Apr 18, 2026 at 14:02, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Apr 17, 2026 at 9:17 PM Zhenzhong Wu <jt26wzz@gmail.com> wrote:
> >
> > When inet_csk_listen_stop() migrates an established child socket from
> > a closing listener to another socket in the same SO_REUSEPORT group,
> > the target listener gets a new accept-queue entry via
> > inet_csk_reqsk_queue_add(), but that path never notifies the target
> > listener's waiters.
> >
> > As a result, a nonblocking accept() still succeeds because it checks
> > the accept queue directly, but waiters that sleep for listener
> > readiness can remain asleep until another connection generates a
> > wakeup. This affects poll()/epoll_wait()-based waiters, and can also
> > leave a blocking accept() asleep after migration even though the
> > child is already in the target listener's accept queue.
> >
> > This was observed in a local test where listener A completed the
> > handshake, queued the child, and was closed before userspace called
> > accept(). The child was migrated to listener B, but listener B never
> > received a wakeup for the migrated accept-queue entry.
> >
> > Call READ_ONCE(nsk->sk_data_ready)(nsk) after a successful migration
> > in inet_csk_listen_stop().
> >
> > The reqsk_timer_handler() path does not need the same change:
> > half-open requests only become readable to userspace when the final
> > ACK completes the handshake, and tcp_child_process() already wakes
> > the listener in that case.
> >
> > Fixes: 54b92e841937 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 4ac3ae1bc..da1ce082f 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -1483,6 +1483,7 @@ void inet_csk_listen_stop(struct sock *sk)
> >                                         __NET_INC_STATS(sock_net(nsk),
> >                                                         LINUX_MIB_TCPMIGRATEREQSUCCESS);
> >                                         reqsk_migrate_reset(req);
> > +                                       READ_ONCE(nsk->sk_data_ready)(nsk);
>
> I think this is adding a potential UAF (Use After Free).
> @nsk might have been freed already by another thread/cpu.
> Note the existing code already has similar issues.
>
> Untested patch:
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 4ac3ae1bc1afc3a39f2790e39b4dda877dc3272b..287b6e01c4f71bfec3dd2a708f316224d9eb4a64 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1479,6 +1479,7 @@ void inet_csk_listen_stop(struct sock *sk)
>                         if (nreq) {
>                                 refcount_set(&nreq->rsk_refcnt, 1);
>
> +                               rcu_read_lock();
>                                 if (inet_csk_reqsk_queue_add(nsk, nreq, child)) {
>                                         __NET_INC_STATS(sock_net(nsk),
>                                                         LINUX_MIB_TCPMIGRATEREQSUCCESS);
> @@ -1489,7 +1490,7 @@ void inet_csk_listen_stop(struct sock *sk)
>                                         reqsk_migrate_reset(nreq);
>                                         __reqsk_free(nreq);
>                                 }
> -
> +                               rcu_read_unlock();
>                                 /* inet_csk_reqsk_queue_add() has already
>                                  * called inet_child_forget() on failure case.
>                                  */

^ permalink raw reply

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 13:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <CANn89i+Y0jctj8=tCHFP5jDSJBAWR=RvNfagammc-WqU6EdPRw@mail.gmail.com>

On 4/18/26 15:15, Eric Dumazet wrote:
> On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman <justin.iurman@gmail.com> wrote:
>>
>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>> Hi Justin,
>>>
>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>> [...]
>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>>>            .extra1        = SYSCTL_ZERO,
>>>>>            .extra2        = &flowlabel_reflect_max,
>>>>>        },
>>>>> +    {
>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>> +        .maxlen        = sizeof(int),
>>>>> +        .mode        = 0644,
>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>> +        .extra1        = SYSCTL_ONE,
>>>>> +    },
>>>>>        {
>>>>>            .procname    = "max_dst_opts_number",
>>>>>            .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>
>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>
>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want
>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
>>>> about time) this series [1] to enforce ordering and occurrences of
>>>> Extension Headers, which is based on an IETF draft [2] (FYI,
>>>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
>>>> occurrences in this code path too, instead of relying on a sysctl.
>>>> Let's keep both code paths consistent.
>>
>> Hi Daniel,
>>
>>> Hm, that series [1] should probably go to net instead of net-next, but atm
>>
>> +1, would make sense.
>>
>>> hasn't moved since a month. I'd still think max_ext_hdrs_number would be
>>> useful given it has less complexity also for stable, but I guess ultimately
>>> up to maintainers..
>>
>> In the short term, I agree. What worries me is that we end up with a
>> redundant, or even useless, sysctl once the other series is applied,
>> which will only increase user confusion.
> 
> Given the number of bugs in this code, a sysctl is safe and quite reasonable.
> 
> No one will object when it is eventually removed (or becomes a no-op).
> 
> For the record, I approve Daniel's patch.

Fair enough. If there is consensus on this patch, then let me just 
suggest two changes:

- make it clear in the sysctl description that it mainly applies to TX 
(as opposed to the other series [1] discussed earlier that applies to RX)

- set the default to 8 (which should be the max value) instead of 32, as 
per RFC8200, Sec. 4.1

^ permalink raw reply

* [PATCH net v2] net/rds: zero per-item info buffer before handing it to visitors
From: Michael Bommarito @ 2026-04-18 14:10 UTC (permalink / raw)
  To: Allison Henderson, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Sharath Srinivasan, Simon Horman, netdev, linux-rdma, rds-devel,
	linux-kernel, stable
In-Reply-To: <20260417141916.494761-1-michael.bommarito@gmail.com>

rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
caller-allocated on-stack u64 buffer to a per-connection visitor and
then copy the full item_len bytes back to user space via
rds_info_copy() regardless of how much of the buffer the visitor
actually wrote.

rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
write a subset of their output struct when the underlying
rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
and the two GIDs via explicit memsets). Several u32 fields
(max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
cache_allocs) and the 2-byte alignment hole between sl and
cache_allocs remain as whatever stack contents preceded the visitor
call and are then copied out to user space.

struct rds_info_rdma_connection and struct rds6_info_rdma_connection
are the only rds_info_* structs in include/uapi/linux/rds.h that are
not marked __attribute__((packed)), so they have a real alignment
hole. The other info visitors (rds_conn_info_visitor,
rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
their packed output struct today and are not known to be vulnerable,
but a future visitor that adds a conditional write-path would have
the same bug.

Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
any netdev is sufficient), sendto()'s any peer on the same subnet
(fails cleanly but installs an rds_connection in the global hash in
RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
bytes of stack garbage including kernel text/data pointers:

    0..7   0a 63 00 01 0a 63 00 02     src=10.99.0.1 dst=10.99.0.2
    8..39  00 ...                      gids (memset-zeroed)
    40..47 e0 92 a3 81 ff ff ff ff     kernel pointer (max_send_wr)
    48..55 7f 37 b5 81 ff ff ff ff     kernel pointer (rdma_mr_max)
    56..59 01 00 08 00                 rdma_mr_size (garbage)
    60..61 00 00                       tos, sl
    62..63 00 00                       alignment padding
    64..67 18 00 00 00                 cache_allocs (garbage)

Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
and rds_walk_conn_path_info() before invoking the visitor. This
covers the IPv4/IPv6 IB visitors and hardens all current and future
visitors against the same class of bug.

No functional change for visitors that fully populate their output.

Changes in v2:
- retarget at the net tree (subject prefix "[PATCH net v2]",
  net/rds: prefix in the title)
- add Cc: stable@vger.kernel.org
- pick up Reviewed-by tags from Sharath Srinivasan and
  Allison Henderson

Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Assisted-by: Claude:claude-opus-4-7
---
 net/rds/connection.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 412441aaa298..c10b7ed06c49 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -701,6 +701,13 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 	     i++, head++) {
 		hlist_for_each_entry_rcu(conn, head, c_hash_node) {
 
+			/* Zero the per-item buffer before handing it to the
+			 * visitor so any field the visitor does not write -
+			 * including implicit alignment padding - cannot leak
+			 * stack contents to user space via rds_info_copy().
+			 */
+			memset(buffer, 0, item_len);
+
 			/* XXX no c_lock usage.. */
 			if (!visitor(conn, buffer))
 				continue;
@@ -750,6 +757,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 			 */
 			cp = conn->c_path;
 
+			/* Zero the per-item buffer for the same reason as
+			 * rds_for_each_conn_info(): any byte the visitor
+			 * does not write (including alignment padding) must
+			 * not leak stack contents via rds_info_copy().
+			 */
+			memset(buffer, 0, item_len);
+
 			/* XXX no cp_lock usage.. */
 			if (!visitor(cp, buffer))
 				continue;
-- 
2.53.0
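
The leak described in the commit message above can be reproduced in
user space with a small sketch. Everything here is an assumption made
for illustration: struct info_rdma_conn is a hypothetical mirror of the
field order shown in the hex dump, not the UAPI header itself, and
partial_visitor() is a stand-in for a visitor handling a connection
that is not in RDS_CONN_UP. The byte count also relies on the usual
compiler behavior of leaving padding untouched on member stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16
#define POISON  0xAA	/* stands in for stale stack contents */

/* Hypothetical mirror of the unpacked layout described in the message. */
struct info_rdma_conn {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint8_t  src_gid[GID_LEN];
	uint8_t  dst_gid[GID_LEN];
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
	uint32_t max_send_sge;
	uint32_t rdma_mr_max;
	uint32_t rdma_mr_size;
	uint8_t  tos;
	uint8_t  sl;
	/* 2 bytes of implicit alignment padding on common ABIs */
	uint32_t cache_allocs;
};

static_assert(offsetof(struct info_rdma_conn, cache_allocs) == 64,
	      "expected a 2-byte hole between sl and cache_allocs");
static_assert(sizeof(struct info_rdma_conn) == 68,
	      "expected a 68-byte item");

/* Stand-in visitor: writes only addresses, GIDs, tos and sl. */
static void partial_visitor(struct info_rdma_conn *info)
{
	info->src_addr = 0x0100630a;
	info->dst_addr = 0x0200630a;
	memset(info->src_gid, 0, sizeof(info->src_gid));
	memset(info->dst_gid, 0, sizeof(info->dst_gid));
	info->tos = 0;
	info->sl = 0;
}

/* Count bytes that still hold the poison after the visitor has run. */
static size_t count_stale_bytes(int zero_first)
{
	struct info_rdma_conn item;
	const unsigned char *bytes = (const unsigned char *)&item;
	size_t i, stale = 0;

	memset(&item, POISON, sizeof(item));	/* simulate dirty stack */
	if (zero_first)
		memset(&item, 0, sizeof(item));	/* the fix: zero per item */
	partial_visitor(&item);

	for (i = 0; i < sizeof(item); i++)
		if (bytes[i] == POISON)
			stale++;
	return stale;
}
```

Under these assumptions, count_stale_bytes(0) should report the same 26
bytes the commit message counts (five u32 fields, the 2-byte hole and
cache_allocs), while count_stale_bytes(1) reports none.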


^ permalink raw reply related

* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, kuba, dsahern, tom, willemdebruijn.kernel,
	idosch, pabeni, netdev
In-Reply-To: <75d98880-afcd-43f9-8bd5-b874fa5690f5@gmail.com>

On 4/18/26 15:46, Justin Iurman wrote:
> On 4/18/26 15:15, Eric Dumazet wrote:
>> On Sat, Apr 18, 2026 at 5:50 AM Justin Iurman 
>> <justin.iurman@gmail.com> wrote:
>>>
>>> On 4/18/26 14:26, Daniel Borkmann wrote:
>>>> Hi Justin,
>>>>
>>>> On 4/18/26 1:45 PM, Justin Iurman wrote:
>>>>> On 4/17/26 19:18, Daniel Borkmann wrote:
>>>> [...]
>>>>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>>>>> index d2cd33e2698d..93f865545a7c 100644
>>>>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>>>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>>>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] 
>>>>>> = {
>>>>>>            .extra1        = SYSCTL_ZERO,
>>>>>>            .extra2        = &flowlabel_reflect_max,
>>>>>>        },
>>>>>> +    {
>>>>>> +        .procname    = "max_ext_hdrs_number",
>>>>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>>>>> +        .maxlen        = sizeof(int),
>>>>>> +        .mode        = 0644,
>>>>>> +        .proc_handler    = proc_dointvec_minmax,
>>>>>> +        .extra1        = SYSCTL_ONE,
>>>>>> +    },
>>>>>>        {
>>>>>>            .procname    = "max_dst_opts_number",
>>>>>>            .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>>>>
>>>>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>>>>
>>>>> +1000 on the need, but NAK on the way it is done. IMO, we don't want
>>>>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's
>>>>> about time) this series [1] to enforce ordering and occurrences of
>>>>> Extension Headers, which is based on an IETF draft [2] (FYI, draft-
>>>>> ietf-6man-eh-limits is dead). I think we should enforce ordering and
>>>>> occurrences in this code path too, instead of relying on a sysctl.
>>>>> Let's keep both code paths consistent.
>>>
>>> Hi Daniel,
>>>
>>>> Hm, that series [1] should probably go to net instead of net-next, 
>>>> but atm
>>>
>>> +1, would make sense.
>>>
>>>> hasn't moved in a month. I'd still think max_ext_hdrs_number 
>>>> would be
>>>> useful given it has less complexity also for stable, but I guess 
>>>> ultimately
>>>> up to maintainers..
>>>
>>> In the short term, I agree. What worries me is that we end up with a
>>> redundant, or even useless, sysctl once the other series is applied,
>>> which will only increase user confusion.
>>
>> Given the amount of bugs in this code, a sysctl is safe and quite 
>> reasonable.
>>
>> No one will object when it is eventually removed (or has no action)
>>
>> For the record, I approve Daniel's patch.
> 
> Fair enough. If there is consensus on this patch, then let me just 
> suggest two changes:
> 
> - make it clear in the sysctl description that it mainly applies to TX 
> (as opposed to the other series [1] discussed earlier that applies to RX)

Sorry, I meant it does not apply to core RX (ip6_rcv()), which is what 
series [1] does.

> - set the default to 8 (which should be the max value) instead of 32, as 
> per RFC8200, Sec. 4.1


^ permalink raw reply

* Re: [PATCH net v2] slip: reject VJ receive packets on instances with no rstate array
From: Weiming Shi @ 2026-04-18 14:37 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Xiang Mei
In-Reply-To: <20260418123929.GE280379@horms.kernel.org>

On 26-04-18 13:39, Simon Horman wrote:
> On Thu, Apr 16, 2026 at 04:41:31AM +0800, Weiming Shi wrote:
> > slhc_init() accepts rslots == 0 as a valid configuration, with the
> > documented meaning of 'no receive compression'. In that case the
> > allocation loop in slhc_init() is skipped, so comp->rstate stays
> > NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
> > slcompress).
> > 
> > The receive helpers do not defend against that configuration.
> > slhc_uncompress() dereferences comp->rstate[x] when the VJ header
> > carries an explicit connection ID, and slhc_remember() later assigns
> > cs = &comp->rstate[...] after only comparing the packet's slot number
> > to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
> > range check, and the code dereferences a NULL rstate.
> > 
> > The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
> > stores its argument in a signed int, and (val >> 16) uses arithmetic
> > shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
> > is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
> > /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
> > is reachable from an unprivileged user namespace. Once the malformed
> > VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
> > frame that selects slot 0 crashes the kernel in softirq context:
> > 
> >  Oops: general protection fault, probably for non-canonical
> >        address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
> >  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> >  RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
> >  Call Trace:
> >   <TASK>
> >   ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
> >   ppp_input (drivers/net/ppp/ppp_generic.c:2359)
> >   ppp_async_process (drivers/net/ppp/ppp_async.c:492)
> >   tasklet_action_common (kernel/softirq.c:926)
> >   handle_softirqs (kernel/softirq.c:623)
> >   run_ksoftirqd (kernel/softirq.c:1055)
> >   smpboot_thread_fn (kernel/smpboot.c:160)
> >   kthread (kernel/kthread.c:436)
> >   ret_from_fork (arch/x86/kernel/process.c:164)
> >   </TASK>
> > 
> > Reject the receive side on such instances instead of touching rstate.
> > slhc_uncompress() falls through to its existing 'bad' label, which
> > bumps sls_i_error and enters the toss state. slhc_remember() mirrors
> > that with an explicit sls_i_error increment followed by slhc_toss();
> > the sls_i_runt counter is not used here because a missing rstate is
> > an internal configuration state, not a runt packet.
> > 
> > The transmit path is unaffected: the only in-tree caller that picks
> > rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
> > slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
> > and slhc_compress() continues to work.
> > 
> > Fixes: b5451d783ade ("slip: Move the SLIP drivers")
> 
> AI review points out that the cited commit moves code but doesn't
> add this bug.
> 
> It seems to me that this bug has existed since the beginning of git
> history. If so, the Fixes tag should be:
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> 
> > Reported-by: Xiang Mei <xmei5@asu.edu>
> > Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> > ---
> > v2:
> > - slhc_remember(): use sls_i_error instead of sls_i_runt for the
> >   missing-rstate case; it is a configuration error, not a runt packet
> >   (Simon).
> > - slhc_uncompress(): goto bad instead of returning 0, so the instance
> >   also enters SLF_TOSS on the first rejected frame.
> 
> Otherwise this looks good to me:
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> 
> I do note that Sashiko flags some other problems in this code.
> I do not think that needs to delay progress of this patch.
> But you may wish to look into them as follow-up work.

Thanks for your review. 

I've already sent two follow-up patches for the decode()/pull16() 
bounds-checking issues:

    [PATCH net] slip: fix slab-out-of-bounds write in slhc_uncompress()
    https://lore.kernel.org/netdev/20260415213359.335657-2-bestswngs@gmail.com/

    [PATCH net] slip: bound decode() reads against the compressed packet length
    https://lore.kernel.org/netdev/20260416100147.531855-5-bestswngs@gmail.com/

Best regards,
Weiming Shi
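
The PPPIOCSMAXCID arithmetic described in the quoted commit message can
be sketched in user space as follows. This is only an illustration:
maxcid_to_slots() and struct vj_slots are hypothetical names, and the
default of 15 and the 0xffff mask are assumptions about how
ppp_generic.c splits the argument, not a copy of that code. The sketch
also assumes a compiler that implements right shift of a negative int
as an arithmetic shift, as GCC and Clang do.

```c
#include <assert.h>

/* Hypothetical pair of slot counts as they would reach slhc_init(). */
struct vj_slots {
	int rslots;
	int tslots;
};

/*
 * Split a PPPIOCSMAXCID-style argument into receive and transmit slot
 * counts. The value is a signed int, so (val >> 16) sign-extends when
 * the top bit is set.
 */
static struct vj_slots maxcid_to_slots(int val)
{
	struct vj_slots slots;
	int val2 = 15;			/* assumed default receive max CID */

	if ((val >> 16) != 0) {
		val2 = val >> 16;	/* becomes -1 for 0xffff0000 */
		val &= 0xffff;
	}

	slots.rslots = val2 + 1;	/* -1 + 1 == 0: no rstate array */
	slots.tslots = val + 1;
	return slots;
}
```

With these assumptions, maxcid_to_slots((int)0xffff0000u) yields
rslots == 0 and tslots == 1, which is the slhc_init(0, 1) call the
commit message describes, whereas an ordinary argument such as 15
yields 16 slots in each direction.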


^ permalink raw reply

* Re: [PATCH net-next v4 5/5] selftests: net: bridge: add MRC and QQIC field encoding tests
From: Ido Schimmel @ 2026-04-18 14:49 UTC (permalink / raw)
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <CAE2MWkmvdAVMBvJ9xKgEzjJZ010=oY_ZoG==FBjHEisHEMrS8Q@mail.gmail.com>

On Fri, Apr 17, 2026 at 11:27:06AM +0530, Ujjal Roy wrote:
> On Mon, Apr 13, 2026 at 2:18 PM Ido Schimmel <idosch@nvidia.com> wrote:
> >
> > See some comments below, but note that net-next is closed:
> >
> > https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
> >
> > So you can either wait with v5 until it is open again or post it as RFC
> > so that we can at least review (but not merge) it while net-next is
> > closed.
> 
> Let me clarify the changes requested here inline, so that I will have
> v5 ready by the time net-next reopens. You can ask me to send it as
> RFC v5 if you have doubts about the inline answers.

I checked the proposed changes and they look fine to me.

Thanks

^ permalink raw reply

