Netdev List


* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 12:50 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <ae053593-907e-4891-90fb-03b4c5d8f5e1@iogearbox.net>

On 4/18/26 14:26, Daniel Borkmann wrote:
> Hi Justin,
> 
> On 4/18/26 1:45 PM, Justin Iurman wrote:
>> On 4/17/26 19:18, Daniel Borkmann wrote:
> [...]
>>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>>> index d2cd33e2698d..93f865545a7c 100644
>>> --- a/net/ipv6/sysctl_net_ipv6.c
>>> +++ b/net/ipv6/sysctl_net_ipv6.c
>>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>>           .extra1        = SYSCTL_ZERO,
>>>           .extra2        = &flowlabel_reflect_max,
>>>       },
>>> +    {
>>> +        .procname    = "max_ext_hdrs_number",
>>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>>> +        .maxlen        = sizeof(int),
>>> +        .mode        = 0644,
>>> +        .proc_handler    = proc_dointvec_minmax,
>>> +        .extra1        = SYSCTL_ONE,
>>> +    },
>>>       {
>>>           .procname    = "max_dst_opts_number",
>>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
>>
>> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
>>
>> +1000 on the need, but NAK on the way it is done. IMO, we don't want 
>> yet-another-sysctl for that. Instead, we have (well, not yet, but it's 
>> about time) this series [1] to enforce ordering and occurrences of 
>> Extension Headers, which is based on an IETF draft [2] (FYI,
>> draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and
>> occurrences in this code path too, instead of relying on a sysctl. 
>> Let's keep both code paths consistent.

Hi Daniel,

> Hm, that series [1] should probably go to net instead of net-next, but atm

+1, would make sense.

> it hasn't moved for a month. I'd still think max_ext_hdrs_number would be
> useful given it has less complexity, also for stable, but I guess it is
> ultimately up to maintainers.

In the short term, I agree. What worries me is that we end up with a 
redundant, or even useless, sysctl once the other series is applied, 
which will only increase user confusion.

Cheers,
Justin

> Thanks,
> Daniel
> 
>>   [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
>>   [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/
> 



* Re: [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-18 12:40 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260418121538.706095-1-daniel@iogearbox.net>

On 4/18/26 14:15, Daniel Borkmann wrote:
> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
> and applied them in ip6_parse_tlv(), the generic TLV walker
> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
> 
> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
> loop is bounded only by optlen, which can be up to 2048 bytes.
> Stuffing the Destination Options header with 2046 Pad1 (type=0)
> entries advances the scanner a single byte at a time, yielding
> ~2000 TLV iterations per extension header.
> 
> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
> the semantics from 47d3d7ac656a.
> 
> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   v1->v2:
>     - Remove unlikely (Justin)
>     - Use abs() given max_dst_opts_cnt's negative meaning (Justin)
> 
>   net/ipv6/ip6_tunnel.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 907c6a2af331..0f50b7fcb24e 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   				break;
>   		}
>   		if (nexthdr == NEXTHDR_DEST) {
> +			int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
> +			int tlv_cnt = 0;
>   			u16 i = 2;
>   
>   			while (1) {
>   				struct ipv6_tlv_tnl_enc_lim *tel;
>   
> +				if (tlv_cnt++ >= tlv_max)
> +					break;
> +
>   				/* No more room for encapsulation limit */
>   				if (i + sizeof(*tel) > optlen)
>   					break;

Thanks for v2, Daniel.

I'm still wondering: should we align the above parsing behavior with the 
one in ip6_parse_tlv() to keep things consistent? That is: don't 
increment tlv_cnt for Pad1/PadN, make sure we don't exceed 8 bytes per 
padding (consecutive Pad1's, or a PadN), and we could also check that a 
PadN payload is only made of zeroes. Open question...

Otherwise, LGTM:
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
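
For reference, the bound added by the quoted hunk can be modelled in user
space. This is only a minimal sketch, not the kernel code:
count_tlv_iterations() and its parameters are hypothetical names, and the
scanner is simplified to step over Pad1 and generic type/length options.
Like the quoted hunk, it counts every iteration (padding included) against
the budget.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical user-space model of a Destination Options TLV walk.
 * opts points at the options area, optlen is its length in bytes and
 * sysctl_cnt models max_dst_opts_cnt (negative values are folded with
 * abs(), as in the v2 patch).
 * Returns the number of loop iterations performed.
 */
static int count_tlv_iterations(const uint8_t *opts, size_t optlen,
				int sysctl_cnt)
{
	int tlv_max = abs(sysctl_cnt);
	int tlv_cnt = 0;
	size_t i = 0;

	while (i < optlen) {
		if (tlv_cnt >= tlv_max)
			break;		/* budget exhausted, stop scanning */
		tlv_cnt++;

		if (opts[i] == 0) {	/* Pad1: a single byte, no length */
			i++;
			continue;
		}
		if (i + 2 > optlen)	/* truncated type/length pair */
			break;
		i += 2 + (size_t)opts[i + 1];	/* type, length, payload */
	}
	return tlv_cnt;
}
```

With an options area made of 2046 Pad1 bytes and a limit of 8 (or -8), the
walk stops after 8 iterations; with a limit larger than the buffer it runs
all 2046 iterations, which is the behaviour the patch is meant to avoid.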


* Re: [PATCH net v2] slip: reject VJ receive packets on instances with no rstate array
From: Simon Horman @ 2026-04-18 12:39 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Xiang Mei
In-Reply-To: <20260415204130.258866-2-bestswngs@gmail.com>

On Thu, Apr 16, 2026 at 04:41:31AM +0800, Weiming Shi wrote:
> slhc_init() accepts rslots == 0 as a valid configuration, with the
> documented meaning of 'no receive compression'. In that case the
> allocation loop in slhc_init() is skipped, so comp->rstate stays
> NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
> slcompress).
> 
> The receive helpers do not defend against that configuration.
> slhc_uncompress() dereferences comp->rstate[x] when the VJ header
> carries an explicit connection ID, and slhc_remember() later assigns
> cs = &comp->rstate[...] after only comparing the packet's slot number
> to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
> range check, and the code dereferences a NULL rstate.
> 
> The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
> stores its argument in a signed int, and (val >> 16) uses arithmetic
> shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
> is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
> /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
> is reachable from an unprivileged user namespace. Once the malformed
> VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
> frame that selects slot 0 crashes the kernel in softirq context:
> 
>  Oops: general protection fault, probably for non-canonical
>        address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>  RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
>  Call Trace:
>   <TASK>
>   ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
>   ppp_input (drivers/net/ppp/ppp_generic.c:2359)
>   ppp_async_process (drivers/net/ppp/ppp_async.c:492)
>   tasklet_action_common (kernel/softirq.c:926)
>   handle_softirqs (kernel/softirq.c:623)
>   run_ksoftirqd (kernel/softirq.c:1055)
>   smpboot_thread_fn (kernel/smpboot.c:160)
>   kthread (kernel/kthread.c:436)
>   ret_from_fork (arch/x86/kernel/process.c:164)
>   </TASK>
> 
> Reject the receive side on such instances instead of touching rstate.
> slhc_uncompress() falls through to its existing 'bad' label, which
> bumps sls_i_error and enters the toss state. slhc_remember() mirrors
> that with an explicit sls_i_error increment followed by slhc_toss();
> the sls_i_runt counter is not used here because a missing rstate is
> an internal configuration state, not a runt packet.
> 
> The transmit path is unaffected: the only in-tree caller that picks
> rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
> slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
> and slhc_compress() continues to work.
> 
> Fixes: b5451d783ade ("slip: Move the SLIP drivers")

AI review points out that the cited commit moves code but doesn't
add this bug.

It seems to me that this bug has existed since the beginning of git
history. If so, the Fixes tag should be:

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
> v2:
> - slhc_remember(): use sls_i_error instead of sls_i_runt for the
>   missing-rstate case; it is a configuration error, not a runt packet
>   (Simon).
> - slhc_uncompress(): goto bad instead of returning 0, so the instance
>   also enters SLF_TOSS on the first rejected frame.

Otherwise this looks good to me:

Reviewed-by: Simon Horman <horms@kernel.org>


I do note that Sashiko flags some other problems in this code.
I do not think that needs to delay progress of this patch.
But you may wish to look into them as follow-up work.
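
For readers following along, the PPPIOCSMAXCID arithmetic described in the
quoted commit message can be sketched in user space. This is a hedged model
only: maxcid_to_slots() and struct vj_slots are hypothetical names, the
default of 15 for the receive side is an assumption about ppp_generic.c,
and the result relies on implementation-defined behaviour (32-bit int,
arithmetic right shift of negative values) as provided by GCC and Clang.

```c
#include <stdint.h>

struct vj_slots {
	int rslots;
	int tslots;
};

/*
 * Hypothetical model of how the ioctl argument is split into the two
 * slot counts passed to slhc_init(rslots, tslots).
 */
static struct vj_slots maxcid_to_slots(uint32_t arg)
{
	int val = (int)arg;	/* the argument is stored in a signed int */
	int val2 = 15;		/* assumed default when the high half is 0 */
	struct vj_slots s;

	if ((val >> 16) != 0) {
		val2 = val >> 16;	/* sign-extends when bit 31 is set */
		val &= 0xffff;
	}
	s.rslots = val2 + 1;
	s.tslots = val + 1;
	return s;
}
```

Under those assumptions, an argument of 0xffff0000 yields rslots == 0 and
tslots == 1, i.e. the slhc_init(0, 1) call named in the commit message.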


* Re: [PATCH iwl-net v3 3/6] ixgbe: call ixgbe_setup_fc() before fc_enable() after NVM update
From: Simon Horman @ 2026-04-18 12:28 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260415142841.3222399-4-aleksandr.loktionov@intel.com>

On Wed, Apr 15, 2026 at 04:28:38PM +0200, Aleksandr Loktionov wrote:
> During an NVM update the PHY reset clears the Technology Ability Field
> (IEEE 802.3 clause 37 register 7.10) back to hardware defaults.  When
> the driver subsequently calls only hw->mac.ops.fc_enable() the SRRCTL
> register is recalculated from stale autonegotiated capability bits,
> which the MDD (Malicious Driver Detect) logic treats as an invalid
> change and halts traffic on the PF.
> 
> Fix by calling ixgbe_setup_fc() immediately before fc_enable() in
> ixgbe_watchdog_update_link() so that flow-control autoneg and the PHY
> registers are re-programmed in the correct order after any reset.
> 
> Skip setup_fc() on backplane links: on 82599 backplane interfaces
> setup_fc() resolves to prot_autoc_write() ->
> ixgbe_reset_pipeline_82599() which toggles IXGBE_AUTOC_AN_RESTART.
> Calling it unconditionally on link-up creates an infinite link-flap
> loop because each AN-restart triggers another link-up event.  Guard
> with a get_media_type() check and skip setup_fc() when the media type
> is ixgbe_media_type_backplane; fc_enable() is still called.
> 
> Also handle the failure path: if setup_fc() returns an error its output
> is invalid and calling fc_enable() on the unchanged hardware state would
> repeat the exact MDD-triggering condition the fix is meant to prevent.
> Skip fc_enable() in that case while still calling
> ixgbe_set_rx_drop_en() which configures the independent RX-drop
> behaviour.
> 
> Fixes: 93c52dd0033b ("ixgbe: Merge watchdog functionality into service task")
> Suggested-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v2 -> v3:
>  - Skip setup_fc() for ixgbe_media_type_backplane: unconditional call on
>    82599 backplane links triggers prot_autoc_write() ->
>    ixgbe_reset_pipeline_82599() -> IXGBE_AUTOC_AN_RESTART, causing an
>    infinite link-flap loop (Simon Horman).

Reviewed-by: Simon Horman <horms@kernel.org>

(Unsurprisingly) Sashiko has a number of things to say about this patchset.
But I believe they can all be analysed as part of follow-up work: no need
to block progress of this patchset IMHO.


* Re: [PATCH iwl-net v3 5/6] ixgbe: fix ITR value overflow in adaptive interrupt throttling
From: Simon Horman @ 2026-04-18 12:26 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260415142841.3222399-6-aleksandr.loktionov@intel.com>

On Wed, Apr 15, 2026 at 04:28:40PM +0200, Aleksandr Loktionov wrote:
> ixgbe_update_itr() packs a mode flag (IXGBE_ITR_ADAPTIVE_LATENCY,
> bit 7) and a usecs delay (bits [6:0]) into an unsigned int, then
> stores the combined value in ring_container->itr which is declared as
> u8.  Values above 0xFF wrap on truncation, corrupting both the delay
> and the mode flag on the next readback.
> 
> Keep the mode bit (IXGBE_ITR_ADAPTIVE_LATENCY) and the usec delay as
> separate operands in the final store expression.  Clamp only the usecs
> portion to [IXGBE_ITR_ADAPTIVE_MIN_USECS, IXGBE_ITR_ADAPTIVE_MAX_USECS]
> using clamp_val() so that:
>  - overflow cannot bleed into the mode bit (bit 7),
>  - the delay cannot exceed 126 us (IXGBE_ITR_ADAPTIVE_MAX_USECS),
>  - the delay cannot drop below 10 us (IXGBE_ITR_ADAPTIVE_MIN_USECS).
> 
> Fixes: b4ded8327fea ("ixgbe: Update adaptive ITR algorithm")
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v2 -> v3:
>  - Use clamp_val() instead of min_t() to also guard the lower bound
>    (IXGBE_ITR_ADAPTIVE_MIN_USECS); keep mode and delay as separate
>    operands until final store; use IXGBE_ITR_ADAPTIVE_MAX_USECS (126)
>    as upper bound instead of IXGBE_ITR_ADAPTIVE_LATENCY - 1 (127)
>    (Simon Horman).

FTR: I think the code would be easier to reason about if
mode and delay were kept separate during the earlier calculation
of itr. But I also think that can be handled as a follow-up,
as this patch does improve things.

Reviewed-by: Simon Horman <horms@kernel.org>
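
As an illustration of the packing described in the quoted commit message,
here is a minimal user-space sketch. itr_pack() and itr_pack_truncating()
are hypothetical helpers, not driver functions; the constants mirror the
values stated in the commit message (mode flag in bit 7, delay clamped to
10..126 us).

```c
#include <stdint.h>

/* Values taken from the quoted commit message. */
#define ITR_ADAPTIVE_LATENCY	0x80u	/* mode flag, bit 7 */
#define ITR_ADAPTIVE_MIN_USECS	10u
#define ITR_ADAPTIVE_MAX_USECS	126u

/* Hypothetical helper: clamp the delay first, then OR in the mode bit. */
static uint8_t itr_pack(unsigned int mode, unsigned int usecs)
{
	if (usecs < ITR_ADAPTIVE_MIN_USECS)
		usecs = ITR_ADAPTIVE_MIN_USECS;
	else if (usecs > ITR_ADAPTIVE_MAX_USECS)
		usecs = ITR_ADAPTIVE_MAX_USECS;

	return (uint8_t)((mode & ITR_ADAPTIVE_LATENCY) | usecs);
}

/* The failure mode being fixed: a wide value truncated into a u8. */
static uint8_t itr_pack_truncating(unsigned int mode, unsigned int usecs)
{
	return (uint8_t)(mode | usecs);
}
```

For example, a delay of 400 with the mode bit clear truncates to 0x90 in
the unclamped variant, which reads back as the latency mode with a 16 us
delay; the clamped variant stores 126 with the mode bit still clear.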


* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-18 12:26 UTC (permalink / raw)
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <60b47924-dae4-4a10-b977-75b92e1094c0@gmail.com>

Hi Justin,

On 4/18/26 1:45 PM, Justin Iurman wrote:
> On 4/17/26 19:18, Daniel Borkmann wrote:
[...]
>> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
>> index d2cd33e2698d..93f865545a7c 100644
>> --- a/net/ipv6/sysctl_net_ipv6.c
>> +++ b/net/ipv6/sysctl_net_ipv6.c
>> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>>           .extra1        = SYSCTL_ZERO,
>>           .extra2        = &flowlabel_reflect_max,
>>       },
>> +    {
>> +        .procname    = "max_ext_hdrs_number",
>> +        .data        = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
>> +        .maxlen        = sizeof(int),
>> +        .mode        = 0644,
>> +        .proc_handler    = proc_dointvec_minmax,
>> +        .extra1        = SYSCTL_ONE,
>> +    },
>>       {
>>           .procname    = "max_dst_opts_number",
>>           .data        = &init_net.ipv6.sysctl.max_dst_opts_cnt,
> 
> NACKed-by: Justin Iurman <justin.iurman@gmail.com>
> 
> +1000 on the need, but NAK on the way it is done. IMO, we don't want yet-another-sysctl for that. Instead, we have (well, not yet, but it's about time) this series [1] to enforce ordering and occurrences of Extension Headers, which is based on an IETF draft [2] (FYI, draft-ietf-6man-eh-limits is dead). I think we should enforce ordering and occurrences in this code path too, instead of relying on a sysctl. Let's keep both code paths consistent.

Hm, that series [1] should probably go to net instead of net-next, but atm
it hasn't moved for a month. I'd still think max_ext_hdrs_number would be
useful given it has less complexity, also for stable, but I guess it is
ultimately up to maintainers.

Thanks,
Daniel

>   [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
>   [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/



* [PATCH] net: hamachi: fix divide by zero in hamachi_init_one
From: Mingyu Wang @ 2026-04-18 12:18 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: tglx, mingo, netdev, linux-kernel, Mingyu Wang

During the hardware initialization phase in hamachi_init_one(), the driver
reads the PCIClkMeas register to calculate the PCI bus frequency.

The current code attempts to prevent a divide-by-zero error using a ternary
operator: `i ? 2000/(i&0x7f) : 0`. However, this check is flawed. The highest
bit of `i` (0x80) acts as a ready flag. If unreliable hardware or a malicious
virtual device returns a value where the ready bit is set but the lower 7 bits
are zero (e.g., 0x80), the condition `i` evaluates to true, but `(i & 0x7f)`
evaluates to 0. This results in a fatal divide-by-zero exception.

This bug was discovered during an automated virtual device fuzzing campaign
testing the hardware-software trust boundary. When the hardware returns 0x80,
the value passes the readiness while-loop but triggers the divide error. In our
tests, this panic interrupted the module loading process, further triggering
a KASAN slab-out-of-bounds in the module error path, and ultimately leading
to a multi-core soft lockup and RCU stall.

This patch fixes the issue by explicitly checking the divisor `(i & 0x7f)`
instead of the entire register value `i` before performing the division.

Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
---
 drivers/net/ethernet/packetengines/hamachi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/packetengines/hamachi.c b/drivers/net/ethernet/packetengines/hamachi.c
index b0de7e9f12a5..1d7206dd18fd 100644
--- a/drivers/net/ethernet/packetengines/hamachi.c
+++ b/drivers/net/ethernet/packetengines/hamachi.c
@@ -748,7 +748,7 @@ static int hamachi_init_one(struct pci_dev *pdev,
 	printk(KERN_INFO "%s:  %d-bit %d Mhz PCI bus (%d), Virtual Jumpers "
 		   "%2.2x, LPA %4.4x.\n",
 		   dev->name, readw(ioaddr + MiscStatus) & 1 ? 64 : 32,
-		   i ? 2000/(i&0x7f) : 0, i&0x7f, (int)readb(ioaddr + VirtualJumpers),
+		   (i & 0x7f) ? 2000 / (i & 0x7f) : 0, i & 0x7f, (int)readb(ioaddr + VirtualJumpers),
 		   readw(ioaddr + ANLinkPartnerAbility));

 	if (chip_tbl[hmp->chip_id].flags & CanHaveMII) {
-- 
2.34.1
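
The guarded division in the hunk above can be checked in isolation with a
small user-space sketch. pci_clk_mhz() is a hypothetical helper, not part
of the driver; it assumes, as the commit message states, that bit 7 of the
register value is the ready flag and the low 7 bits are the divisor.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the fixed expression: test the divisor
 * itself rather than the whole register value.
 */
static int pci_clk_mhz(uint8_t reg)
{
	unsigned int div = reg & 0x7f;

	return div ? (int)(2000 / div) : 0;
}
```

A register value of 0x80 (ready bit set, divisor zero) now yields 0 instead
of dividing by zero, while a value such as 0x80 | 60 still yields 33.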


* Re: [PATCH net] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Daniel Borkmann @ 2026-04-18 12:17 UTC (permalink / raw)
  To: Justin Iurman, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <acee197f-1821-4304-8759-a02ac1d5c808@gmail.com>

Hi Justin,

On 4/18/26 12:59 PM, Justin Iurman wrote:
[...]
> Good point on reusing max_dst_opts_cnt in ip6_tnl_parse_tlv_enc_lim(), but this patch is not ready yet.
> 
> We need to be careful: max_dst_opts_cnt can be negative. If this is the case, ip6_tnl_parse_tlv_enc_lim() would probably return 0, which is not what we want here. From the doc:
> 
> max_dst_opts_number - INTEGER
>          Maximum number of non-padding TLVs allowed in a Destination
>          options extension header. If this value is less than zero
>          then unknown options are disallowed and the number of known
>          TLVs allowed is the absolute value of this number.
> 
>          Default: 8
> 
> Since ip6_tnl_parse_tlv_enc_lim() does not check for specific option types (e.g., Pad1, PadN, you-name-it) and does not differentiate known from unknown options during parsing, I would simply use the absolute value of max_dst_opts_cnt by default.
> 
> Also, I wouldn't use unlikely() because it could harm us more than it helps in this specific context (consistent with ip6_parse_tlv()).

Thanks for the review, the suggestions make sense, and I've updated them in a v2.

Cheers,
Daniel


* [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Daniel Borkmann @ 2026-04-18 12:15 UTC (permalink / raw)
  To: kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	justin.iurman, netdev

Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
and applied them in ip6_parse_tlv(), the generic TLV walker
invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().

ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
loop is bounded only by optlen, which can be up to 2048 bytes.
Stuffing the Destination Options header with 2046 Pad1 (type=0)
entries advances the scanner a single byte at a time, yielding
~2000 TLV iterations per extension header.

Reuse max_dst_opts_cnt to bound the TLV iterations, matching
the semantics from 47d3d7ac656a.

Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 v1->v2:
   - Remove unlikely (Justin)
   - Use abs() given max_dst_opts_cnt's negative meaning (Justin)

 net/ipv6/ip6_tunnel.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 907c6a2af331..0f50b7fcb24e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
 				break;
 		}
 		if (nexthdr == NEXTHDR_DEST) {
+			int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
+			int tlv_cnt = 0;
 			u16 i = 2;

 			while (1) {
 				struct ipv6_tlv_tnl_enc_lim *tel;

+				if (tlv_cnt++ >= tlv_max)
+					break;
+
 				/* No more room for encapsulation limit */
 				if (i + sizeof(*tel) > optlen)
 					break;
-- 
2.43.0


* Re: [PATCH net] ipv6: Implement limits on extension header parsing
From: Justin Iurman @ 2026-04-18 11:45 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260417171831.687053-1-daniel@iogearbox.net>

On 4/17/26 19:18, Daniel Borkmann wrote:
> ipv6_{skip_exthdr,find_hdr}() and ip6_tnl_parse_tlv_enc_lim() iterate
> over IPv6 extension headers until they find a non-extension-header
> protocol or run out of packet data. The loops have no iteration counter,
> relying solely on the packet length to bound them. For a crafted packet
> with 8-byte extension headers filling a 64KB jumbogram, this means a
> worst case of up to ~8k iterations with a skb_header_pointer call each.
> ipv6_skip_exthdr(), for example, is used where it parses the inner
> quoted packet inside an incoming ICMPv6 error:
> 
>    - icmpv6_rcv
>      - checksum validation
>      - case ICMPV6_DEST_UNREACH
>        - icmpv6_notify
>          - pskb_may_pull()       <- pull inner IPv6 header
>          - ipv6_skip_exthdr()    <- iterates here
>          - pskb_may_pull()
>          - ipprot->err_handler() <- sk lookup (matching sk not required)
> 
> The per-iteration cost of ipv6_skip_exthdr itself is generally light,
> but skb_header_pointer becomes more costly on reassembled packets: the
> first ~1KB of the inner packet are in the skb's linear area, but the
> remaining ~63KB are in the frag_list where skb_copy_bits is needed to
> read data.
> 
> Add a configurable limit via a new sysctl net.ipv6.max_ext_hdrs_number
> (default 32, minimum 1). All three extension header walking functions
> are bound by this limit. The sysctl is in line with commit 47d3d7ac656a
> ("ipv6: Implement limits on Hop-by-Hop and Destination options"). The
> init_net is used since plumbing a struct net * through all helpers
> would touch a lot of callsites.
> 
> There's an ongoing IETF draft-ietf-6man-eh-limits-18 that states that
> 8 extension headers before the transport header is the baseline which
> routers MUST handle; section 7 details also why limits are needed.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   Documentation/networking/ip-sysctl.rst |  7 +++++++
>   include/net/ipv6.h                     |  2 ++
>   include/net/netns/ipv6.h               |  1 +
>   net/ipv6/af_inet6.c                    |  1 +
>   net/ipv6/exthdrs_core.c                | 11 +++++++++++
>   net/ipv6/ip6_tunnel.c                  |  5 +++++
>   net/ipv6/sysctl_net_ipv6.c             |  8 ++++++++
>   7 files changed, 35 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6921d8594b84..4559a956bbd9 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -2503,6 +2503,13 @@ max_hbh_length - INTEGER
>   
>   	Default: INT_MAX (unlimited)
>   
> +max_ext_hdrs_number - INTEGER
> +	Maximum number of IPv6 extension headers allowed in a packet.
> +	Limits how many extension headers will be traversed. The value
> +	is read from the initial netns.
> +
> +	Default: 32
> +
>   skip_notify_on_dev_down - BOOLEAN
>   	Controls whether an RTM_DELROUTE message is generated for routes
>   	removed when a device is taken down or deleted. IPv4 does not
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 53c5056508be..d7f0d55e6918 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -90,6 +90,8 @@ struct ip_tunnel_info;
>   #define IP6_DEFAULT_MAX_DST_OPTS_LEN	 INT_MAX /* No limit */
>   #define IP6_DEFAULT_MAX_HBH_OPTS_LEN	 INT_MAX /* No limit */
>   
> +#define IP6_DEFAULT_MAX_EXT_HDRS_CNT	 32
> +
>   /*
>    *	Addr type
>    *	
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 34bdb1308e8f..5be4dd1c9ae8 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -54,6 +54,7 @@ struct netns_sysctl_ipv6 {
>   	int max_hbh_opts_cnt;
>   	int max_dst_opts_len;
>   	int max_hbh_opts_len;
> +	int max_ext_hdrs_cnt;
>   	int seg6_flowlabel;
>   	u32 ioam6_id;
>   	u64 ioam6_id_wide;
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 4cbd45b68088..ed7fe6e4a6bd 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -965,6 +965,7 @@ static int __net_init inet6_net_init(struct net *net)
>   	net->ipv6.sysctl.flowlabel_state_ranges = 0;
>   	net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT;
>   	net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
> +	net->ipv6.sysctl.max_ext_hdrs_cnt = IP6_DEFAULT_MAX_EXT_HDRS_CNT;
>   	net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
>   	net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
>   	net->ipv6.sysctl.fib_notify_on_flag_change = 0;
> diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
> index 49e31e4ae7b7..917307877cbb 100644
> --- a/net/ipv6/exthdrs_core.c
> +++ b/net/ipv6/exthdrs_core.c
> @@ -4,6 +4,8 @@
>    * not configured or static.
>    */
>   #include <linux/export.h>
> +
> +#include <net/net_namespace.h>
>   #include <net/ipv6.h>
>   
>   /*
> @@ -72,7 +74,9 @@ EXPORT_SYMBOL(ipv6_ext_hdr);
>   int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
>   		     __be16 *frag_offp)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	u8 nexthdr = *nexthdrp;
> +	int exthdr_cnt = 0;
>   
>   	*frag_offp = 0;
>   
> @@ -80,6 +84,8 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
>   		struct ipv6_opt_hdr _hdr, *hp;
>   		int hdrlen;
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			return -1;
>   		if (nexthdr == NEXTHDR_NONE)
>   			return -1;
>   		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
> @@ -188,8 +194,10 @@ EXPORT_SYMBOL_GPL(ipv6_find_tlv);
>   int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>   		  int target, unsigned short *fragoff, int *flags)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
>   	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
> +	int exthdr_cnt = 0;
>   	bool found;
>   
>   	if (fragoff)
> @@ -216,6 +224,9 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>   			return -ENOENT;
>   		}
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			return -EBADMSG;
> +
>   		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
>   		if (!hp)
>   			return -EBADMSG;
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 0b53488a9229..78e849e167ca 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -396,15 +396,20 @@ ip6_tnl_dev_uninit(struct net_device *dev)
>   
>   __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   {
> +	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
>   	const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
>   	unsigned int nhoff = raw - skb->data;
>   	unsigned int off = nhoff + sizeof(*ipv6h);
>   	u8 nexthdr = ipv6h->nexthdr;
> +	int exthdr_cnt = 0;
>   
>   	while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
>   		struct ipv6_opt_hdr *hdr;
>   		u16 optlen;
>   
> +		if (unlikely(exthdr_cnt++ >= exthdr_max))
> +			break;
> +
>   		if (!pskb_may_pull(skb, off + sizeof(*hdr)))
>   			break;
>   
> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index d2cd33e2698d..93f865545a7c 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
>   		.extra1		= SYSCTL_ZERO,
>   		.extra2		= &flowlabel_reflect_max,
>   	},
> +	{
> +		.procname	= "max_ext_hdrs_number",
> +		.data		= &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ONE,
> +	},
>   	{
>   		.procname	= "max_dst_opts_number",
>   		.data		= &init_net.ipv6.sysctl.max_dst_opts_cnt,

NACKed-by: Justin Iurman <justin.iurman@gmail.com>

+1000 on the need, but NAK on the way it is done. IMO, we don't want 
yet-another-sysctl for that. Instead, we have (well, not yet, but it's 
about time) this series [1] to enforce ordering and occurrences of 
Extension Headers, which is based on an IETF draft [2] (FYI, 
draft-ietf-6man-eh-limits is dead). I think we should enforce ordering 
and occurrences in this code path too, instead of relying on a sysctl. 
Let's keep both code paths consistent.

  [1] https://lore.kernel.org/netdev/20260314175124.47010-1-tom@herbertland.com/#t
  [2] https://datatracker.ietf.org/doc/draft-iurman-6man-eh-occurrences/
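
To make the iteration-counter idea from the quoted patch concrete, here is
a minimal user-space sketch. It illustrates only the counter, not the
ordering and occurrence checks of [1] and [2]. skip_exthdrs_bounded() and
is_ext_hdr() are hypothetical names, the walk runs over a flat byte buffer
rather than an skb, and only three generic extension header types are
recognised.

```c
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP	0	/* Hop-by-Hop options */
#define NEXTHDR_ROUTING	43	/* Routing header */
#define NEXTHDR_DEST	60	/* Destination options */

/* Simplified: only three generic extension header types are recognised. */
static int is_ext_hdr(uint8_t nh)
{
	return nh == NEXTHDR_HOP || nh == NEXTHDR_ROUTING ||
	       nh == NEXTHDR_DEST;
}

/*
 * Hypothetical bounded walk. Each header starts with
 * { next header, hdr ext len } and occupies (hdr ext len + 1) * 8 bytes.
 * Returns the offset of the first non-extension header, or -1 on
 * truncation or when more than max_hdrs extension headers are chained.
 */
static int skip_exthdrs_bounded(const uint8_t *buf, size_t len,
				uint8_t nexthdr, int max_hdrs)
{
	size_t off = 0;
	int cnt = 0;

	while (is_ext_hdr(nexthdr)) {
		size_t hdrlen;

		if (cnt++ >= max_hdrs)
			return -1;	/* too many headers, give up */
		if (off + 2 > len)
			return -1;	/* truncated header */
		hdrlen = ((size_t)buf[off + 1] + 1) * 8;
		if (off + hdrlen > len)
			return -1;	/* header runs past the buffer */
		nexthdr = buf[off];
		off += hdrlen;
	}
	return (int)off;
}
```

With 64 chained 8-byte Destination Options headers and a limit of 32, the
walk gives up after 32 headers instead of traversing the whole chain.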


* [PATCH net v2] vxlan: fix NULL vn6_sock dereference in vxlan_igmp_join() and vxlan_igmp_leave()
From: Weiming Shi @ 2026-04-18 11:41 UTC (permalink / raw)
  To: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Roopa Prabhu, netdev, Xiang Mei, Weiming Shi

vxlan_sock_add() tolerates IPv6 socket creation failure with
-EAFNOSUPPORT (e.g. ipv6.disable=1), leaving vn6_sock as NULL while
successfully creating vn4_sock. vxlan_igmp_join() and
vxlan_igmp_leave() then crash when they dereference the NULL vn6_sock
for VNI filter entries with IPv6 multicast groups:

 Oops: general protection fault, probably for non-canonical address
      0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 RIP: 0010:vxlan_igmp_join (drivers/net/vxlan/vxlan_multicast.c:40)
 Call Trace:
  vxlan_multicast_join (drivers/net/vxlan/vxlan_multicast.c:195)
  vxlan_open (drivers/net/vxlan/vxlan_core.c:2965)
  __dev_open (net/core/dev.c:1704)
  __dev_change_flags (net/core/dev.c:9781)
  do_setlink.isra.0 (net/core/rtnetlink.c:3180)
  rtnl_newlink (net/core/rtnetlink.c:4238)
  rtnetlink_rcv_msg (net/core/rtnetlink.c:6921)

Skip the IPv6 multicast join/leave when vn6_sock is NULL, consistent
with how vxlan_sock_add() tolerates missing IPv6 support.

Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
v2:
  - drop sock4 NULL checks 

 drivers/net/vxlan/vxlan_multicast.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_multicast.c b/drivers/net/vxlan/vxlan_multicast.c
index a7f2d67dc61b..e6aa5ab1c939 100644
--- a/drivers/net/vxlan/vxlan_multicast.c
+++ b/drivers/net/vxlan/vxlan_multicast.c
@@ -37,6 +37,9 @@ int vxlan_igmp_join(struct vxlan_dev *vxlan, union vxlan_addr *rip,
 	} else {
 		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
 
+		if (!sock6)
+			return 0;
+
 		sk = sock6->sock->sk;
 		lock_sock(sk);
 		ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex,
@@ -71,6 +74,9 @@ int vxlan_igmp_leave(struct vxlan_dev *vxlan, union vxlan_addr *rip,
 	} else {
 		struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
 
+		if (!sock6)
+			return 0;
+
 		sk = sock6->sock->sk;
 		lock_sock(sk);
 		ret = ipv6_stub->ipv6_sock_mc_drop(sk, ifindex,
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC PATCH v4 01/19] landlock: Support socket access-control
From: Mikhail Ivanov @ 2026-04-18 11:29 UTC (permalink / raw)
  To: Günther Noack
  Cc: mic, gnoack, willemdebruijn.kernel, matthieu,
	linux-security-module, netdev, netfilter-devel, yusongping,
	artem.kuzin, konstantin.meskhidze
In-Reply-To: <af464773-b01b-f3a4-474d-0efb2cfae142@huawei-partners.com>

On 11/22/2025 2:13 PM, Mikhail Ivanov wrote:
> On 11/22/2025 1:49 PM, Günther Noack wrote:
>> On Tue, Nov 18, 2025 at 09:46:21PM +0800, Mikhail Ivanov wrote:
>>> +/**
>>> + * struct landlock_socket_attr - Socket protocol definition
>>> + *
>>> + * Argument of sys_landlock_add_rule().
>>> + */
>>> +struct landlock_socket_attr {
>>> +    /**
>>> +     * @allowed_access: Bitmask of allowed access for a socket protocol
>>> +     * (cf. `Socket flags`_).
>>> +     */
>>> +    __u64 allowed_access;
>>> +    /**
>>> +     * @family: Protocol family used for communication
>>> +     * (cf. include/linux/socket.h).
>>> +     */
>>> +    __s32 family;
>>> +    /**
>>> +     * @type: Socket type (cf. include/linux/net.h)
>>> +     */
>>> +    __s32 type;
>>> +    /**
>>> +     * @protocol: Communication protocol specific to protocol family set in
>>> +     * @family field.
>>
>> This is specific to both the @family and the @type, not just the @family.
>>
>> From socket(2):
>>
>>    Normally only a single protocol exists to support a particular
>>    socket type within a given protocol family.
>>
>> For instance, in your commit message above the protocol in the example
>> is IPPROTO_TCP, which would imply the type SOCK_STREAM, but not work
>> with SOCK_DGRAM.
> 
> You're right.
> 

I revisited the socket(2) semantics, and this part is about the kernel
mapping (family, type, 0) to the default protocol of the given family
and type. E.g. (AF_INET, SOCK_STREAM, 0) is mapped to (AF_INET,
SOCK_STREAM, IPPROTO_TCP). I would like to clarify in the
landlock_socket_attr.protocol field doc that such a mapping takes place.

There should be a list of protocols defined per protocol family. From
socket(2):
	The domain argument specifies a communication domain.
	...
	The protocol number to use is specific to the “communication
	domain” in which communication is to take place.

Such a mapping allows defining strange socket rules when setting
@type=-1. For example:
	struct landlock_socket_attr attr = {
		.family = AF_INET,
		.type = -1,
		.protocol = 0,
	};

This definition corresponds to (AF_INET, SOCK_STREAM, 0->IPPROTO_TCP)
and to (AF_INET, SOCK_DGRAM, 0->IPPROTO_UDP).

I don't see this as a bad thing as long as there is proper documentation
for landlock_socket_attr.

^ permalink raw reply

* [PATCH net] net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()
From: Bingquan Chen @ 2026-04-18 11:20 UTC (permalink / raw)
  To: Willem de Bruijn, Greg KH
  Cc: Stephen Hemminger, security, David S . Miller, Jakub Kicinski,
	Eric Dumazet, netdev, Bingquan Chen
In-Reply-To: <2026041858-estimator-shower-0f16@gregkh>

In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
directly into the mmap'd TX ring buffer shared with userspace. The
kernel validates the header via __packet_snd_vnet_parse() but then
re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
userspace thread can modify the vnet_hdr fields between validation
and use, bypassing all safety checks.

The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
to a stack-local variable. All other vnet_hdr consumers in the kernel
(tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
path is the only caller of virtio_net_hdr_to_skb() that reads directly
from user-controlled shared memory.

Fix this by copying vnet_hdr from the mmap'd ring buffer to a
stack-local variable before validation and use, consistent with the
approach used in packet_snd() and all other callers.

Fixes: 1d036d25e560 ("packet: tpacket_snd gso and checksum offload")
Signed-off-by: Bingquan Chen <patzilla007@gmail.com>
---
 net/packet/af_packet.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4b043241fd56..8e6f3a734ba0 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2718,7 +2718,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 {
 	struct sk_buff *skb = NULL;
 	struct net_device *dev;
-	struct virtio_net_hdr *vnet_hdr = NULL;
+	struct virtio_net_hdr vnet_hdr;
+	bool has_vnet_hdr = false;
 	struct sockcm_cookie sockc;
 	__be16 proto;
 	int err, reserve = 0;
@@ -2819,16 +2820,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		hlen = LL_RESERVED_SPACE(dev);
 		tlen = dev->needed_tailroom;
 		if (vnet_hdr_sz) {
-			vnet_hdr = data;
 			data += vnet_hdr_sz;
 			tp_len -= vnet_hdr_sz;
-			if (tp_len < 0 ||
-			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
+			if (tp_len < 0) {
+				tp_len = -EINVAL;
+				goto tpacket_error;
+			}
+			memcpy(&vnet_hdr, data - vnet_hdr_sz, sizeof(vnet_hdr));
+			if (__packet_snd_vnet_parse(&vnet_hdr, tp_len)) {
 				tp_len = -EINVAL;
 				goto tpacket_error;
 			}
 			copylen = __virtio16_to_cpu(vio_le(),
-						    vnet_hdr->hdr_len);
+						    vnet_hdr.hdr_len);
+			has_vnet_hdr = true;
 		}
 		copylen = max_t(int, copylen, dev->hard_header_len);
 		skb = sock_alloc_send_skb(&po->sk,
@@ -2865,12 +2870,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
-		if (vnet_hdr_sz) {
-			if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) {
+		if (has_vnet_hdr) {
+			if (virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le())) {
 				tp_len = -EINVAL;
 				goto tpacket_error;
 			}
-			virtio_net_hdr_set_proto(skb, vnet_hdr);
+			virtio_net_hdr_set_proto(skb, &vnet_hdr);
 		}
 
 		skb->destructor = tpacket_destruct_skb;
-- 
2.53.0
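
The copy-then-validate pattern described in the commit message can be
sketched in plain userspace C. This is only an illustration under
simplifying assumptions: struct demo_hdr and demo_hdr_snapshot() are
hypothetical names, the header layout is a stand-in for struct
virtio_net_hdr, and the single check stands in for the real validation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout, standing in for struct virtio_net_hdr. */
struct demo_hdr {
	uint16_t hdr_len;
	uint16_t gso_size;
};

/*
 * Snapshot a header that lives in memory another thread may rewrite,
 * then validate the snapshot. Callers must keep using *snap afterwards
 * and never go back to the shared copy.
 * Returns 0 on success, -1 if the snapshot fails validation.
 */
int demo_hdr_snapshot(const void *shared, size_t payload_len,
		      struct demo_hdr *snap)
{
	/* the only read of the shared memory */
	memcpy(snap, shared, sizeof(*snap));

	/* every later check and use operates on the private copy */
	if (snap->hdr_len > payload_len)
		return -1;
	return 0;
}
```

Because validation and use both see the same private copy, a later write
to the shared buffer cannot change what was validated.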


^ permalink raw reply related

* Re: [PATCH net] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-18 10:59 UTC (permalink / raw)
  To: Daniel Borkmann, kuba
  Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev
In-Reply-To: <20260417220358.693101-1-daniel@iogearbox.net>

On 4/18/26 00:03, Daniel Borkmann wrote:
> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
> and applied them in ip6_parse_tlv(), the generic TLV walker
> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
> 
> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
> loop is bounded only by optlen, which can be up to 2048 bytes.
> Stuffing the Destination Options header with 2046 Pad1 (type=0)
> entries advances the scanner a single byte at a time, yielding
> ~2000 TLV iterations per extension header.
> 
> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
> the semantics from 47d3d7ac656a.
> 
> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   net/ipv6/ip6_tunnel.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 907c6a2af331..0ab76f93c136 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>   				break;
>   		}
>   		if (nexthdr == NEXTHDR_DEST) {
> +			int tlv_max = READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt);
> +			int tlv_cnt = 0;
>   			u16 i = 2;
>   
>   			while (1) {
>   				struct ipv6_tlv_tnl_enc_lim *tel;
>   
> +				if (unlikely(tlv_cnt++ >= tlv_max))
> +					break;
> +
>   				/* No more room for encapsulation limit */
>   				if (i + sizeof(*tel) > optlen)
>   					break;

Good point on reusing max_dst_opts_cnt in ip6_tnl_parse_tlv_enc_lim(), 
but this patch is not ready yet.

We need to be careful: max_dst_opts_cnt can be negative. If this is the 
case, ip6_tnl_parse_tlv_enc_lim() would probably return 0, which is not 
what we want here. From the doc:

max_dst_opts_number - INTEGER
         Maximum number of non-padding TLVs allowed in a Destination
         options extension header. If this value is less than zero
         then unknown options are disallowed and the number of known
         TLVs allowed is the absolute value of this number.

         Default: 8

Since ip6_tnl_parse_tlv_enc_lim() does not check for specific option 
types (e.g., Pad1, PadN, you-name-it) and does not differentiate known 
from unknown options during parsing, I would simply use the absolute 
value of max_dst_opts_cnt by default.

Also, I wouldn't use unlikely() because it could harm us more than it 
helps in this specific context (consistent with ip6_parse_tlv()).
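
A rough userspace sketch of what bounding the scan by the absolute value
could look like. This is not the kernel code: the demo_* names are
hypothetical, the buffer layout is simplified, and (like the quoted
patch) it counts every loop iteration, padding included.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TLV_PAD1        0  /* one-byte padding option, no length field */
#define DEMO_TLV_ENCAP_LIMIT 4  /* Tunnel Encapsulation Limit (RFC 2473) */

/* Magnitude of a signed limit, well-defined even for INT_MIN. */
static unsigned int demo_abs_limit(int limit)
{
	return limit < 0 ? 0u - (unsigned int)limit : (unsigned int)limit;
}

/*
 * hdr points at a Destination Options header of optlen bytes; its TLVs
 * start at offset 2. Returns the offset of a valid Tunnel Encapsulation
 * Limit option, or 0 if none is found within |max_cnt| TLV iterations.
 */
size_t demo_find_encap_limit(const uint8_t *hdr, size_t optlen, int max_cnt)
{
	unsigned int budget = demo_abs_limit(max_cnt);
	unsigned int seen = 0;
	size_t i = 2;

	while (seen++ < budget) {
		/* no room left for a 3-byte encapsulation limit option */
		if (i + 3 > optlen)
			break;
		if (hdr[i] == DEMO_TLV_ENCAP_LIMIT && hdr[i + 1] == 1)
			return i;
		/* Pad1 is a single byte, everything else is type+len+data */
		i += (hdr[i] == DEMO_TLV_PAD1) ? 1 : (size_t)hdr[i + 1] + 2;
	}
	return 0;
}
```

With this shape, a limit of 8 and a limit of -8 give the same iteration
budget, so a negative sysctl value no longer makes the loop bail out
immediately.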

^ permalink raw reply

* Re: [PATCH 1/4 nf] netfilter: nft_exthdr: skip SCTP chunk evaluation for non-first fragments
From: Fernando Fernandez Mancera @ 2026-04-18  9:51 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netdev, coreteam, fw, phil
In-Reply-To: <aeM3gmXM43beA3ot@chamomile>

On 4/18/26 9:49 AM, Pablo Neira Ayuso wrote:
> Hi Fernando,
> 
> On Fri, Apr 17, 2026 at 08:34:30PM +0200, Fernando Fernandez Mancera wrote:
>> The SCTP chunk matching logic in nft_exthdr relies on SCTP common header
>> being present at the transport header offset. For fragmented packets at
>> IP level, only the first fragment would match this condition.
>>
>> The nft_exthdr could be used in a PREROUTING chain with a priority lower
>> than -400. This would bypass defragmentation. In addition, it can be used
>> in stateless environments, so it should work in an environment where
>> defragmentation is not being performed at all.
> 
> Yes, and stateless filtering is still a valid configuration, i.e. when
> nf_conntrack is not loaded.
> 
>> Add a check for pkt->fragoff to ensure exthdr SCTP only evaluates
>> unfragmented packets or the first fragment in the stream.
> 
> I would suggest to squash the three small patches to check for
> pkt->fragoff in one patch. The three expressions have been already
> around for a while (backporting the combo patch that makes the same
> logical change should be easy) and it is basically the same logical
> change.
> 

Hi Pablo,

Thanks for the review! I am not sure about squashing them, as they all
have different blamed commits. I find accurate Fixes tags quite useful
when handling backports, and I guess others do too (also for stable
kernels). Is that convincing?

Anyway, not a big deal: if there is a strong preference, I will squash them.

Thanks,
Fernando.

> Thanks!
> 
>> Fixes: 133dc203d77d ("netfilter: nft_exthdr: Support SCTP chunks")
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>> ---
>>   net/netfilter/nft_exthdr.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
>> index 7eedf4e3ae9c..8eb708bb8cff 100644
>> --- a/net/netfilter/nft_exthdr.c
>> +++ b/net/netfilter/nft_exthdr.c
>> @@ -376,7 +376,7 @@ static void nft_exthdr_sctp_eval(const struct nft_expr *expr,
>>   	const struct sctp_chunkhdr *sch;
>>   	struct sctp_chunkhdr _sch;
>>   
>> -	if (pkt->tprot != IPPROTO_SCTP)
>> +	if (pkt->tprot != IPPROTO_SCTP || pkt->fragoff)
>>   		goto err;
>>   
>>   	do {
>> -- 
>> 2.53.0
>>


^ permalink raw reply

* [RFC PATCH net] mptcp: pm: fix ADD_ADDR timer infinite retry on option space insufficient
From: Li Xiasong @ 2026-04-18 10:00 UTC (permalink / raw)
  To: Matthieu Baerts, Mat Martineau, Geliang Tang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, mptcp, linux-kernel, yuehaibing, zhangchangzhong,
	weiyongjun1

When TCP option space is insufficient (e.g., IPv6 with tcp_timestamps
enabled), the original code jumped to out_unlock without clearing the
addr_signal flag. This caused mptcp_pm_add_timer to keep rescheduling
indefinitely without sending ADD_ADDR, preventing the endpoint list from
being traversed.

In a pure ACK scenario (indicated by drop_other_suboptions=true), if
the option space is insufficient to carry the ADD_ADDR suboption, it
is appropriate to drop this address signal to allow the timer handler
to move on to other addresses.

Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Signed-off-by: Li Xiasong <lixiasong1@huawei.com>
---

Seeking feedback on:

When announcing addresses to the peer, MPTCP sends a pure ACK packet
to carry MPTCP options (ADD_ADDR). In this scenario, if the option space
is insufficient for ADD_ADDR, clearing addr_signal would:

  - Prevent the timer from retrying infinitely
  - Allow the timer to continue traversing and processing other addresses
  - Not block other subflow creation or address announcement operations

Is there any scenario where we should retry later instead of clearing
the address signal/echo flag? However, if a pure ACK doesn't have
enough space for the flag, subsequent packets won't either.

---
 net/mptcp/pm.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 57a456690406..1d49779c6a1f 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -881,19 +881,18 @@ bool mptcp_pm_add_addr_signal(struct mptcp_sock *msk, const struct sk_buff *skb,
 	}

 	*echo = mptcp_pm_should_add_signal_echo(msk);
+	add_addr = msk->pm.addr_signal &
+		~(*echo ? BIT(MPTCP_ADD_ADDR_ECHO) : BIT(MPTCP_ADD_ADDR_SIGNAL));
 	port = !!(*echo ? msk->pm.remote.port : msk->pm.local.port);
-
 	family = *echo ? msk->pm.remote.family : msk->pm.local.family;
-	if (remaining < mptcp_add_addr_len(family, *echo, port))
-		goto out_unlock;

-	if (*echo) {
-		*addr = msk->pm.remote;
-		add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_ECHO);
-	} else {
-		*addr = msk->pm.local;
-		add_addr = msk->pm.addr_signal & ~BIT(MPTCP_ADD_ADDR_SIGNAL);
+	if (remaining < mptcp_add_addr_len(family, *echo, port)) {
+		if (*drop_other_suboptions)
+			WRITE_ONCE(msk->pm.addr_signal, add_addr);
+		goto out_unlock;
 	}
+
+	*addr = *echo ? msk->pm.remote : msk->pm.local;
 	WRITE_ONCE(msk->pm.addr_signal, add_addr);
 	ret = true;

-- 
2.34.1
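
The flag handling proposed above can be sketched as a small standalone
function. This is a simplified model, not the MPTCP code: the demo_*
names and bit values are hypothetical, and locking as well as the
address copy are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SIGNAL_BIT (1u << 0)  /* hypothetical: ADD_ADDR pending */
#define DEMO_ECHO_BIT   (1u << 1)  /* hypothetical: ADD_ADDR echo pending */

/*
 * Decide whether an ADD_ADDR (or its echo) fits into the remaining
 * option space. Returns true when the option should be emitted.
 * The pending bit is cleared when the option is emitted, and also when
 * it does not fit on a pure ACK, so a retry timer can move on.
 */
bool demo_add_addr_signal(uint8_t *addr_signal, bool echo,
			  unsigned int remaining, unsigned int needed,
			  bool pure_ack)
{
	uint8_t bit = echo ? DEMO_ECHO_BIT : DEMO_SIGNAL_BIT;
	uint8_t cleared = (uint8_t)(*addr_signal & ~bit);

	if (remaining < needed) {
		if (pure_ack)
			*addr_signal = cleared;  /* give up on this address */
		return false;
	}

	*addr_signal = cleared;
	return true;
}
```

In this model the bit stays set only when the option does not fit and the
packet is not a pure ACK, which is the one case where a later retry could
still find more room.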

^ permalink raw reply related

* pre-boot plugged SFP autoneg advertisement
From: markus.stockhausen @ 2026-04-18  9:27 UTC (permalink / raw)
  To: linux, andrew, hkallweit1, netdev; +Cc: 'Jonas Jelonek', jan

Hi,

I'm currently analyzing an issue where a pre-boot-plugged SFP module 
comes up with autoneg=no advertisement during boot. After an
unplug/replug autoneg=yes advertisement is chosen. 

The following addition in phylink_start() just before the call to
phylink_mac_initial_config() mitigates this.

+  /* If an SFP module was already present before phylink_start() was
+   * called, phylink_sfp_set_config() was unable to call
+   * phylink_mac_initial_config() as phylink was not yet started.
+   * Ensure the SFP capabilities are reflected in advertising.
+   */
+  if (pl->sfp_bus && !linkmode_empty(pl->sfp_support))
+    linkmode_copy(pl->link_config.advertising, pl->sfp_support);

Remark! This is about the OpenWrt Realtek Switch ecosystem with 
kernel 6.18 where we are working hard to get hardware up and 
running. We still rely heavily on pcs/dsa downstream drivers. So 
I'm unsure if my observation/idea regarding upstream phylink is 
right.

Thanks for your feedback in advance.

Markus

^ permalink raw reply

* [PATCH iwl-net v1] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race
From: Kohei Enju @ 2026-04-18  9:01 UTC (permalink / raw)
  To: intel-wired-lan, netdev
  Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Wojciech Drewek,
	Jacob Keller, Larysa Zaremba, Maciej Fijalkowski, Kohei Enju

ice_xdp_setup_prog() unconditionally hot-swaps xdp_prog when
ICE_VSI_REBUILD_PENDING is set. In the attach path, this can publish a
new rx_ring->xdp_prog before rx_ring->xdp_ring becomes valid while the
rebuild is pending. As a result, ice_clean_rx_irq() may dereference
rx_ring->xdp_ring too early.

With high-volume RX packets, running these commands in parallel
triggered a KASAN splat [1].
 # ethtool --reset $DEV irq dma filter offload
 # ip link set dev $DEV xdp {obj $OBJ sec xdp,off}

Fix this by rejecting XDP attach while rebuild is pending.
Keep XDP detach allowed in this window. Detach clears rx_ring->xdp_prog,
so the RX path will not attempt to access rx_ring->xdp_ring.

[1]
BUG: KASAN: slab-use-after-free in ice_napi_poll+0x3921/0x41a0
Read of size 2 at addr ffff88812475b880 by task ksoftirqd/1/23
[...]
Call Trace:
 <TASK>
 ice_napi_poll+0x3921/0x41a0
 __napi_poll+0x98/0x520
 net_rx_action+0x8f2/0xfa0
 handle_softirqs+0x1cb/0x7f0
[...]
 </TASK>

Allocated by task 7246:
 ice_prepare_xdp_rings+0x3de/0x12d0
 ice_xdp+0x61c/0xef0
 dev_xdp_install+0x3c4/0x840
 dev_xdp_attach+0x50a/0x10a0
 dev_change_xdp_fd+0x175/0x210
[...]

Freed by task 7251:
 __rcu_free_sheaf_prepare+0x5f/0x230
 rcu_free_sheaf+0x1a/0xf0
 rcu_core+0x567/0x1d80
 handle_softirqs+0x1cb/0x7f0

Fixes: 2504b8405768 ("ice: protect XDP configuration with a mutex")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index d1f628f1c8ac..4681cbe193f6 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2912,12 +2912,21 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
 	}
 
 	/* hot swap progs and avoid toggling link */
-	if (ice_is_xdp_ena_vsi(vsi) == !!prog ||
-	    test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
+	if (ice_is_xdp_ena_vsi(vsi) == !!prog) {
 		ice_vsi_assign_bpf_prog(vsi, prog);
 		return 0;
 	}
 
+	if (test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
+		if (prog) {
+			NL_SET_ERR_MSG_MOD(extack, "VSI rebuild is pending");
+			return -EAGAIN;
+		}
+
+		ice_vsi_assign_bpf_prog(vsi, NULL);
+		return 0;
+	}
+
 	if_running = netif_running(vsi->netdev) &&
 		     !test_and_set_bit(ICE_VSI_DOWN, vsi->state);
 
-- 
2.51.0
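
The resulting decision logic can be summarised as a small pure function.
This is only a model of the control flow in the diff above: the demo_*
names are hypothetical, and the three booleans stand in for
ice_is_xdp_ena_vsi(), !!prog and the ICE_VSI_REBUILD_PENDING bit.

```c
#include <errno.h>
#include <stdbool.h>

enum demo_xdp_action {
	DEMO_XDP_HOT_SWAP,     /* only replace the program pointer */
	DEMO_XDP_RECONFIGURE,  /* go through the full ring setup/teardown */
};

/*
 * Returns 0 and sets *action when the request may proceed, or -EAGAIN
 * when an attach arrives while a rebuild is still pending.
 */
int demo_xdp_setup_decision(bool xdp_enabled, bool have_prog,
			    bool rebuild_pending,
			    enum demo_xdp_action *action)
{
	/* same enabled/disabled state: swapping the pointer is enough */
	if (xdp_enabled == have_prog) {
		*action = DEMO_XDP_HOT_SWAP;
		return 0;
	}

	if (rebuild_pending) {
		/* attach would publish a prog before the XDP rings exist */
		if (have_prog)
			return -EAGAIN;
		/* detach only clears the prog, so it stays allowed */
		*action = DEMO_XDP_HOT_SWAP;
		return 0;
	}

	*action = DEMO_XDP_RECONFIGURE;
	return 0;
}
```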


^ permalink raw reply related

* Re: [PATCH net-deletions] caif: remove CAIF NETWORK LAYER
From: Greg KH @ 2026-04-18  8:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, corbet,
	skhan, alexs, si.yanteng, dzm91, linux, mst, jasowang, xuanzhuo,
	eperezma, xu.xin16, wang.yaxin, jiang.kun2, linusw,
	jihed.chaibi.dev, arnd, tytso, jiayuan.chen
In-Reply-To: <20260416182829.1440262-1-kuba@kernel.org>

On Thu, Apr 16, 2026 at 11:28:28AM -0700, Jakub Kicinski wrote:
> Remove CAIF (Communication CPU to Application CPU Interface), the
> ST-Ericsson modem protocol. The subsystem has been orphaned since 2013.
> The last meaningful changes from the maintainers were in March 2013:
>   a8c7687bf216 ("caif_virtio: Check that vringh_config is not null")
>   b2273be8d2df ("caif_virtio: Use vringh_notify_enable correctly")
>   0d2e1a2926b1 ("caif_virtio: Introduce caif over virtio")
> 
> Not-so-coincidentally, according to "the Internet" ST-Ericsson officially
> shut down its modem joint venture in Aug 2013.
> 
> If anyone is using this code please yell!
> 
> In the 13 years since, the code has accumulated 200 non-merge commits,
> of which 71 were cross-tree API changes, 21 carried Fixes: tags, and
> the remaining ~110 were cleanups, doc conversions, treewide refactors,
> and one partial removal (caif_hsi, ca75bcf0a83b).
> 
> We are still getting fixes to this code, in the last 10 days there were
> 3 reports on security@ about CAIF that I have been CCed on.
> 
> UAPI constants (AF_CAIF, ARPHRD_CAIF, N_CAIF, VIRTIO_ID_CAIF) and the
> SELinux classmap entry are intentionally kept for ABI stability.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> I think we should accumulate such patches over the coming days on a separate
> branch. CAIF is a no-brainer IMO but other removals may be more controversial.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* Re: [PATCH 1/2 nf] netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
From: Pablo Neira Ayuso @ 2026-04-18  7:57 UTC (permalink / raw)
  To: Fernando Fernandez Mancera; +Cc: netfilter-devel, netdev, coreteam, fw, phil
In-Reply-To: <20260417162057.3732-1-fmancera@suse.de>

On Fri, Apr 17, 2026 at 06:20:56PM +0200, Fernando Fernandez Mancera wrote:
> In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
> and passed by reference to nf_osf_match_one() for each fingerprint
> checked. During TCP option parsing, nf_osf_match_one() advances the
> shared ctx->optp pointer.
> 
> If a fingerprint perfectly matches, the function returns early without
> restoring ctx->optp to its initial state. If the user has configured
> NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
> However, because ctx->optp was not restored, the next call to
> nf_osf_match_one() starts parsing from the end of the options buffer.
> This causes subsequent matches to read garbage data and fail
> immediately, making it impossible to log more than one match or logging
> incorrect matches.
> 
> Instead of using a shared ctx->optp pointer, pass the context as a
> constant pointer and use a local pointer (optp) for TCP option
> traversal. This makes nf_osf_match_one() strictly stateless from the
> caller's perspective, ensuring every fingerprint check starts at the
> correct option offset.
> 
> Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
> Suggested-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>

Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
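
For readers following along, the "const context plus local cursor" idea
from the commit message can be sketched like this. It is a toy model, not
the nfnetlink_osf code: struct demo_ctx and demo_match_one() are
hypothetical, and the byte-for-byte comparison stands in for the real
option matching.

```c
#include <stddef.h>

/* Hypothetical parse context, loosely modelled on struct nf_osf_hdr_ctx. */
struct demo_ctx {
	const unsigned char *optp;  /* start of the option bytes */
	size_t optsize;             /* number of option bytes */
};

/*
 * Compare the option bytes against one fingerprint. The context is
 * const and the walk uses a local cursor, so ctx->optp is never
 * advanced. Returns 1 on a match, 0 otherwise.
 */
int demo_match_one(const struct demo_ctx *ctx,
		   const unsigned char *finger, size_t finger_len)
{
	const unsigned char *optp = ctx->optp;  /* local cursor */
	size_t i;

	if (finger_len != ctx->optsize)
		return 0;

	for (i = 0; i < finger_len; i++) {
		if (*optp++ != finger[i])
			return 0;
	}
	return 1;
}
```

Calling demo_match_one() repeatedly with the same context starts every
comparison at the same offset, which is the property the log-all case
needs.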

^ permalink raw reply

* Re: [PATCH 2/2 nf] netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
From: Pablo Neira Ayuso @ 2026-04-18  7:53 UTC (permalink / raw)
  To: Fernando Fernandez Mancera
  Cc: netfilter-devel, netdev, coreteam, fw, phil, Kito Xu (veritas501)
In-Reply-To: <20260417162057.3732-2-fmancera@suse.de>

On Fri, Apr 17, 2026 at 06:20:57PM +0200, Fernando Fernandez Mancera wrote:
> The nf_osf_ttl() function accessed skb->dev to perform a local interface
> address lookup without verifying that the device pointer was valid.
> 
> Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
> loop to match the packet source address against local interface
> addresses. It assumed that packets from the same subnet should not see a
> decrement on the initial TTL. A packet might appear to be from the same
> subnet when it actually isn't, especially in modern environments with
> containers and virtual switching.
> 
> Remove the device dereference and interface loop. Replace the logic with
> a switch statement that evaluates the TTL according to the ttl_check.
> 
> Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
> Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
> Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>

Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>

> ---
> Note: if some help is needed during the backport I can assist.
> ---
>  net/netfilter/nfnetlink_osf.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
> index f58267986453..f0d1e596e146 100644
> --- a/net/netfilter/nfnetlink_osf.c
> +++ b/net/netfilter/nfnetlink_osf.c
> @@ -31,26 +31,18 @@ EXPORT_SYMBOL_GPL(nf_osf_fingers);
>  static inline int nf_osf_ttl(const struct sk_buff *skb,
>  			     int ttl_check, unsigned char f_ttl)
>  {
> -	struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
>  	const struct iphdr *ip = ip_hdr(skb);
> -	const struct in_ifaddr *ifa;
> -	int ret = 0;
>  
> -	if (ttl_check == NF_OSF_TTL_TRUE)
> +	switch (ttl_check) {
> +	case NF_OSF_TTL_TRUE:
>  		return ip->ttl == f_ttl;
> -	if (ttl_check == NF_OSF_TTL_NOCHECK)
> -		return 1;
> -	else if (ip->ttl <= f_ttl)
> +		break;
> +	case NF_OSF_TTL_NOCHECK:
>  		return 1;
> -
> -	in_dev_for_each_ifa_rcu(ifa, in_dev) {
> -		if (inet_ifa_match(ip->saddr, ifa)) {
> -			ret = (ip->ttl == f_ttl);
> -			break;
> -		}
> +	case NF_OSF_TTL_LESS:
> +	default:
> +		return ip->ttl <= f_ttl;
>  	}
> -
> -	return ret;
>  }
>  
>  struct nf_osf_hdr_ctx {
> -- 
> 2.53.0
> 

^ permalink raw reply

* Re: [PATCH 4/4 nf] netfilter: xtables: fix L4 header parsing for non-first fragments
From: Pablo Neira Ayuso @ 2026-04-18  7:51 UTC (permalink / raw)
  To: Fernando Fernandez Mancera; +Cc: netfilter-devel, netdev, coreteam, fw, phil
In-Reply-To: <20260417183433.4739-6-fmancera@suse.de>

On Fri, Apr 17, 2026 at 08:34:35PM +0200, Fernando Fernandez Mancera wrote:
> The TPROXY target and osf match rely on the L4 header to operate. For
> fragmented packets, every fragment carries the transport protocol
> identifier, but only the first fragment contains the L4 header.
> 
> As the 'raw' table can be configured to run at priority -450 (before
> defragmentation at -400), the target/match can be reached before
> reassembly. In this case, non-first fragments have their payload
> incorrectly parsed as a TCP/UDP header.

I see, this refers to a misconfiguration scenario.

> Add a fragment check to ensure TPROXY/osf only evaluates unfragmented
> packets or the first fragment in the stream.

LGTM this combo patch for osf and TPROXY in xtables.

Thanks.

> Fixes: 902d6a4c2a4f ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> ---
>  net/netfilter/xt_TPROXY.c | 8 ++++++--
>  net/netfilter/xt_osf.c    | 3 +++
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
> index e4bea1d346cf..ac4b011ce48c 100644
> --- a/net/netfilter/xt_TPROXY.c
> +++ b/net/netfilter/xt_TPROXY.c
> @@ -40,6 +40,9 @@ tproxy_tg4(struct net *net, struct sk_buff *skb, __be32 laddr, __be16 lport,
>  	struct udphdr _hdr, *hp;
>  	struct sock *sk;
>  
> +	if (ip_is_fragment(iph))
> +		return NF_DROP;
> +
>  	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
>  	if (hp == NULL)
>  		return NF_DROP;
> @@ -106,6 +109,7 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
>  {
>  	const struct ipv6hdr *iph = ipv6_hdr(skb);
>  	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
> +	unsigned short fragoff = 0;
>  	struct udphdr _hdr, *hp;
>  	struct sock *sk;
>  	const struct in6_addr *laddr;
> @@ -113,8 +117,8 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
>  	int thoff = 0;
>  	int tproto;
>  
> -	tproto = ipv6_find_hdr(skb, &thoff, -1, NULL, NULL);
> -	if (tproto < 0)
> +	tproto = ipv6_find_hdr(skb, &thoff, -1, &fragoff, NULL);
> +	if (tproto < 0 || fragoff)
>  		return NF_DROP;
>  
>  	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
> diff --git a/net/netfilter/xt_osf.c b/net/netfilter/xt_osf.c
> index dc9485854002..889dff4daff0 100644
> --- a/net/netfilter/xt_osf.c
> +++ b/net/netfilter/xt_osf.c
> @@ -27,6 +27,9 @@
>  static bool
>  xt_osf_match_packet(const struct sk_buff *skb, struct xt_action_param *p)
>  {
> +	if (ip_is_fragment(ip_hdr(skb)))
> +		return false;
> +
>  	return nf_osf_match(skb, xt_family(p), xt_hooknum(p), xt_in(p),
>  			    xt_out(p), p->matchinfo, xt_net(p), nf_osf_fingers);
>  }
> -- 
> 2.53.0
> 
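
As a side illustration of the IPv4 fragment bits involved, the sketch
below separates "any fragment" from "non-first fragment". It is plain
userspace C with hypothetical demo_* names, and it assumes the usual
IPv4 layout of the frag_off field (MF flag 0x2000, 13-bit offset mask
0x1fff, network byte order on the wire).

```c
#include <arpa/inet.h>
#include <stdint.h>

#define DEMO_IP_MF     0x2000  /* "more fragments" flag */
#define DEMO_IP_OFFSET 0x1fff  /* fragment offset mask, 8-byte units */

/* True for any fragment, including the first one (MF set, offset 0). */
int demo_is_fragment(uint16_t frag_off_be)
{
	return (ntohs(frag_off_be) & (DEMO_IP_MF | DEMO_IP_OFFSET)) != 0;
}

/* True only for fragments that do not start at offset 0. */
int demo_is_non_first_fragment(uint16_t frag_off_be)
{
	return (ntohs(frag_off_be) & DEMO_IP_OFFSET) != 0;
}
```

A first fragment carries the L4 header but still has MF set, so the two
helpers disagree on it; only fragments with a non-zero offset lack the
transport header.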

^ permalink raw reply

* Re: [PATCH 1/4 nf] netfilter: nft_exthdr: skip SCTP chunk evaluation for non-first fragments
From: Pablo Neira Ayuso @ 2026-04-18  7:49 UTC (permalink / raw)
  To: Fernando Fernandez Mancera; +Cc: netfilter-devel, netdev, coreteam, fw, phil
In-Reply-To: <20260417183433.4739-1-fmancera@suse.de>

Hi Fernando,

On Fri, Apr 17, 2026 at 08:34:30PM +0200, Fernando Fernandez Mancera wrote:
> The SCTP chunk matching logic in nft_exthdr relies on SCTP common header
> being present at the transport header offset. For fragmented packets at
> IP level, only the first fragment would match this condition.
> 
> The nft_exthdr could be used in a PREROUTING chain with a priority lower
> than -400. This would bypass defragmentation. In addition, it can be used
> in stateless environments, so it should work in an environment where
> defragmentation is not being performed at all.

Yes, and stateless filtering is still a valid configuration, i.e. when
nf_conntrack is not loaded.

> Add a check for pkt->fragoff to ensure exthdr SCTP only evaluates
> unfragmented packets or the first fragment in the stream.

I would suggest to squash the three small patches to check for
pkt->fragoff in one patch. The three expressions have been already
around for a while (backporting the combo patch that makes the same
logical change should be easy) and it is basically the same logical
change.

Thanks!

> Fixes: 133dc203d77d ("netfilter: nft_exthdr: Support SCTP chunks")
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> ---
>  net/netfilter/nft_exthdr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
> index 7eedf4e3ae9c..8eb708bb8cff 100644
> --- a/net/netfilter/nft_exthdr.c
> +++ b/net/netfilter/nft_exthdr.c
> @@ -376,7 +376,7 @@ static void nft_exthdr_sctp_eval(const struct nft_expr *expr,
>  	const struct sctp_chunkhdr *sch;
>  	struct sctp_chunkhdr _sch;
>  
> -	if (pkt->tprot != IPPROTO_SCTP)
> +	if (pkt->tprot != IPPROTO_SCTP || pkt->fragoff)
>  		goto err;
>  
>  	do {
> -- 
> 2.53.0
> 

^ permalink raw reply

* Re: [PATCH net-next] r8169: report per-queue statistics through netdev qstats
From: Eric Dumazet @ 2026-04-18  6:27 UTC (permalink / raw)
  To: Gustavo Arantes
  Cc: Heiner Kallweit, nic_swsd, Andrew Lunn, David S . Miller,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260418021232.5425-1-dev.gustavoa@gmail.com>

On Fri, Apr 17, 2026 at 7:12 PM Gustavo Arantes <dev.gustavoa@gmail.com> wrote:
>
> r8169 maintains synchronized per-CPU software counters for packet and byte
> accounting, but does not expose them through the netdev qstats interface.
>
> Add netdev_stat_ops callbacks and report the existing software counters
> through queue 0 for both Rx and Tx. Provide zero base stats so device-scope
> qstats report the packet and byte counters as supported and match the
> existing RTNL statistics.
>
> Signed-off-by: Gustavo Arantes <dev.gustavoa@gmail.com>

## Form letter - net-next-closed

Please repost when net-next reopens after Apr 27th.

RFC patches sent for review only are obviously welcome at any time.

See:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle

pw-bot: cr

^ permalink raw reply

* Re: [PATCH net 1/2] tcp: call sk_data_ready() after listener migration
From: Eric Dumazet @ 2026-04-18  6:02 UTC (permalink / raw)
  To: Zhenzhong Wu
  Cc: netdev, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	shuah, tamird, linux-kernel, linux-kselftest, stable
In-Reply-To: <20260418041633.691435-2-jt26wzz@gmail.com>

On Fri, Apr 17, 2026 at 9:17 PM Zhenzhong Wu <jt26wzz@gmail.com> wrote:
>
> When inet_csk_listen_stop() migrates an established child socket from
> a closing listener to another socket in the same SO_REUSEPORT group,
> the target listener gets a new accept-queue entry via
> inet_csk_reqsk_queue_add(), but that path never notifies the target
> listener's waiters.
>
> As a result, a nonblocking accept() still succeeds because it checks
> the accept queue directly, but waiters that sleep for listener
> readiness can remain asleep until another connection generates a
> wakeup. This affects poll()/epoll_wait()-based waiters, and can also
> leave a blocking accept() asleep after migration even though the
> child is already in the target listener's accept queue.
>
> This was observed in a local test where listener A completed the
> handshake, queued the child, and was closed before userspace called
> accept(). The child was migrated to listener B, but listener B never
> received a wakeup for the migrated accept-queue entry.
>
> Call READ_ONCE(nsk->sk_data_ready)(nsk) after a successful migration
> in inet_csk_listen_stop().
>
> The reqsk_timer_handler() path does not need the same change:
> half-open requests only become readable to userspace when the final
> ACK completes the handshake, and tcp_child_process() already wakes
> the listener in that case.
>
> Fixes: 54b92e841937 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
> ---
>  net/ipv4/inet_connection_sock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 4ac3ae1bc..da1ce082f 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1483,6 +1483,7 @@ void inet_csk_listen_stop(struct sock *sk)
>                                         __NET_INC_STATS(sock_net(nsk),
>                                                         LINUX_MIB_TCPMIGRATEREQSUCCESS);
>                                         reqsk_migrate_reset(req);
> +                                       READ_ONCE(nsk->sk_data_ready)(nsk);

I think this is adding a potential UAF (Use After Free).
@nsk might have been freed already by another thread/cpu.
Note the existing code already has similar issues.

Untested patch:

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4ac3ae1bc1afc3a39f2790e39b4dda877dc3272b..287b6e01c4f71bfec3dd2a708f316224d9eb4a64 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1479,6 +1479,7 @@ void inet_csk_listen_stop(struct sock *sk)
                        if (nreq) {
                                refcount_set(&nreq->rsk_refcnt, 1);

+                               rcu_read_lock();
                                if (inet_csk_reqsk_queue_add(nsk, nreq, child)) {
                                        __NET_INC_STATS(sock_net(nsk),
                                                        LINUX_MIB_TCPMIGRATEREQSUCCESS);
@@ -1489,7 +1490,7 @@ void inet_csk_listen_stop(struct sock *sk)
                                        reqsk_migrate_reset(nreq);
                                        __reqsk_free(nreq);
                                }
-
+                               rcu_read_unlock();
                                /* inet_csk_reqsk_queue_add() has already
                                 * called inet_child_forget() on failure case.
                                 */
