Netdev List


* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio LAWSON @ 2026-04-16 14:25 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Marek Vasut, Woojung Huh, UNGLinuxDriver, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Chevallier, Simon Horman, Heiner Kallweit, Russell King,
	netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <03040421-89e7-4422-9fb5-0367a34323e4@lunn.ch>

On 4/16/26 14:25, Andrew Lunn wrote:
>> Yes, I think a reasonable compromise could be to expose three tunables:
>>
>> - a boolean "short-cable" tunable, which applies the known good settings
>>    (LPF 62 MHz BW, DSP EQ initial value 0).
>>
>> - an integer LPF bandwidth tunable, for advanced use cases where further
>>    tuning is needed;
>>
>> - an integer DSP EQ initial value tunable, for the same advanced cases.
>>
>> The boolean tunable would follow the KISS principle and cover the common
>> scenario, while the more granular controls would remain optional.
> 
> How do the three interact? Do you need to first enable short-cable
> before you set LPF bandwidth or DSP EQ? If it is not enabled, do you
> get -EINVAL?
> 
> It seems like having extack would be useful to return informative
> error messages to user space, however, that requires netlink
> ethtool. And ETHTOOL_PHY_STUNABLE has not been added to netlink
> ethtool yet :-(
> 
> 	Andrew

My intention would be to keep the interactions as simple and
non-surprising as possible, and avoid requiring any particular ordering
or state machine between the tunables.

The boolean short-cable tunable would simply apply the preset
in one step. The LPF bandwidth and DSP EQ initial value tunables would 
be orthogonal knobs which can be set independently at any time, 
regardless of whether short-cable is enabled or not.

With this model, we don’t need to return -EINVAL for combinations or
ordering, and userspace does not need detailed error reporting. The
tunables behave more like simple setters than a mode switch, which
keeps the API predictable and avoids the need for extack or netlink
ethtool support at this point.
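
To make the intended semantics concrete, here is a minimal userspace
sketch of the "simple setters" model. It is not driver code: the struct
and function names are hypothetical, the MHz representation of the LPF
bandwidth is an assumption, and range validation is left out. Only the
preset values (LPF 62 MHz BW, DSP EQ initial value 0) come from the
proposal above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Preset named in the thread: LPF 62 MHz BW, DSP EQ initial value 0. */
#define SHORT_CABLE_LPF_BW_MHZ  62
#define SHORT_CABLE_DSP_EQ_INIT 0

/* Hypothetical userspace model of the PHY tuning state; not driver code. */
struct phy_tune_model {
	bool short_cable;  /* last value written to the boolean tunable */
	int lpf_bw_mhz;    /* LPF bandwidth, modelled in MHz for readability */
	int dsp_eq_init;   /* DSP EQ initial value */
};

/* Boolean tunable: enabling applies the whole preset in one step. */
int tune_set_short_cable(struct phy_tune_model *t, bool enable)
{
	assert(t != NULL);
	t->short_cable = enable;
	if (enable) {
		t->lpf_bw_mhz = SHORT_CABLE_LPF_BW_MHZ;
		t->dsp_eq_init = SHORT_CABLE_DSP_EQ_INIT;
	}
	/* Assumption: disabling leaves the two values untouched. */
	return 0;
}

/* Plain setter: does not look at short_cable, so no ordering errors. */
int tune_set_lpf_bw(struct phy_tune_model *t, int mhz)
{
	assert(t != NULL);
	t->lpf_bw_mhz = mhz;
	return 0;
}

/* Plain setter: likewise independent of the other two tunables. */
int tune_set_dsp_eq_init(struct phy_tune_model *t, int value)
{
	assert(t != NULL);
	t->dsp_eq_init = value;
	return 0;
}
```

In this model the last write wins: setting short-cable after a manual
LPF value overwrites it with the preset, and a later manual write
overrides the preset again, without any call ever failing because of
ordering.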

Fidelio


* Re: [PATCH net v3 4/5] ice: skip unnecessary VF reset when setting trust
From: Simon Horman @ 2026-04-16 13:55 UTC (permalink / raw)
  To: jtornosm
  Cc: Simon Horman, stable, davem, jesse.brandeburg, edumazet,
	przemyslaw.kitszel, netdev, pabeni, intel-wired-lan,
	anthony.l.nguyen, kuba
In-Reply-To: <20260414110006.124286-5-jtornosm@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
ice: skip unnecessary VF reset when setting trust

Similar to the i40e fix, ice_set_vf_trust() unconditionally calls
ice_reset_vf() when the trust setting changes. The ice driver already
has logic to clean up MAC LLDP filters when removing trust, which is
the only operation that requires filter synchronization. After this
cleanup, the VF reset is only necessary if there were actually filters
to remove.

> diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
> --- a/drivers/net/ethernet/intel/ice/ice_sriov.c
> +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
> @@ -1399,14 +1399,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
>
>  	mutex_lock(&vf->cfg_lock);
>
> -	while (!trusted && vf->num_mac_lldp)
> -		ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
> -
>  	vf->trusted = trusted;
> -	ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
>  	dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
>  		 vf_id, trusted ? "" : "un");
>
> +	/* Only reset VF if removing trust and there are MAC LLDP filters
> +	 * to clean up. Reset is needed to ensure filter removal completes.
> +	 */
> +	if (!trusted && vf->num_mac_lldp) {
> +		while (vf->num_mac_lldp)
> +			ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
> +		ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
> +	}
> +

Does setting vf->trusted before the cleanup loop break the filter
removal? Looking at the cleanup mechanism, ice_vf_update_mac_lldp_num()
calls ice_vf_is_lldp_ena() which checks both num_mac_lldp AND
vf->trusted:

ice_vf_lib.h:ice_vf_is_lldp_ena():
    return vf->num_mac_lldp && vf->trusted;

With the new ordering where vf->trusted is set to false before the loop:

ice_vf_update_mac_lldp_num() logic:
    was_ena = ice_vf_is_lldp_ena(vf);  // evaluates to (num > 0 && FALSE) = FALSE
    vf->num_mac_lldp--;
    is_ena = ice_vf_is_lldp_ena(vf);   // evaluates to FALSE

    if (was_ena != is_ena)              // FALSE != FALSE is FALSE
        ice_vsi_cfg_sw_lldp(vsi, false, false);  // NEVER called

But in the old code, vf->trusted was still TRUE during the loop:

    was_ena = (num > 0 && TRUE) = TRUE
    vf->num_mac_lldp--;
    is_ena = (0 && TRUE) = FALSE      // when last filter removed

    if (was_ena != is_ena)            // TRUE != FALSE is TRUE
        ice_vsi_cfg_sw_lldp(...)      // called to remove LLDP filters

So the ETH_P_LLDP packet filters are not removed from the hardware
switch, leaving stale filter rules. Should vf->trusted be set after the
cleanup loop instead?
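
The ordering effect described above can be reproduced with a small
userspace model. This is only a sketch: the struct and function names
are hypothetical stand-ins, and the counter merely records how often
the modelled ice_vsi_cfg_sw_lldp() call would have been reached. The
two conditions are copied from the logic quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant VF state; not the ice driver. */
struct vf_model {
	bool trusted;
	unsigned int num_mac_lldp;
	unsigned int sw_lldp_cfg_calls; /* modelled ice_vsi_cfg_sw_lldp() calls */
};

/* Mirrors the quoted check: num_mac_lldp && trusted. */
bool vf_model_lldp_ena(const struct vf_model *vf)
{
	return vf->num_mac_lldp && vf->trusted;
}

/* Mirrors the quoted decrement path of ice_vf_update_mac_lldp_num(). */
void vf_model_dec_mac_lldp(struct vf_model *vf)
{
	bool was_ena, is_ena;

	assert(vf->num_mac_lldp > 0);
	was_ena = vf_model_lldp_ena(vf);
	vf->num_mac_lldp--;
	is_ena = vf_model_lldp_ena(vf);

	if (was_ena != is_ena)
		vf->sw_lldp_cfg_calls++;
}

/* Ordering in the patch: clear trust first, then drain the filters. */
unsigned int untrust_then_drain(unsigned int nfilters)
{
	struct vf_model vf = { .trusted = true, .num_mac_lldp = nfilters };

	vf.trusted = false;
	while (vf.num_mac_lldp)
		vf_model_dec_mac_lldp(&vf);
	return vf.sw_lldp_cfg_calls;
}

/* Ordering in the old code: drain the filters, then clear trust. */
unsigned int drain_then_untrust(unsigned int nfilters)
{
	struct vf_model vf = { .trusted = true, .num_mac_lldp = nfilters };

	while (vf.num_mac_lldp)
		vf_model_dec_mac_lldp(&vf);
	vf.trusted = false;
	return vf.sw_lldp_cfg_calls;
}
```

With three filters, the model reaches the cleanup call once when trust
is cleared after the loop and never when trust is cleared before it.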

>  	mutex_unlock(&vf->cfg_lock);
>
>  out_put_vf:


* Re: [RFC PATCH net-next 0/3] seg6: SRv6 L2 VPN with End.DT2U and srl2 device
From: Ferenc Fejes @ 2026-04-16 13:55 UTC (permalink / raw)
  To: justin.iurman@6wind.com, stefano.salsano@uniroma2.it,
	andrea.mayer@uniroma2.it
  Cc: davem@davemloft.net, dsahern@kernel.org,
	paolo.lungaroni@uniroma2.it, ahabdels@cisco.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com
In-Reply-To: <6fef33af-b351-4bf6-a210-ae2283ec9f69@uniroma2.it>

Hi

...

> 
> > I would personally love to have
> > "l2srv6", although I suspect we may end up with "l2seg6" to remain
> > consistent.
> 
> I'd like to keep it as short as possible, our initial idea was seg6l2,
> very close to your last suggestion, then we opted for srl2 losing the
> "6" concept
> 
> now it comes to my mind that l2 is somehow redundant in an interface
> type name, as an interface is an l2 concept per se, so my preferred
> option becomes:
> 
> sr6

For this type I agree that srl2 and sr6 both make sense. In the long
run, someone else may want to implement an [SR-]MPLS pseudowire for L2
in Linux (AFAIK only L3 LSR mode is supported right now). For such an
interface type, the srl2 name would make more sense.

Or use "pwl2" or "srpw", which would cover both the SR-MPLS and SRv6
cases?

In case SR-MPLS PW and SRv6 L2 must be different iface types, I would
use seg6 or sr6 to be consistent with the naming we have in iproute2
right now. Linux/iproute2 use seg6 and seg6local instead of the IETF
names like SRv6 H.End and SRv6 End, so I think it's more consistent if
we stick to it.

> 
> (short and memorable...)
> 
> as an alternative seg6 can work but I strongly prefer sr6
> 
> ciao
> Stefano
> 
> 
> > 
> > Cheers,
> > Justin
> 

Best,
Ferenc


* RE: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
From: Chia-Yu Chang (Nokia) @ 2026-04-16 13:52 UTC (permalink / raw)
  To: Paolo Abeni, linux-hardening@vger.kernel.org, kees@kernel.org,
	gustavoars@kernel.org, jhs@mojatatu.com, jiri@resnulli.us,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	horms@kernel.org, ij@kernel.org, ncardwell@google.com,
	Koen De Schepper (Nokia), g.white@cablelabs.com,
	ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
	cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
	vidhi_goel@apple.com
In-Reply-To: <9ff2df3e-08cf-4f61-8a58-cac0a6980b2d@redhat.com>

> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com> 
> Sent: Thursday, April 16, 2026 3:26 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; linux-hardening@vger.kernel.org; kees@kernel.org; gustavoars@kernel.org; jhs@mojatatu.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; horms@kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> Subject: Re: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
> 
> 
> 
> 
> 
> On 4/13/26 6:37 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> >
> > Fix dualpi2_change() to correctly enforce updated limit and memlimit 
> > values after a configuration change of the dualpi2 qdisc.
> >
> > Before this patch, dualpi2_change() always attempted to dequeue 
> > packets via the root qdisc (C-queue) when reducing backlog or memory 
> > usage, and unconditionally assumed that a valid skb will be returned. 
> > When traffic classification results in packets being queued in the 
> > L-queue while the C-queue is empty, this leads to a NULL skb 
> > dereference during limit or memlimit enforcement.
> >
> > This is fixed by first dequeuing from the C-queue path if it is non-empty.
> > Once the C-queue is empty, packets are dequeued directly from the L-queue.
> > Return values from qdisc_dequeue_internal() are checked for both 
> > queues. When dequeuing from the L-queue, the parent qdisc qlen and 
> > backlog counters are updated explicitly to keep overall qdisc statistics consistent.
> >
> > Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  net/sched/sch_dualpi2.c | 24 +++++++++++++++++++-----
> >  1 file changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> > index 6d7e6389758d..56d4422970b6 100644
> > --- a/net/sched/sch_dualpi2.c
> > +++ b/net/sched/sch_dualpi2.c
> > @@ -872,11 +872,25 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
> >       old_backlog = sch->qstats.backlog;
> >       while (qdisc_qlen(sch) > sch->limit ||
> >              q->memory_used > q->memory_limit) {
> > -             struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
> > -
> > -             q->memory_used -= skb->truesize;
> > -             qdisc_qstats_backlog_dec(sch, skb);
> > -             rtnl_qdisc_drop(skb, sch);
> > +             int c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> > +             struct sk_buff *skb = NULL;
> > +
> > +             if (c_len) {
> > +                     skb = qdisc_dequeue_internal(sch, true);
> > +                     if (!skb)
> > +                             break;
> > +                     q->memory_used -= skb->truesize;
> > +                     rtnl_qdisc_drop(skb, sch);
> > +             } else if (qdisc_qlen(q->l_queue)) {
> > +                     skb = qdisc_dequeue_internal(q->l_queue, true);
> > +                     if (!skb)
> > +                             break;
> > +                     q->memory_used -= skb->truesize;
> > +                     rtnl_qdisc_drop(skb, q->l_queue);
> > +                     /* Keep the overall qdisc stats consistent */
> > +                     --sch->q.qlen;
> > +                     qdisc_qstats_backlog_dec(sch, skb);
> 
> Sashiko says:
> ---
> The drop counter is incremented for the L-queue via rtnl_qdisc_drop(), but it appears the drop counter for the parent qdisc (sch) is not updated.
> Will this cause user-facing statistics for the overall dualpi2 qdisc to underreport drops?
> ---

Hi Paolo,

You are right, I missed this.
I will add "qdisc_qstats_drop(sch)" for the L-queue dropping case.
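
As a rough illustration of the accounting involved, here is a userspace
model of the L-queue drop path with the extra parent drop counter. It
is a sketch only: the struct and function names are hypothetical, and
it assumes (as the c_len computation in the patch suggests) that the
parent qlen counts packets held in both queues.

```c
#include <assert.h>

/* Hypothetical per-qdisc counters; not the kernel's struct Qdisc. */
struct qstats_model {
	unsigned int qlen;     /* packets currently queued */
	unsigned int backlog;  /* bytes currently queued (parent only here) */
	unsigned int drops;    /* packets dropped so far */
};

/*
 * Model of dropping one packet of pkt_len bytes from the L-queue
 * while enforcing a reduced limit on the parent qdisc.
 */
void drop_from_l_queue(struct qstats_model *parent,
		       struct qstats_model *l_queue,
		       unsigned int pkt_len)
{
	assert(parent->qlen > 0 && l_queue->qlen > 0);
	assert(parent->backlog >= pkt_len);

	/* Dequeue from the L-queue itself. */
	l_queue->qlen--;

	/* Modelled rtnl_qdisc_drop(skb, q->l_queue): child drop counter. */
	l_queue->drops++;

	/* Keep the overall qdisc stats consistent, as in the patch. */
	parent->qlen--;
	parent->backlog -= pkt_len;

	/* Proposed addition: count the drop on the parent as well. */
	parent->drops++;
}
```

Without the last increment, the model's parent would report zero drops
even though its queue length and backlog shrank.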

Thanks!
Chia-Yu



* Re: [PATCH net-next 5/6] net: stmmac: move PHY handling out of __stmmac_open()/release()
From: Russell King (Oracle) @ 2026-04-16 13:47 UTC (permalink / raw)
  To: Alexander Stein
  Cc: Andrew Lunn, Heiner Kallweit, Alexandre Torgue, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, linux-arm-kernel,
	linux-stm32, Maxime Coquelin, netdev, Paolo Abeni
In-Reply-To: <aeDSTIS9-TDSihbX@shell.armlinux.org.uk>

On Thu, Apr 16, 2026 at 01:13:00PM +0100, Russell King (Oracle) wrote:
> On Thu, Apr 16, 2026 at 02:02:53PM +0200, Alexander Stein wrote:
> > Hi Russell,
> > 
> > Am Donnerstag, 16. April 2026, 12:49:25 CEST schrieb Russell King (Oracle):
> > > On Thu, Apr 16, 2026 at 08:20:13AM +0200, Alexander Stein wrote:
> > > > Am Mittwoch, 15. April 2026, 14:59:32 CEST schrieb Russell King (Oracle):
> > > > > On Wed, Apr 15, 2026 at 08:08:40AM +0200, Alexander Stein wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Am Dienstag, 23. September 2025, 13:26:19 CEST schrieb Russell King (Oracle):
> > > > > > > Move the PHY attachment/detachment from the network driver out of
> > > > > > > __stmmac_open() and __stmmac_release() into stmmac_open() and
> > > > > > > stmmac_release() where these actions will only happen when the
> > > > > > > interface is administratively brought up or down. It does not make
> > > > > > > sense to detach and re-attach the PHY during a change of MTU.
> > > > > > 
> > > > > > Sorry for coming up now. But I recently noticed this commit breaks changing
> > > > > > the MTU on i.MX8MP. Once I simply change the MTU I run into some DMA error:
> > > > > > $ ip link set dev end1 mtu 1400
> > > > > > imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0
> > > > > > imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-1
> > > > > > imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-2
> > > > > > imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-3
> > > > > > imx-dwmac 30bf0000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-4
> > > > > > imx-dwmac 30bf0000.ethernet end1: Link is Down
> > > > > > imx-dwmac 30bf0000.ethernet end1: Failed to reset the dma
> > > > > > imx-dwmac 30bf0000.ethernet end1: stmmac_hw_setup: DMA engine initialization failed
> > > > > 
> > > > > This basically means that a clock is missing. Please provide more
> > > > > information:
> > > > > 
> > > > > - what kernel version are you using?
> > > > 
> > > > Currently I am using v6.18.22.
> > > > $ ethtool -i end1
> > > > driver: st_gmac
> > > > version: 6.18.22
> > > > firmware-version: 
> > > > expansion-rom-version: 
> > > > bus-info: 30bf0000.ethernet
> > > > supports-statistics: yes
> > > > supports-test: no
> > > > supports-eeprom-access: no
> > > > supports-register-dump: yes
> > > > supports-priv-flags: no
> > > > 
> > > > > - has EEE been negotiated?
> > > > 
> > > > No. It is marked as not supported
> > > > 
> > > > $ ethtool --show-eee end1
> > > > EEE settings for end1:
> > > >         EEE status: not supported
> > > > 
> > > > > - does the problem persist when EEE is disabled?
> > > > 
> > > > As EEE is not supported the problem occurs even with EEE disabled.
> > > > 
> > > > > - which PHY is attached to stmmac?
> > > > 
> > > > It is a TI DP83867.
> > > > 
> > > > imx-dwmac 30bf0000.ethernet eth1: PHY [stmmac-1:03] driver [TI DP83867] (irq=136)
> > > > 
> > > > > - which PHY interface mode is being used to connect the PHY to stmmac?
> > > > 
> > > > For this interface
> > > > > phy-mode = "rgmii-id";
> > > > is set.
> > > > 
> > > > In case it is helpful. My platform is arch/arm64/boot/dts/freescale/imx8mp-tqma8mpql-mba8mpxl.dts
> > > > Thanks for assisting. If there are further questions, don't hesitate to ask.
> > > 
> > > Thanks.
> > > 
> > > So, as best I can determine at the moment, we end up with the following
> > > sequence:
> > > 
> > > stmmac_change_mtu()
> > >  __stmmac_release()
> > >   phylink_stop()
> > >    phy_stop()
> > >     phy->state = PHY_HALTED
> > >     _phy_state_machine() returns PHY_STATE_WORK_SUSPEND
> > >     _phy_state_machine_post_work()
> > >      phy_suspend()
> > >       genphy_suspend()
> > >        phy_set_bits(phydev, MII_BMCR, BMCR_PDOWN)
> > > 
> > > With the DP83867, this causes most of the PHY to be powered down, thus
> > > stopping the clocks, and this causes the stmmac reset to time out.
> > > 
> > > Prior to this commit, we would have called phylink_disconnect_phy()
> > > immediately after phylink_stop(), but I can see nothing that would
> > > be affected by this change there (since that also calls
> > > phy_suspend(), but as the PHY is already suspended, this becomes a
> > > no-op.)
> > > 
> > > However, __stmmac_open() would have called stmmac_init_phy(), which
> > > would reattach the PHY. This would have called phy_init_hw(), 
> > > resetting the PHY, and phy_resume() which would ensure that the
> > > PDOWN bit is clear - thus clocks would be running.
> > > 
> > > As a hack, please can you try calling phylink_prepare_resume()
> > > between the __stmmac_release() and __stmmac_open() in
> > > stmmac_change_mtu(). This should resume the PHY, thus restoring the
> > > clocks necessary for stmmac to reset.
> > 
> > I tried the following patch. This works as you suspected.
> 
> Brilliant, thanks for proving the theory why it broke.
> 
> I'll have a think about the best way to solve this, because
> phylink_prepare_resume() is supposed to be paired with phylink_resume()
> and that isn't the case here.
> 
> Please bear with me as my availability for looking at the kernel is
> very unpredictable at present (family health issues.)

I have some patches which passed build testing, but more chaos means
I can't post them nor test them. I'll do something when I'm next able
to, whenever that will be.

The next problem will be netdev's policy on the balance of reviews vs
patches, where I'm already in deficit, and I have *NO* *TIME*
whatsoever to review patches - let alone propose patches to
fix people's problems.

So I'm going to say this plainly: if netdev wants to enforce that
rule, then I won't be fixing people's problems.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


* Re: [PATCH net] selftests: net: add missing CMAC to tcp_ao config
From: Matthieu Baerts @ 2026-04-16 13:47 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, shuah, fw,
	antonio, phil, linux-kselftest
In-Reply-To: <20260416010439.1053587-1-kuba@kernel.org>

Hi Jakub,

On 16/04/2026 03:04, Jakub Kicinski wrote:
> Recent changes to crypto and wifi made CMAC no longer
> selected by default on x86 and tcp_ao needs it.
> Add the missing config.

Thank you for the fix, I see that CMAC is the default algo used in the
TCP AO selftests, and this modification fixes the recent issue.

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH net] sctp: fix OOB write to userspace in sctp_getsockopt_peer_auth_chunks
From: Xin Long @ 2026-04-16 13:46 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: linux-sctp, Marcelo Ricardo Leitner, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
	linux-kernel, stable
In-Reply-To: <20260416031903.1447072-1-michael.bommarito@gmail.com>

On Wed, Apr 15, 2026 at 11:19 PM Michael Bommarito
<michael.bommarito@gmail.com> wrote:
>
> sctp_getsockopt_peer_auth_chunks() checks that the caller's optval
> buffer is large enough for the peer AUTH chunk list with
>
>     if (len < num_chunks)
>             return -EINVAL;
>
> but then writes num_chunks bytes to p->gauth_chunks, which lives
> at offset offsetof(struct sctp_authchunks, gauth_chunks) == 8
> inside optval.  The check is missing the sizeof(struct
> sctp_authchunks) = 8-byte header.  When the caller supplies
> len == num_chunks (for any num_chunks > 0) the test passes but
> copy_to_user() writes sizeof(struct sctp_authchunks) = 8 bytes
> past the declared buffer.
>
> The sibling function sctp_getsockopt_local_auth_chunks() at the
> next line already has the correct check:
>
>     if (len < sizeof(struct sctp_authchunks) + num_chunks)
>             return -EINVAL;
>
> Align the peer variant with its sibling.
>
> Reproducer confirms on v7.0-13-generic: an unprivileged userspace
> caller that opens a loopback SCTP association with AUTH enabled,
> queries num_chunks with a short optval, then issues the real
> getsockopt with len == num_chunks and sentinel bytes painted past
> the buffer observes those sentinel bytes overwritten with the
> peer's AUTH chunk type.  The bytes written are under the peer's
> control but land in the caller's own userspace; this is not a
> kernel memory corruption, but it is a kernel-side contract
> violation that can silently corrupt adjacent userspace data.
>
> Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
> Cc: stable@vger.kernel.org
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
>  net/sctp/socket.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 05fb00c9c335..f5d442753dc9 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -7033,7 +7033,7 @@ static int sctp_getsockopt_peer_auth_chunks(struct sock *sk, int len,
>
>         /* See if the user provided enough room for all the data */
>         num_chunks = ntohs(ch->param_hdr.length) - sizeof(struct sctp_paramhdr);
> -       if (len < num_chunks)
> +       if (len < sizeof(struct sctp_authchunks) + num_chunks)
>                 return -EINVAL;
>
>         if (copy_to_user(to, ch->chunks, num_chunks))
> --
> 2.53.0
>

Acked-by: Xin Long <lucien.xin@gmail.com>
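
For readers following along, the arithmetic behind the fix can be
checked with a small userspace model. This is a sketch, not the kernel
code: the struct below only mirrors the 8-byte header layout stated in
the commit message, size_t is used instead of the kernel's int len, and
the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the layout described above: 8-byte header, then the chunks. */
struct authchunks_model {
	int32_t  gauth_assoc_id;
	uint32_t gauth_number_of_chunks;
	uint8_t  gauth_chunks[];
};

static_assert(sizeof(struct authchunks_model) == 8, "header is 8 bytes");
static_assert(offsetof(struct authchunks_model, gauth_chunks) == 8,
	      "chunk list starts right after the header");

/* Old check: passes as soon as len covers the chunk list alone. */
int old_check_passes(size_t len, size_t num_chunks)
{
	return !(len < num_chunks);
}

/* Fixed check: len must cover the header plus the chunk list. */
int new_check_passes(size_t len, size_t num_chunks)
{
	return !(len < sizeof(struct authchunks_model) + num_chunks);
}

/* Bytes written beyond a buffer of len bytes when the copy proceeds. */
size_t bytes_past_buffer(size_t len, size_t num_chunks)
{
	size_t end = offsetof(struct authchunks_model, gauth_chunks) +
		     num_chunks;

	return end > len ? end - len : 0;
}
```

For len == num_chunks == 4 the old check passes and the model reports
8 bytes written past the buffer, while the fixed check rejects the
call.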


* Re: [patch 35/38] s390: Select ARCH_HAS_RANDOM_ENTROPY
From: Heiko Carstens @ 2026-04-16 13:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-s390, Arnd Bergmann, x86, Lu Baolu, iommu,
	Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
	linux-crypto, Vlastimil Babka, linux-mm, David Woodhouse,
	Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
	Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
	kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
	Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
	linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
	Geert Uytterhoeven, linux-m68k, Dinh Nguyen, Jonas Bonn,
	linux-openrisc, Helge Deller, linux-parisc, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, linux-riscv, David S. Miller,
	sparclinux
In-Reply-To: <20260410120319.924028412@kernel.org>

On Fri, Apr 10, 2026 at 02:21:19PM +0200, Thomas Gleixner wrote:
> The only remaining non-architecture usage of get_cycles() is to provide
> random_get_entropy().
> 
> Switch s390 over to the new scheme of selecting ARCH_HAS_RANDOM_ENTROPY and
> providing random_get_entropy() in asm/random.h.
> 
> Add 'asm/timex.h' includes to the relevant files, so the global include can
> be removed once all architectures are converted over.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: linux-s390@vger.kernel.org
> ---
>  arch/s390/Kconfig              |    1 +
>  arch/s390/include/asm/random.h |   12 ++++++++++++
>  arch/s390/include/asm/timex.h  |    6 ------
>  arch/s390/kernel/time.c        |    1 +
>  arch/s390/kernel/vtime.c       |    1 +
>  5 files changed, 15 insertions(+), 6 deletions(-)

Acked-by: Heiko Carstens <hca@linux.ibm.com>

Thomas, would you mind adding the below as minor improvement to this
series?

From 7072e5d66b99a7fa666d17c6ad8cb254f2d8f473 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <hca@linux.ibm.com>
Date: Thu, 16 Apr 2026 15:08:15 +0200
Subject: [PATCH] s390: Use get_tod_clock_fast() for random_get_entropy()

Use get_tod_clock_fast() instead of get_tod_clock_monotonic() to implement
random_get_entropy().

There is no need for random_get_entropy() to provide monotonically increasing
values, nor is there any need to provide (close to) nanosecond granularity
timestamps by shifting the result.

This slightly reduces the execution time of random_get_entropy() and adds
two bits of randomness.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/random.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/random.h b/arch/s390/include/asm/random.h
index 7daf42dbed32..f6d9312efdbf 100644
--- a/arch/s390/include/asm/random.h
+++ b/arch/s390/include/asm/random.h
@@ -6,7 +6,7 @@
 
 static inline unsigned long random_get_entropy(void)
 {
-	return (unsigned long)get_tod_clock_monotonic() >> 2;
+	return get_tod_clock_fast();
 }
 
 #endif
-- 
2.51.0
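
A quick userspace illustration of the "two bits" remark, as a sketch
only: it models nothing but the effect of the ">> 2" in the removed
line, with plain integers standing in for clock readings. The function
names are hypothetical, and any other difference between the two clock
helpers is deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Old form in the diff: the clock value shifted right by two. */
uint64_t entropy_shifted(uint64_t tod)
{
	return tod >> 2;
}

/* New form: the clock value is returned without shifting. */
uint64_t entropy_unshifted(uint64_t tod)
{
	return tod;
}

/* Count distinct outputs for the four inputs base .. base + 3. */
int distinct_outputs(uint64_t (*fn)(uint64_t), uint64_t base)
{
	uint64_t out[4];
	int distinct = 0;

	/* Base must be a multiple of four so only the low two bits vary. */
	assert((base & 3) == 0);

	for (int i = 0; i < 4; i++) {
		int seen = 0;

		out[i] = fn(base + (uint64_t)i);
		for (int j = 0; j < i; j++)
			if (out[j] == out[i])
				seen = 1;
		if (!seen)
			distinct++;
	}
	return distinct;
}
```

Four inputs that differ only in their two lowest bits collapse to a
single output with the shift and stay distinct without it.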



* Re: [PATCH net 00/14] Netfilter/IPVS fixes for net
From: Fernando Fernandez Mancera @ 2026-04-16 13:37 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev, kuba, pabeni,
	edumazet, horms
In-Reply-To: <aeDgvwlyuGF4HnWK@strlen.de>

On 4/16/26 3:14 PM, Florian Westphal wrote:
> Fernando Fernandez Mancera <fmancera@suse.de> wrote:
>> I would like to propose adding the netfilter-devel mailing list to
>> sashiko.dev and also to Netdev CI. I think Jakub mentioned it was
>> possible on a previous occasion.
> 
> I already run all my pull requests through most of NIPA's tests, with
> additional netfilter-specific tests.
> 
>> I think it isn't sustainable to review and address the AI/LLM comments
>> when sending the pull request for net/net-next.
> 
> The current bug report influx is already unsustainable for us.
> 

Agreed, it isn't sustainable. It should be fine to just delay everything. The resources
are limited and if the influx of LLM/AI generated reports increases, it 
will just take more time to get through them.

>> If you agree I could help moving this forward.
> 
> If you know who to contact to make sashiko also digest netfilter-devel
> that would be good to have.
> 

Yes, I can reach out to Roman Gushchin regarding it.

Thanks,
Fernando.



* Re: [PATCH net 1/1] 8021q: free cleared egress QoS mappings safely
From: Simon Horman @ 2026-04-16 13:34 UTC (permalink / raw)
  To: Yuan Tan
  Cc: Ren Wei, netdev, andrew+netdev, davem, edumazet, kuba, pabeni,
	kees, yifanwucs, tomapufckgml, bird, ylong030
In-Reply-To: <d22967f1-789d-4f53-baa8-492f64fc725a@gmail.com>

On Wed, Apr 15, 2026 at 10:35:19PM -0700, Yuan Tan wrote:
> 
> On 4/15/26 08:15, Simon Horman wrote:
> > On Mon, Apr 13, 2026 at 05:07:20PM +0800, Ren Wei wrote:
> >> From: Longxuan Yu <ylong030@ucr.edu>
> >>
> >> vlan_dev_set_egress_priority() leaves cleared egress priority mapping
> >> nodes in the hash until device teardown. Repeated set/clear cycles with
> >> distinct skb priorities therefore allocate an unbounded number of
> >> vlan_priority_tci_mapping objects and leak memory.
> >>
> >> Delete mappings when vlan_prio is cleared instead of keeping
> >> tombstones. The TX fast path and reporting paths walk the lists without
> >> RTNL, so convert the egress mapping lists to RCU-protected pointers and
> >> defer freeing removed nodes until after a grace period.
> >>
> >> Cc: stable@kernel.org
> >> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> >> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> >> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
> >> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
> >> Suggested-by: Xin Liu <bird@lzu.edu.cn>
> >> Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
> >> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> >> ---
> >>  include/linux/if_vlan.h  | 23 +++++++++++--------
> >>  net/8021q/vlan_dev.c     | 48 +++++++++++++++++++++++-----------------
> >>  net/8021q/vlan_netlink.c |  9 +++-----
> >>  net/8021q/vlanproc.c     | 12 ++++++----
> >>  4 files changed, 53 insertions(+), 39 deletions(-)
> > There is a lot of change here. And I'd suggest splitting the patch up into
> > (at least) two patches:
> >
> > 1. Convert mappings to use RCU
> > 2. Fix bug
> >
> > As is, the bug fix itself is difficult to isolate amongst the other changes.
> >
> > Also, AI generated review suggests that this bug was introduced by commit
> > b020cb488586 ("[VLAN]: Keep track of number of QoS mappings"). If so,
> > it would be appropriate to use that commit in the Fixes tag.
> >
> Thank you very much for your review and suggestions. We will try to
> revise it in this direction.
> May I ask whether we should include your “Suggested-by” tag in the patch?

I don't think you should include a Suggested-by tag.

My reasoning is that I'm only providing feedback and possible enhancements
to your approach. I think Suggested-by would be appropriate if I'd
suggested creating this patch in the first place.


* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Simon Horman @ 2026-04-16 13:29 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <ad_DrrC6duye_lR0@lore-desk>

On Wed, Apr 15, 2026 at 06:58:22PM +0200, Lorenzo Bianconi wrote:
> On Apr 15, Simon Horman wrote:
> > On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> > > 
> > > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > ---
> > > Changes in v2:
> > > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> > >   order to avoid any possible NULL pointer dereference in
> > >   airoha_qdma_cleanup_tx_queue()
> > 
> > This seems to be a separate issue.
> > If so, I think it should be split out into a separate patch.
> > 
> > > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
> > 
> > I think it was covered in the review Jakub forwarded for v1.  But FTR,
> > Sashiko has some feedback on this patch in the form of an existing bug
> > (that should almost certainly be handled separately from this patch).
> 
> Hi Simon,
> 
> I took a look at Sashiko's report [0], but this issue is not introduced by
> this patch and, even if it would be a better approach, I guess the hw is
> capable of managing out-of-order TX descriptors. So I guess this patch is fine
> in this way, agree?
> 
> [0] https://sashiko.dev/#/patchset/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022%40kernel.org

Hi Lorenzo,

You responded in a different sub thread, so I think this is probably
implied. But FTR:

1. I agree [0] is not introduced by this patch
2. If the hw is capable of managing out-of-order TX descriptors then
   I think that [0] is a false positive

Regardless, [0] doesn't affect this patch.


* Re: [PATCH v1 net 1/1] net/sched: sch_dualpi2: fix limit/memlimit enforcement when dequeueing L-queue
From: Paolo Abeni @ 2026-04-16 13:25 UTC (permalink / raw)
  To: chia-yu.chang, linux-hardening, kees, gustavoars, jhs, jiri,
	davem, edumazet, kuba, linux-kernel, netdev, horms, ij, ncardwell,
	koen.de_schepper, g.white, ingemar.s.johansson, mirja.kuehlewind,
	cheshire, rs.ietf, Jason_Livingood, vidhi_goel
In-Reply-To: <20260413163711.56191-1-chia-yu.chang@nokia-bell-labs.com>

On 4/13/26 6:37 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> 
> Fix dualpi2_change() to correctly enforce updated limit and memlimit values
> after a configuration change of the dualpi2 qdisc.
> 
> Before this patch, dualpi2_change() always attempted to dequeue packets via
> the root qdisc (C-queue) when reducing backlog or memory usage, and
> unconditionally assumed that a valid skb will be returned. When traffic
> classification results in packets being queued in the L-queue while the
> C-queue is empty, this leads to a NULL skb dereference during limit or
> memlimit enforcement.
> 
> This is fixed by first dequeuing from the C-queue path if it is non-empty.
> Once the C-queue is empty, packets are dequeued directly from the L-queue.
> Return values from qdisc_dequeue_internal() are checked for both queues. When
> dequeuing from the L-queue, the parent qdisc qlen and backlog counters are
> updated explicitly to keep overall qdisc statistics consistent.
> 
> Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  net/sched/sch_dualpi2.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> index 6d7e6389758d..56d4422970b6 100644
> --- a/net/sched/sch_dualpi2.c
> +++ b/net/sched/sch_dualpi2.c
> @@ -872,11 +872,25 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
>  	old_backlog = sch->qstats.backlog;
>  	while (qdisc_qlen(sch) > sch->limit ||
>  	       q->memory_used > q->memory_limit) {
> -		struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
> -
> -		q->memory_used -= skb->truesize;
> -		qdisc_qstats_backlog_dec(sch, skb);
> -		rtnl_qdisc_drop(skb, sch);
> +		int c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> +		struct sk_buff *skb = NULL;
> +
> +		if (c_len) {
> +			skb = qdisc_dequeue_internal(sch, true);
> +			if (!skb)
> +				break;
> +			q->memory_used -= skb->truesize;
> +			rtnl_qdisc_drop(skb, sch);
> +		} else if (qdisc_qlen(q->l_queue)) {
> +			skb = qdisc_dequeue_internal(q->l_queue, true);
> +			if (!skb)
> +				break;
> +			q->memory_used -= skb->truesize;
> +			rtnl_qdisc_drop(skb, q->l_queue);
> +			/* Keep the overall qdisc stats consistent */
> +			--sch->q.qlen;
> +			qdisc_qstats_backlog_dec(sch, skb);

Sashiko says:
---
The drop counter is incremented for the L-queue via rtnl_qdisc_drop(),
but it appears the drop counter for the parent qdisc (sch) is not updated.
Will this cause user-facing statistics for the overall dualpi2 qdisc to
underreport drops?
---
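
For illustration only, the accounting question can be reproduced with a toy
model. This is a minimal sketch with plain counters, not the sch_dualpi2 code:
struct toy_qdisc and toy_drain() are hypothetical names, and the model assumes
the parent's qlen includes the packets held in the L-queue.

```c
#include <assert.h>

/* Toy counters only; not struct Qdisc. */
struct toy_qdisc {
	unsigned int qlen;	/* packets currently queued */
	unsigned int drops;	/* drop counter reported to userspace */
};

/*
 * Drain until the parent is within @limit. The parent's qlen is assumed
 * to include the packets held in the L-queue.
 */
void toy_drain(struct toy_qdisc *parent, struct toy_qdisc *l_queue,
	       unsigned int limit)
{
	while (parent->qlen > limit) {
		unsigned int c_len = parent->qlen - l_queue->qlen;

		if (c_len) {
			/* C-queue packet: dropped against the parent. */
			parent->qlen--;
			parent->drops++;
		} else if (l_queue->qlen) {
			/* L-queue packet: only the child counts the drop. */
			l_queue->qlen--;
			l_queue->drops++;
			parent->qlen--;
		} else {
			break;
		}
	}
}
```

With a parent qlen of 5, an L-queue qlen of 3 and a limit of 1, four packets
are dropped in this model, but the parent's drop counter only reaches 2.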


* Re: [PATCH v2] vsock/virtio: fix accept queue count leak on transport mismatch
From: patchwork-bot+netdevbpf @ 2026-04-16 13:20 UTC (permalink / raw)
  To: Dudu Lu; +Cc: netdev, stefanha, sgarzare, mst, jasowang
In-Reply-To: <20260413131409.19022-1-phx0fer@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 13 Apr 2026 21:14:09 +0800 you wrote:
> virtio_transport_recv_listen() calls sk_acceptq_added() before
> vsock_assign_transport(). If vsock_assign_transport() fails or
> selects a different transport, the error path returns without
> calling sk_acceptq_removed(), permanently incrementing
> sk_ack_backlog.
> 
> After approximately backlog+1 such failures, sk_acceptq_is_full()
> returns true, causing the listener to reject all new connections.
> 
> [...]

Here is the summary with links:
  - [v2] vsock/virtio: fix accept queue count leak on transport mismatch
    https://git.kernel.org/netdev/net/c/52bcb57a4e8a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH] net: mdio: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
From: Simon Horman @ 2026-04-16 13:20 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Charles Perry, Conor Dooley, Jakub Kicinski, Maxime Chevallier,
	Andrew Lunn, Heiner Kallweit, Russell King, David S . Miller,
	Eric Dumazet, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <980c57efa5843733ef95459c3283aebade56f142.1776162544.git.geert+renesas@glider.be>

On Tue, Apr 14, 2026 at 12:30:47PM +0200, Geert Uytterhoeven wrote:
> The PIC64-HPSC/HX MDIO interface is only present on Microchip
> PIC64-HPSC/HX SoCs.  Hence add a dependency on ARCH_MICROCHIP, to
> prevent asking the user about this driver when configuring a kernel
> without Microchip SoC support.
> 
> Fixes: f76aef980206e7c6 ("net: mdio: add a driver for PIC64-HPSC/HX MDIO controller")
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>

Reviewed-by: Simon Horman <horms@kernel.org>


* [PATCH net 11/11] netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

Publish the new hooks into the basechain/flowtable hook list using
list_splice_rcu(), so that netlink dump traversal of the list via RCU is
safe while a concurrent ruleset update is going on.

Fixes: 78d9f48f7f44 ("netfilter: nf_tables: add devices to existing flowtable")
Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 090d4d688a33..8c0706d6d887 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -10904,8 +10904,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 				nft_chain_commit_update(nft_trans_container_chain(trans));
 				nf_tables_chain_notify(&ctx, NFT_MSG_NEWCHAIN,
 						       &nft_trans_chain_hooks(trans));
-				list_splice(&nft_trans_chain_hooks(trans),
-					    &nft_trans_basechain(trans)->hook_list);
+				list_splice_rcu(&nft_trans_chain_hooks(trans),
+						&nft_trans_basechain(trans)->hook_list);
 				/* trans destroyed after rcu grace period */
 			} else {
 				nft_chain_commit_drop_policy(nft_trans_container_chain(trans));
@@ -11034,8 +11034,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 							   nft_trans_flowtable(trans),
 							   &nft_trans_flowtable_hooks(trans),
 							   NFT_MSG_NEWFLOWTABLE);
-				list_splice(&nft_trans_flowtable_hooks(trans),
-					    &nft_trans_flowtable(trans)->hook_list);
+				list_splice_rcu(&nft_trans_flowtable_hooks(trans),
+						&nft_trans_flowtable(trans)->hook_list);
 			} else {
 				nft_clear(net, nft_trans_flowtable(trans));
 				nf_tables_flowtable_notify(&ctx,
-- 
2.47.3


* [PATCH net 10/11] rculist: add list_splice_rcu() for private lists
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

This patch adds a helper function, list_splice_rcu(), to safely splice
a private (non-RCU-protected) list into an RCU-protected list.

The function ensures that only the pointer visible to RCU readers
(prev->next) is updated using rcu_assign_pointer(), while the rest of
the list manipulations are performed with regular assignments, as the
source list is private and not visible to concurrent RCU readers.

This is useful for moving elements from a private list into a global
RCU-protected list, ensuring safe publication for RCU readers.
Subsystems with some sort of batching mechanism from userspace can
benefit from this new function.

The function __list_splice_rcu() has been added for clarity and to
follow the same pattern as in the existing list_splice*() interfaces,
where there is a check to ensure that the list to splice is not
empty. Note that __list_splice_rcu() has no documentation for this
reason.
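
As a rough userspace illustration of the publication order described above,
the sketch below models the splice with C11 atomics. It is not the kernel
implementation: struct node and the helper names are hypothetical, and a
release store stands in for rcu_assign_pointer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for struct list_head; readers only follow ->next. */
struct node {
	_Atomic(struct node *) next;
	struct node *prev;
};

void node_init(struct node *head)
{
	atomic_store_explicit(&head->next, head, memory_order_relaxed);
	head->prev = head;
}

/* Plain tail insert, used only to build lists before they are shared. */
void node_add_tail(struct node *n, struct node *head)
{
	struct node *last = head->prev;

	atomic_store_explicit(&n->next, head, memory_order_relaxed);
	n->prev = last;
	atomic_store_explicit(&last->next, n, memory_order_relaxed);
	head->prev = n;
}

/* Splice the private list @list right after @head. */
void splice_private(struct node *list, struct node *head)
{
	struct node *first = atomic_load_explicit(&list->next, memory_order_relaxed);
	struct node *last = list->prev;
	struct node *next = atomic_load_explicit(&head->next, memory_order_relaxed);

	if (first == list)	/* nothing to splice */
		return;

	/* The private entries are not reachable by readers yet. */
	atomic_store_explicit(&last->next, next, memory_order_relaxed);
	first->prev = head;
	next->prev = last;
	/* Publication point: the only pointer readers can observe changing. */
	atomic_store_explicit(&head->next, first, memory_order_release);
}

/* Reader-side walk: count entries by following ->next only. */
size_t count_nodes(struct node *head)
{
	size_t n = 0;
	struct node *pos = atomic_load_explicit(&head->next, memory_order_acquire);

	while (pos != head) {
		n++;
		pos = atomic_load_explicit(&pos->next, memory_order_acquire);
	}
	return n;
}
```

In this model a reader walking ->next sees either the old list or the fully
linked new entries, never a half-linked one, because the private entries are
wired up before the single store to head->next.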

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/rculist.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 2abba7552605..e3bc44225692 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -261,6 +261,35 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }

+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	struct list_head *first = list->next;
+	struct list_head *last = list->prev;
+
+	last->next = next;
+	first->prev = prev;
+	next->prev = last;
+	rcu_assign_pointer(list_next_rcu(prev), first);
+}
+
+/**
+ * list_splice_rcu - splice a non-RCU list into an RCU-protected list,
+ *                   designed for stacks.
+ * @list:	the non RCU-protected list to splice
+ * @head:	the place in the existing RCU-protected list to splice
+ *
+ * The list pointed to by @head can be RCU-read traversed concurrently with
+ * this function.
+ */
+static inline void list_splice_rcu(struct list_head *list,
+				   struct list_head *head)
+{
+	if (!list_empty(list))
+		__list_splice_rcu(list, head, head->next);
+}
+
 /**
  * __list_splice_init_rcu - join an RCU-protected list into an existing list.
  * @list:	the RCU-protected list to splice
-- 
2.47.3

* [PATCH net 09/11] netfilter: nf_tables: use list_del_rcu for netlink hooks
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

nft_netdev_unregister_hooks() and __nft_unregister_flowtable_net_hooks()
need to use list_del_rcu() because this list can be walked by concurrent
dumpers.

Add a new helper and use it consistently.

Fixes: f9a43007d3f7 ("netfilter: nf_tables: double hook unregistration in netns path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c | 44 ++++++++++++++---------------------
 1 file changed, 18 insertions(+), 26 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 8c42247a176c..090d4d688a33 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -374,6 +374,12 @@ static void nft_netdev_hook_free_rcu(struct nft_hook *hook)
 	call_rcu(&hook->rcu, __nft_netdev_hook_free_rcu);
 }
 
+static void nft_netdev_hook_unlink_free_rcu(struct nft_hook *hook)
+{
+	list_del_rcu(&hook->list);
+	nft_netdev_hook_free_rcu(hook);
+}
+
 static void nft_netdev_unregister_hooks(struct net *net,
 					struct list_head *hook_list,
 					bool release_netdev)
@@ -384,10 +390,8 @@ static void nft_netdev_unregister_hooks(struct net *net,
 	list_for_each_entry_safe(hook, next, hook_list, list) {
 		list_for_each_entry(ops, &hook->ops_list, list)
 			nf_unregister_net_hook(net, ops);
-		if (release_netdev) {
-			list_del(&hook->list);
-			nft_netdev_hook_free_rcu(hook);
-		}
+		if (release_netdev)
+			nft_netdev_hook_unlink_free_rcu(hook);
 	}
 }
 
@@ -2323,10 +2327,8 @@ void nf_tables_chain_destroy(struct nft_chain *chain)
 
 		if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) {
 			list_for_each_entry_safe(hook, next,
-						 &basechain->hook_list, list) {
-				list_del_rcu(&hook->list);
-				nft_netdev_hook_free_rcu(hook);
-			}
+						 &basechain->hook_list, list)
+				nft_netdev_hook_unlink_free_rcu(hook);
 		}
 		module_put(basechain->type->owner);
 		if (rcu_access_pointer(basechain->stats)) {
@@ -3026,6 +3028,7 @@ static int nf_tables_updchain(struct nft_ctx *ctx, u8 genmask, u8 policy,
 				list_for_each_entry(ops, &h->ops_list, list)
 					nf_unregister_net_hook(ctx->net, ops);
 			}
+			/* hook.list is on stack, no need for list_del_rcu() */
 			list_del(&h->list);
 			nft_netdev_hook_free_rcu(h);
 		}
@@ -8903,10 +8906,8 @@ static void __nft_unregister_flowtable_net_hooks(struct net *net,
 	list_for_each_entry_safe(hook, next, hook_list, list) {
 		list_for_each_entry(ops, &hook->ops_list, list)
 			nft_unregister_flowtable_ops(net, flowtable, ops);
-		if (release_netdev) {
-			list_del(&hook->list);
-			nft_netdev_hook_free_rcu(hook);
-		}
+		if (release_netdev)
+			nft_netdev_hook_unlink_free_rcu(hook);
 	}
 }
 
@@ -8977,8 +8978,7 @@ static int nft_register_flowtable_net_hooks(struct net *net,
 
 			nft_unregister_flowtable_ops(net, flowtable, ops);
 		}
-		list_del_rcu(&hook->list);
-		nft_netdev_hook_free_rcu(hook);
+		nft_netdev_hook_unlink_free_rcu(hook);
 	}
 
 	return err;
@@ -8988,10 +8988,8 @@ static void nft_hooks_destroy(struct list_head *hook_list)
 {
 	struct nft_hook *hook, *next;
 
-	list_for_each_entry_safe(hook, next, hook_list, list) {
-		list_del_rcu(&hook->list);
-		nft_netdev_hook_free_rcu(hook);
-	}
+	list_for_each_entry_safe(hook, next, hook_list, list)
+		nft_netdev_hook_unlink_free_rcu(hook);
 }
 
 static int nft_flowtable_update(struct nft_ctx *ctx, const struct nlmsghdr *nlh,
@@ -9079,8 +9077,7 @@ static int nft_flowtable_update(struct nft_ctx *ctx, const struct nlmsghdr *nlh,
 				nft_unregister_flowtable_ops(ctx->net,
 							     flowtable, ops);
 		}
-		list_del_rcu(&hook->list);
-		nft_netdev_hook_free_rcu(hook);
+		nft_netdev_hook_unlink_free_rcu(hook);
 	}
 
 	return err;
@@ -9586,13 +9583,8 @@ static void nf_tables_flowtable_notify(struct nft_ctx *ctx,
 
 static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
 {
-	struct nft_hook *hook, *next;
-
 	flowtable->data.type->free(&flowtable->data);
-	list_for_each_entry_safe(hook, next, &flowtable->hook_list, list) {
-		list_del_rcu(&hook->list);
-		nft_netdev_hook_free_rcu(hook);
-	}
+	nft_hooks_destroy(&flowtable->hook_list);
 	kfree(flowtable->name);
 	module_put(flowtable->data.type->owner);
 	kfree(flowtable);
-- 
2.47.3


* [PATCH net 08/11] ipvs: fix MTU check for GSO packets in tunnel mode
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: Yingnan Zhang <342144303@qq.com>

Currently, IPVS skips MTU checks for GSO packets by excluding them with
the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
mode encapsulates GSO packets with IPIP headers.

The issue manifests in two ways:

1. MTU violation after encapsulation:
   When a GSO packet passes through IPVS tunnel mode, the original MTU
   check is bypassed. After adding the IPIP tunnel header, the packet
   size may exceed the outgoing interface MTU, leading to unexpected
   fragmentation at the IP layer.

2. Fragmentation with problematic IP IDs:
   When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
   is fragmented after encapsulation, each segment gets a sequentially
   incremented IP ID (0, 1, 2, ...). This happens because:

   a) The GSO packet bypasses MTU check and gets encapsulated
   b) At __ip_finish_output, the oversized GSO packet is split into
      separate SKBs (one per segment), with IP IDs incrementing
   c) Each SKB is then fragmented again based on the actual MTU

   This sequential IP ID allocation differs from the expected behavior
   and can cause issues with fragment reassembly and packet tracking.

Fix this by properly validating GSO packets using
skb_gso_validate_network_len(). This function correctly validates
whether the GSO segments will fit within the MTU after segmentation. If
validation fails, send an ICMP Fragmentation Needed message to enable
proper PMTU discovery.
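
The decision can be sketched in plain C as follows. This is a simplified
model under stated assumptions, not the kernel code: struct pkt and
gso_segments_fit() are hypothetical stand-ins for struct sk_buff and
skb_gso_validate_network_len(), and a segment is taken to be its headers
plus one gso_size worth of payload.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified packet descriptor (not struct sk_buff). */
struct pkt {
	unsigned int len;	/* total length, network header included */
	unsigned int hdr_len;	/* network + transport header bytes */
	unsigned int gso_size;	/* payload bytes per segment, 0 if not GSO */
};

/* Would every segment, once split, fit within @mtu? */
bool gso_segments_fit(const struct pkt *p, unsigned int mtu)
{
	return p->hdr_len + p->gso_size <= mtu;
}

/* Same shape as the check described above: length first, then GSO. */
bool exceeds_mtu(const struct pkt *p, unsigned int mtu)
{
	if (p->len <= mtu)
		return false;
	if (p->gso_size && gso_segments_fit(p, mtu))
		return false;
	return true;
}
```

The point of the model is that a GSO packet is no longer exempt from the
check: it passes only if its individual segments fit, so a packet whose
segments would overflow the MTU left after encapsulation is reported as
too big.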

Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
Signed-off-by: Yingnan Zhang <342144303@qq.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 3601eb86d025..7c570f48ade2 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -102,6 +102,18 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
 	return dest_dst;
 }

+/* Based on ip_exceeds_mtu(). */
+static bool ip_vs_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+{
+	if (skb->len <= mtu)
+		return false;
+
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+
+	return true;
+}
+
 static inline bool
 __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
 {
@@ -111,10 +123,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
 		 */
 		if (IP6CB(skb)->frag_max_size > mtu)
 			return true; /* largest fragment violate MTU */
-	}
-	else if (skb->len > mtu && !skb_is_gso(skb)) {
+	} else if (ip_vs_exceeds_mtu(skb, mtu))
 		return true; /* Packet size violate MTU size */
-	}
+
 	return false;
 }

@@ -232,7 +243,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 			return true;

 		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
-			     skb->len > mtu && !skb_is_gso(skb) &&
+			     ip_vs_exceeds_mtu(skb, mtu) &&
 			     !ip_vs_iph_icmp(ipvsh))) {
 			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 				  htonl(mtu));
-- 
2.47.3

* [PATCH net 07/11] netfilter: nat: use kfree_rcu to release ops
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

Florian Westphal says:

"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.

However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.

This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.

The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.

But once that changes the nat ops structures have to be deferred too."

Update nf_nat_register_fn() to deal with partial exposure of the hooks
from the error path, which can also be an issue for nfnetlink_hook.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/iptable_nat.c  |  2 +-
 net/ipv6/netfilter/ip6table_nat.c |  2 +-
 net/netfilter/nf_nat_core.c       | 10 ++++++----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/iptable_nat.c b/net/ipv4/netfilter/iptable_nat.c
index a5db7c67d61b..3b1de7f82bf8 100644
--- a/net/ipv4/netfilter/iptable_nat.c
+++ b/net/ipv4/netfilter/iptable_nat.c
@@ -100,7 +100,7 @@ static void ipt_nat_unregister_lookups(struct net *net)
 	for (i = 0; i < ARRAY_SIZE(nf_nat_ipv4_ops); i++)
 		nf_nat_ipv4_unregister_fn(net, &ops[i]);
 
-	kfree(ops);
+	kfree_rcu(ops, rcu);
 }
 
 static int iptable_nat_table_init(struct net *net)
diff --git a/net/ipv6/netfilter/ip6table_nat.c b/net/ipv6/netfilter/ip6table_nat.c
index e119d4f090cc..9adfbfeaab0c 100644
--- a/net/ipv6/netfilter/ip6table_nat.c
+++ b/net/ipv6/netfilter/ip6table_nat.c
@@ -102,7 +102,7 @@ static void ip6t_nat_unregister_lookups(struct net *net)
 	for (i = 0; i < ARRAY_SIZE(nf_nat_ipv6_ops); i++)
 		nf_nat_ipv6_unregister_fn(net, &ops[i]);
 
-	kfree(ops);
+	kfree_rcu(ops, rcu);
 }
 
 static int ip6table_nat_table_init(struct net *net)
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 3b5434e4ec9c..b30ca94c2bb7 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -1228,9 +1228,11 @@ int nf_nat_register_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops,
 		ret = nf_register_net_hooks(net, nat_ops, ops_count);
 		if (ret < 0) {
 			mutex_unlock(&nf_nat_proto_mutex);
-			for (i = 0; i < ops_count; i++)
-				kfree(nat_ops[i].priv);
-			kfree(nat_ops);
+			for (i = 0; i < ops_count; i++) {
+				priv = nat_ops[i].priv;
+				kfree_rcu(priv, rcu_head);
+			}
+			kfree_rcu(nat_ops, rcu);
 			return ret;
 		}
 
@@ -1294,7 +1296,7 @@ void nf_nat_unregister_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops,
 		}
 
 		nat_proto_net->nat_hook_ops = NULL;
-		kfree(nat_ops);
+		kfree_rcu(nat_ops, rcu);
 	}
 unlock:
 	mutex_unlock(&nf_nat_proto_mutex);
-- 
2.47.3


* [PATCH net 06/11] netfilter: xtables: restrict several matches to inet family
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

This is a partial revert of:

  commit ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")

to allow ipv4 and ipv6 only.

- xt_mac
- xt_owner
- xt_physdev

These extensions are not used by ebtables in userspace.

Moreover, xt_realm is only for ipv4, since dst->tclassid is ipv4
specific.

Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_mac.c     | 34 +++++++++++++++++++++++-----------
 net/netfilter/xt_owner.c   | 37 +++++++++++++++++++++++++------------
 net/netfilter/xt_physdev.c | 29 +++++++++++++++++++----------
 net/netfilter/xt_realm.c   |  2 +-
 4 files changed, 68 insertions(+), 34 deletions(-)

diff --git a/net/netfilter/xt_mac.c b/net/netfilter/xt_mac.c
index 81649da57ba5..bd2354760895 100644
--- a/net/netfilter/xt_mac.c
+++ b/net/netfilter/xt_mac.c
@@ -38,25 +38,37 @@ static bool mac_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	return ret;
 }
 
-static struct xt_match mac_mt_reg __read_mostly = {
-	.name      = "mac",
-	.revision  = 0,
-	.family    = NFPROTO_UNSPEC,
-	.match     = mac_mt,
-	.matchsize = sizeof(struct xt_mac_info),
-	.hooks     = (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) |
-	             (1 << NF_INET_FORWARD),
-	.me        = THIS_MODULE,
+static struct xt_match mac_mt_reg[] __read_mostly = {
+	{
+		.name		= "mac",
+		.family		= NFPROTO_IPV4,
+		.match		= mac_mt,
+		.matchsize	= sizeof(struct xt_mac_info),
+		.hooks		= (1 << NF_INET_PRE_ROUTING) |
+				  (1 << NF_INET_LOCAL_IN) |
+				  (1 << NF_INET_FORWARD),
+		.me		= THIS_MODULE,
+	},
+	{
+		.name		= "mac",
+		.family		= NFPROTO_IPV6,
+		.match		= mac_mt,
+		.matchsize	= sizeof(struct xt_mac_info),
+		.hooks		= (1 << NF_INET_PRE_ROUTING) |
+				  (1 << NF_INET_LOCAL_IN) |
+				  (1 << NF_INET_FORWARD),
+		.me		= THIS_MODULE,
+	},
 };
 
 static int __init mac_mt_init(void)
 {
-	return xt_register_match(&mac_mt_reg);
+	return xt_register_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg));
 }
 
 static void __exit mac_mt_exit(void)
 {
-	xt_unregister_match(&mac_mt_reg);
+	xt_unregister_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg));
 }
 
 module_init(mac_mt_init);
diff --git a/net/netfilter/xt_owner.c b/net/netfilter/xt_owner.c
index 50332888c8d2..7be2fe22b067 100644
--- a/net/netfilter/xt_owner.c
+++ b/net/netfilter/xt_owner.c
@@ -127,26 +127,39 @@ owner_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	return true;
 }
 
-static struct xt_match owner_mt_reg __read_mostly = {
-	.name       = "owner",
-	.revision   = 1,
-	.family     = NFPROTO_UNSPEC,
-	.checkentry = owner_check,
-	.match      = owner_mt,
-	.matchsize  = sizeof(struct xt_owner_match_info),
-	.hooks      = (1 << NF_INET_LOCAL_OUT) |
-	              (1 << NF_INET_POST_ROUTING),
-	.me         = THIS_MODULE,
+static struct xt_match owner_mt_reg[] __read_mostly = {
+	{
+		.name       = "owner",
+		.revision   = 1,
+		.family     = NFPROTO_IPV4,
+		.checkentry = owner_check,
+		.match      = owner_mt,
+		.matchsize  = sizeof(struct xt_owner_match_info),
+		.hooks      = (1 << NF_INET_LOCAL_OUT) |
+			      (1 << NF_INET_POST_ROUTING),
+		.me         = THIS_MODULE,
+	},
+	{
+		.name       = "owner",
+		.revision   = 1,
+		.family     = NFPROTO_IPV6,
+		.checkentry = owner_check,
+		.match      = owner_mt,
+		.matchsize  = sizeof(struct xt_owner_match_info),
+		.hooks      = (1 << NF_INET_LOCAL_OUT) |
+			      (1 << NF_INET_POST_ROUTING),
+		.me         = THIS_MODULE,
+	}
 };
 
 static int __init owner_mt_init(void)
 {
-	return xt_register_match(&owner_mt_reg);
+	return xt_register_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg));
 }
 
 static void __exit owner_mt_exit(void)
 {
-	xt_unregister_match(&owner_mt_reg);
+	xt_unregister_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg));
 }
 
 module_init(owner_mt_init);
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 343e65f377d4..130842c35c6f 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -115,24 +115,33 @@ static int physdev_mt_check(const struct xt_mtchk_param *par)
 	return 0;
 }
 
-static struct xt_match physdev_mt_reg __read_mostly = {
-	.name       = "physdev",
-	.revision   = 0,
-	.family     = NFPROTO_UNSPEC,
-	.checkentry = physdev_mt_check,
-	.match      = physdev_mt,
-	.matchsize  = sizeof(struct xt_physdev_info),
-	.me         = THIS_MODULE,
+static struct xt_match physdev_mt_reg[] __read_mostly = {
+	{
+		.name		= "physdev",
+		.family		= NFPROTO_IPV4,
+		.checkentry	= physdev_mt_check,
+		.match		= physdev_mt,
+		.matchsize	= sizeof(struct xt_physdev_info),
+		.me		= THIS_MODULE,
+	},
+	{
+		.name		= "physdev",
+		.family		= NFPROTO_IPV6,
+		.checkentry	= physdev_mt_check,
+		.match		= physdev_mt,
+		.matchsize	= sizeof(struct xt_physdev_info),
+		.me		= THIS_MODULE,
+	},
 };
 
 static int __init physdev_mt_init(void)
 {
-	return xt_register_match(&physdev_mt_reg);
+	return xt_register_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg));
 }
 
 static void __exit physdev_mt_exit(void)
 {
-	xt_unregister_match(&physdev_mt_reg);
+	xt_unregister_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg));
 }
 
 module_init(physdev_mt_init);
diff --git a/net/netfilter/xt_realm.c b/net/netfilter/xt_realm.c
index 6df485f4403d..61b2f1e58d15 100644
--- a/net/netfilter/xt_realm.c
+++ b/net/netfilter/xt_realm.c
@@ -33,7 +33,7 @@ static struct xt_match realm_mt_reg __read_mostly = {
 	.matchsize	= sizeof(struct xt_realm_info),
 	.hooks		= (1 << NF_INET_POST_ROUTING) | (1 << NF_INET_FORWARD) |
 			  (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_LOCAL_IN),
-	.family		= NFPROTO_UNSPEC,
+	.family		= NFPROTO_IPV4,
 	.me		= THIS_MODULE
 };
 
-- 
2.47.3


* [PATCH net 05/11] netfilter: conntrack: remove sprintf usage
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Replace it with scnprintf. The buffer sizes are expected to be large enough
to hold the result, so there is no need for an snprintf plus overflow check.

Increase buffer size in mangle_content_len() while at it.
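
To illustrate the difference in userspace, the sketch below approximates the
scnprintf() return convention on top of vsnprintf(). bounded_printf() is a
hypothetical helper written for this example, not a kernel or libc function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace approximation of scnprintf(): never writes past @size and
 * returns the number of characters actually stored, excluding the NUL.
 */
int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated */
	return n;
}
```

With a buffer of sizeof("65536") bytes, formatting 4294967295 through this
helper stores "42949" and returns 5 instead of overrunning the buffer; with
sizeof("4294967295") bytes the full ten digits fit and the return value is 10.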

BUG: KASAN: stack-out-of-bounds in vsnprintf+0xea5/0x1270
Write of size 1 at addr [..]
 vsnprintf+0xea5/0x1270
 sprintf+0xb1/0xe0
 mangle_content_len+0x1ac/0x280
 nf_nat_sdp_session+0x1cc/0x240
 process_sdp+0x8f8/0xb80
 process_invite_request+0x108/0x2b0
 process_sip_msg+0x5da/0xf50
 sip_help_tcp+0x45e/0x780
 nf_confirm+0x34d/0x990
 [..]

Fixes: 9fafcd7b2032 ("[NETFILTER]: nf_conntrack/nf_nat: add SIP helper port")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_nat_amanda.c |  2 +-
 net/netfilter/nf_nat_sip.c    | 33 ++++++++++++++++++---------------
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/nf_nat_amanda.c b/net/netfilter/nf_nat_amanda.c
index 98deef6cde69..8f1054920a85 100644
--- a/net/netfilter/nf_nat_amanda.c
+++ b/net/netfilter/nf_nat_amanda.c
@@ -50,7 +50,7 @@ static unsigned int help(struct sk_buff *skb,
 		return NF_DROP;
 	}
 
-	sprintf(buffer, "%u", port);
+	snprintf(buffer, sizeof(buffer), "%u", port);
 	if (!nf_nat_mangle_udp_packet(skb, exp->master, ctinfo,
 				      protoff, matchoff, matchlen,
 				      buffer, strlen(buffer))) {
diff --git a/net/netfilter/nf_nat_sip.c b/net/netfilter/nf_nat_sip.c
index cf4aeb299bde..c845b6d1a2bd 100644
--- a/net/netfilter/nf_nat_sip.c
+++ b/net/netfilter/nf_nat_sip.c
@@ -68,25 +68,27 @@ static unsigned int mangle_packet(struct sk_buff *skb, unsigned int protoff,
 }
 
 static int sip_sprintf_addr(const struct nf_conn *ct, char *buffer,
+			    size_t size,
 			    const union nf_inet_addr *addr, bool delim)
 {
 	if (nf_ct_l3num(ct) == NFPROTO_IPV4)
-		return sprintf(buffer, "%pI4", &addr->ip);
+		return scnprintf(buffer, size, "%pI4", &addr->ip);
 	else {
 		if (delim)
-			return sprintf(buffer, "[%pI6c]", &addr->ip6);
+			return scnprintf(buffer, size, "[%pI6c]", &addr->ip6);
 		else
-			return sprintf(buffer, "%pI6c", &addr->ip6);
+			return scnprintf(buffer, size, "%pI6c", &addr->ip6);
 	}
 }
 
 static int sip_sprintf_addr_port(const struct nf_conn *ct, char *buffer,
+				 size_t size,
 				 const union nf_inet_addr *addr, u16 port)
 {
 	if (nf_ct_l3num(ct) == NFPROTO_IPV4)
-		return sprintf(buffer, "%pI4:%u", &addr->ip, port);
+		return scnprintf(buffer, size, "%pI4:%u", &addr->ip, port);
 	else
-		return sprintf(buffer, "[%pI6c]:%u", &addr->ip6, port);
+		return scnprintf(buffer, size, "[%pI6c]:%u", &addr->ip6, port);
 }
 
 static int map_addr(struct sk_buff *skb, unsigned int protoff,
@@ -119,7 +121,7 @@ static int map_addr(struct sk_buff *skb, unsigned int protoff,
 	if (nf_inet_addr_cmp(&newaddr, addr) && newport == port)
 		return 1;
 
-	buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, ntohs(newport));
+	buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer), &newaddr, ntohs(newport));
 	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
 			     matchoff, matchlen, buffer, buflen);
 }
@@ -212,7 +214,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
 					       &addr, true) > 0 &&
 		    nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.src.u3) &&
 		    !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.dst.u3)) {
-			buflen = sip_sprintf_addr(ct, buffer,
+			buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer),
 					&ct->tuplehash[!dir].tuple.dst.u3,
 					true);
 			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
@@ -229,7 +231,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
 					       &addr, false) > 0 &&
 		    nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.dst.u3) &&
 		    !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.src.u3)) {
-			buflen = sip_sprintf_addr(ct, buffer,
+			buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer),
 					&ct->tuplehash[!dir].tuple.src.u3,
 					false);
 			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
@@ -247,7 +249,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
 		    htons(n) == ct->tuplehash[dir].tuple.dst.u.udp.port &&
 		    htons(n) != ct->tuplehash[!dir].tuple.src.u.udp.port) {
 			__be16 p = ct->tuplehash[!dir].tuple.src.u.udp.port;
-			buflen = sprintf(buffer, "%u", ntohs(p));
+			buflen = scnprintf(buffer, sizeof(buffer), "%u", ntohs(p));
 			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
 					   poff, plen, buffer, buflen)) {
 				nf_ct_helper_log(skb, ct, "cannot mangle rport");
@@ -418,7 +420,8 @@ static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
 
 	if (!nf_inet_addr_cmp(&exp->tuple.dst.u3, &exp->saved_addr) ||
 	    exp->tuple.dst.u.udp.port != exp->saved_proto.udp.port) {
-		buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, port);
+		buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer),
+					       &newaddr, port);
 		if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
 				   matchoff, matchlen, buffer, buflen)) {
 			nf_ct_helper_log(skb, ct, "cannot mangle packet");
@@ -438,8 +441,8 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
 {
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+	char buffer[sizeof("4294967295")];
 	unsigned int matchoff, matchlen;
-	char buffer[sizeof("65536")];
 	int buflen, c_len;
 
 	/* Get actual SDP length */
@@ -454,7 +457,7 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
 			      &matchoff, &matchlen) <= 0)
 		return 0;
 
-	buflen = sprintf(buffer, "%u", c_len);
+	buflen = scnprintf(buffer, sizeof(buffer), "%u", c_len);
 	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
 			     matchoff, matchlen, buffer, buflen);
 }
@@ -491,7 +494,7 @@ static unsigned int nf_nat_sdp_addr(struct sk_buff *skb, unsigned int protoff,
 	char buffer[INET6_ADDRSTRLEN];
 	unsigned int buflen;
 
-	buflen = sip_sprintf_addr(ct, buffer, addr, false);
+	buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false);
 	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen,
 			      sdpoff, type, term, buffer, buflen))
 		return 0;
@@ -509,7 +512,7 @@ static unsigned int nf_nat_sdp_port(struct sk_buff *skb, unsigned int protoff,
 	char buffer[sizeof("nnnnn")];
 	unsigned int buflen;
 
-	buflen = sprintf(buffer, "%u", port);
+	buflen = scnprintf(buffer, sizeof(buffer), "%u", port);
 	if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
 			   matchoff, matchlen, buffer, buflen))
 		return 0;
@@ -529,7 +532,7 @@ static unsigned int nf_nat_sdp_session(struct sk_buff *skb, unsigned int protoff
 	unsigned int buflen;
 
 	/* Mangle session description owner and contact addresses */
-	buflen = sip_sprintf_addr(ct, buffer, addr, false);
+	buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false);
 	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
 			      SDP_HDR_OWNER, SDP_HDR_MEDIA, buffer, buflen))
 		return 0;
-- 
2.47.3
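
The sprintf() to scnprintf() conversions above depend on the return value being the number of bytes actually stored, so that buflen can never exceed the buffer handed to mangle_packet(). A minimal userspace sketch of that contract follows; bounded_printf() is a hypothetical stand-in built on vsnprintf(), not the kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf(): never write
 * more than size bytes, and return the number of characters actually stored
 * (excluding the trailing NUL) rather than the length that would have been
 * needed.
 */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(fmt != NULL);

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;		/* formatting error: nothing usable stored */
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated to fit */
	return n;
}
```

With a buffer of sizeof("65536") bytes, formatting 4294967295 through this helper stores five digits and returns 5, whereas sprintf() would have written past the end of the array. The enlarged sizeof("4294967295") buffer holds the full ten digits.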



* [PATCH net 04/11] netfilter: nfnetlink_osf: fix null-ptr-deref in nf_osf_ttl
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: "Kito Xu (veritas501)" <hxzene@gmail.com>

nf_osf_ttl() calls __in_dev_get_rcu(skb->dev) and passes the result
to in_dev_for_each_ifa_rcu() without checking for NULL. When the
receiving device has no IPv4 configuration (ip_ptr is NULL),
__in_dev_get_rcu() returns NULL and in_dev_for_each_ifa_rcu()
dereferences it unconditionally, causing a kernel crash.

This can happen when a packet arrives on a device that has had its
IPv4 configuration removed (e.g., MTU set below IPV4_MIN_MTU causing
inetdev_destroy) or on a device that was never assigned an IPv4
address, while an xt_osf or nft_osf rule with TTL_LESS mode is
active and the packet TTL exceeds the fingerprint TTL.

Add a NULL check for in_dev before using it. When in_dev is NULL,
return 0 (no match) since source-address locality cannot be
determined without IPv4 addresses on the device.

KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:nf_osf_match_one+0x204/0xa70
Call Trace:
 <IRQ>
 nf_osf_match+0x2f8/0x780
 xt_osf_match_packet+0x11c/0x1f0
 ipt_do_table+0x7fe/0x12b0
 nf_hook_slow+0xac/0x1e0
 ip_rcv+0x123/0x370
 __netif_receive_skb_one_core+0x166/0x1b0
 process_backlog+0x197/0x590
 __napi_poll+0xa1/0x540
 net_rx_action+0x401/0xd80
 handle_softirqs+0x19f/0x610
 </IRQ>

Fixes: a218dc82f0b5 ("netfilter: nft_osf: Add ttl option support")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_osf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 70172ca07858..4bbe64288b90 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -36,6 +36,9 @@ static inline int nf_osf_ttl(const struct sk_buff *skb,
 	const struct in_ifaddr *ifa;
 	int ret = 0;

+	if (!in_dev)
+		return 0;
+
 	if (ttl_check == NF_OSF_TTL_TRUE)
 		return ip->ttl == f_ttl;
 	if (ttl_check == NF_OSF_TTL_NOCHECK)
-- 
2.47.3
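
The guard described above can be sketched in userspace as follows. The types if_addr and ipv4_dev and the function saddr_is_local() are hypothetical simplifications of the in_device/in_ifaddr walk, written only to illustrate why a missing IPv4 configuration has to be treated as "no match" before the address list is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for struct in_ifaddr / struct in_device. */
struct if_addr {
	uint32_t address;
	uint32_t mask;
	const struct if_addr *next;
};

struct ipv4_dev {
	const struct if_addr *addrs;
};

/* Return 1 if saddr falls inside any subnet configured on the device. */
static int saddr_is_local(const struct ipv4_dev *in_dev, uint32_t saddr)
{
	const struct if_addr *ifa;

	/* No IPv4 configuration: locality cannot be determined, so no match. */
	if (!in_dev)
		return 0;

	for (ifa = in_dev->addrs; ifa; ifa = ifa->next) {
		if (((saddr ^ ifa->address) & ifa->mask) == 0)
			return 1;
	}
	return 0;
}

/* Usage example with 192.168.1.10/24 as the only configured address. */
void saddr_is_local_demo(void)
{
	const struct if_addr a = { 0xC0A8010Au, 0xFFFFFF00u, NULL };
	const struct ipv4_dev dev = { &a };

	assert(saddr_is_local(&dev, 0xC0A80163u) == 1);	/* 192.168.1.99 */
	assert(saddr_is_local(&dev, 0x0A000001u) == 0);	/* 10.0.0.1 */
	assert(saddr_is_local(NULL, 0xC0A80163u) == 0);	/* no IPv4 config */
}
```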


* [PATCH net 03/11] netfilter: nft_osf: restrict it to ipv4
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

This expression only supports IPv4, so restrict it accordingly.

Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_osf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c
index 1c0b493ef0a9..bdc2f6c90e2f 100644
--- a/net/netfilter/nft_osf.c
+++ b/net/netfilter/nft_osf.c
@@ -28,6 +28,11 @@ static void nft_osf_eval(const struct nft_expr *expr, struct nft_regs *regs,
 	struct nf_osf_data data;
 	struct tcphdr _tcph;
 
+	if (nft_pf(pkt) != NFPROTO_IPV4) {
+		regs->verdict.code = NFT_BREAK;
+		return;
+	}
+
 	if (pkt->tprot != IPPROTO_TCP) {
 		regs->verdict.code = NFT_BREAK;
 		return;
@@ -114,7 +119,6 @@ static int nft_osf_validate(const struct nft_ctx *ctx,
 
 	switch (ctx->family) {
 	case NFPROTO_IPV4:
-	case NFPROTO_IPV6:
 	case NFPROTO_INET:
 		hooks = (1 << NF_INET_LOCAL_IN) |
 			(1 << NF_INET_PRE_ROUTING) |
-- 
2.47.3



* [PATCH net 02/11] netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: Xiang Mei <xmei5@asu.edu>

nf_osf_match_one() computes ctx->window % f->wss.val in the
OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A
CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a
subsequent matching TCP SYN divides by zero and panics the kernel.

Reject the bogus fingerprint in nfnl_osf_add_callback() above the
per-option for-loop. f->wss is per-fingerprint, not per-option, so
the check must run regardless of f->opt_num (including 0). Also
reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that
as "should not happen".

Crash:
 Oops: divide error: 0000 [#1] SMP KASAN NOPTI
 RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
 Call Trace:
 <IRQ>
  nf_osf_match (net/netfilter/nfnetlink_osf.c:220)
  xt_osf_match_packet (net/netfilter/xt_osf.c:32)
  ipt_do_table (net/ipv4/netfilter/ip_tables.c:348)
  nf_hook_slow (net/netfilter/core.c:622)
  ip_local_deliver (net/ipv4/ip_input.c:265)
  ip_rcv (include/linux/skbuff.h:1162)
  __netif_receive_skb_one_core (net/core/dev.c:6181)
  process_backlog (net/core/dev.c:6642)
  __napi_poll (net/core/dev.c:7710)
  net_rx_action (net/core/dev.c:7945)
  handle_softirqs (kernel/softirq.c:622)

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_osf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 45d9ad231a92..70172ca07858 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -320,6 +320,10 @@ static int nfnl_osf_add_callback(struct sk_buff *skb,
 	if (f->opt_num > ARRAY_SIZE(f->opt))
 		return -EINVAL;
 
+	if (f->wss.wc >= OSF_WSS_MAX ||
+	    (f->wss.wc == OSF_WSS_MODULO && f->wss.val == 0))
+		return -EINVAL;
+
 	for (i = 0; i < f->opt_num; i++) {
 		if (!f->opt[i].length || f->opt[i].length > MAX_IPOPTLEN)
 			return -EINVAL;
-- 
2.47.3
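
The idea of rejecting a bad fingerprint when it is added, so that the per-packet path can divide without a guard, can be sketched as below. The enum wss_check, struct wss and both functions are hypothetical userspace mirrors of the OSF window-size fields; the enum values are illustrative and are not taken from the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the window-size check kinds; values are illustrative. */
enum wss_check {
	WSS_PLAIN,
	WSS_MSS,
	WSS_MTU,
	WSS_MODULO,
	WSS_MAX,	/* first invalid value */
};

struct wss {
	uint32_t wc;	/* one of enum wss_check */
	uint32_t val;	/* operand; the divisor when wc == WSS_MODULO */
};

/* Insertion-time validation: runs once per fingerprint, not per option. */
static int wss_validate(const struct wss *w)
{
	if (w->wc >= WSS_MAX)
		return -EINVAL;
	if (w->wc == WSS_MODULO && w->val == 0)
		return -EINVAL;
	return 0;
}

/* Per-packet path: relies on wss_validate() having rejected val == 0. */
static int wss_modulo_match(const struct wss *w, uint32_t window)
{
	assert(w->wc == WSS_MODULO && w->val != 0);
	return window % w->val == 0;
}
```

Checking once at insertion keeps the hot path free of extra branches and returns the error to the user who supplied the fingerprint, instead of failing later in softirq context.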



* [PATCH net 01/11] netfilter: arp_tables: fix IEEE1394 ARP payload parsing in arp_packet_match()
From: Pablo Neira Ayuso @ 2026-04-16 13:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260416131453.308611-1-pablo@netfilter.org>

From: Weiming Shi <bestswngs@gmail.com>

arp_packet_match() unconditionally parses the ARP payload assuming two
hardware addresses are present (source and target). However,
IPv4-over-IEEE1394 ARP (RFC 2734) omits the target hardware address
field, and arp_hdr_len() already accounts for this by returning a
shorter length for ARPHRD_IEEE1394 devices.

As a result, on IEEE1394 interfaces arp_packet_match() advances past a
nonexistent target hardware address and reads the wrong bytes for both
the target device address comparison and the target IP address. This
causes arptables rules to match against garbage data, leading to
incorrect filtering decisions: packets that should be accepted may be
dropped and vice versa.

The ARP stack in net/ipv4/arp.c (arp_create and arp_process) already
handles this correctly by skipping the target hardware address for
ARPHRD_IEEE1394. Apply the same pattern to arp_packet_match().

[ Pablo has mangled this patch to include Simon Horman's suggestions ]

Fixes: 6752c8db8e0c ("firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 1cdd9c28ab2d..a7a56890b5b5 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -110,13 +110,21 @@ static inline int arp_packet_match(const struct arphdr *arphdr,
 	arpptr += dev->addr_len;
 	memcpy(&src_ipaddr, arpptr, sizeof(u32));
 	arpptr += sizeof(u32);
-	tgt_devaddr = arpptr;
-	arpptr += dev->addr_len;
+
+	if (IS_ENABLED(CONFIG_FIREWIRE_NET) && dev->type == ARPHRD_IEEE1394) {
+		tgt_devaddr = NULL;
+	} else {
+		tgt_devaddr = arpptr;
+		arpptr += dev->addr_len;
+	}
 	memcpy(&tgt_ipaddr, arpptr, sizeof(u32));

 	if (NF_INVF(arpinfo, ARPT_INV_SRCDEVADDR,
 		    arp_devaddr_compare(&arpinfo->src_devaddr, src_devaddr,
-					dev->addr_len)) ||
+					dev->addr_len)))
+		return 0;
+
+	if (tgt_devaddr &&
 	    NF_INVF(arpinfo, ARPT_INV_TGTDEVADDR,
 		    arp_devaddr_compare(&arpinfo->tgt_devaddr, tgt_devaddr,
 					dev->addr_len)))
-- 
2.47.3
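
The payload walk described above can be sketched in userspace as follows. struct arp_fields and arp_payload_parse() are hypothetical names, and the omits_target_hw flag stands in for the dev->type == ARPHRD_IEEE1394 test; the hardware address length is a parameter so the sketch does not assume any particular link type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the variable part of an ARP packet. */
struct arp_fields {
	const uint8_t *src_hw;
	const uint8_t *tgt_hw;	/* NULL when the link type omits the field */
	uint32_t src_ip;	/* kept in network byte order */
	uint32_t tgt_ip;	/* kept in network byte order */
};

/*
 * Walk the payload that follows the fixed ARP header. When omits_target_hw
 * is set, the target hardware address is not present on the wire, so the
 * target IP address directly follows the source IP address.
 */
static void arp_payload_parse(const uint8_t *payload, size_t hw_len,
			      int omits_target_hw, struct arp_fields *out)
{
	const uint8_t *p = payload;

	assert(payload != NULL && out != NULL);

	out->src_hw = p;
	p += hw_len;
	memcpy(&out->src_ip, p, sizeof(out->src_ip));
	p += sizeof(out->src_ip);

	if (omits_target_hw) {
		out->tgt_hw = NULL;
	} else {
		out->tgt_hw = p;
		p += hw_len;
	}
	memcpy(&out->tgt_ip, p, sizeof(out->tgt_ip));
}
```

Callers then skip the target hardware address comparison whenever tgt_hw is NULL, which corresponds to the tgt_devaddr check added in the hunk above.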
