Netdev List


* Re: [PATCH net-next v3 09/12] net: airoha: Support multiple net_devices for a single FE GDM port
From: Lorenzo Bianconi @ 2026-04-10 10:19 UTC
  To: Jakub Kicinski
  Cc: andrew+netdev, davem, edumazet, pabeni, robh, krzk+dt, conor+dt,
	ansuelsmth, benjamin.larsson, linux-arm-kernel, linux-mediatek,
	netdev, devicetree
In-Reply-To: <20260410025724.1810007-1-kuba@kernel.org>

On Apr 09, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email says "For your consideration."
> ---
> net: airoha: Support multiple net_devices for a single FE GDM port
> 
> This commit introduces support for mapping multiple net_device interfaces
> to a single Frame Engine (FE) GDM port via an external hardware arbiter.
> It updates the rx/tx paths, port allocation, and registration logic to
> handle the new net_device array.
> 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> [ ... ]
> > @@ -615,9 +617,8 @@ static int airoha_qdma_rx_process(struct airoha_queue *q, int budget)
> >  		u32 hash, reason, msg1 = le32_to_cpu(desc->msg1);
> >  		struct page *page = virt_to_head_page(e->buf);
> >  		u32 desc_ctrl = le32_to_cpu(desc->ctrl);
> > -		struct airoha_gdm_port *port;
> > -		struct net_device *netdev;
> > -		int data_len, len, p;
> > +		struct airoha_gdm_dev *dev;
> > +		int data_len, len;
> >  
> >  		if (!(desc_ctrl & QDMA_DESC_DONE_MASK))
> >  			break;
> 
> While looking at this code, I noticed an issue that predates this patch.
> Can the lack of a memory barrier after checking the descriptor ownership bit
> cause stale reads?
> 
> The driver checks QDMA_DESC_DONE_MASK to verify that the hardware has finished
> writing the descriptor. However, it reads desc->msg1 before this check and
> without a dma_rmb():
> 
> 		u32 hash, reason, msg1 = le32_to_cpu(desc->msg1);
> 		struct page *page = virt_to_head_page(e->buf);
> 		u32 desc_ctrl = le32_to_cpu(desc->ctrl);
> 
> 		if (!(desc_ctrl & QDMA_DESC_DONE_MASK))
> 			break;
> 
> Since weakly ordered architectures can reorder memory reads, could the CPU
> prefetch msg1 before the DMA engine writes the new descriptor values, leading
> to routing the packet to the wrong gdm_dev?

This issue has already been fixed by the following commit:

commit 4ae0604a0673e11e2075b178387151fcad5111b5
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Tue Apr 7 08:48:04 2026 +0200

    net: airoha: Add dma_rmb() and READ_ONCE() in airoha_qdma_rx_process()
    
    Add missing dma_rmb() in airoha_qdma_rx_process routine to make sure the
    DMA read operations are completed when the NIC reports the processing on
    the current descriptor is done. Moreover, add missing READ_ONCE() in
    airoha_qdma_rx_process() for DMA descriptor control fields in order to
    avoid any compiler reordering.
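
The ordering requirement described above can be modelled in userspace with
C11 acquire/release atomics. This is only an analogy and not the driver code:
the kernel uses dma_rmb() and READ_ONCE() on real DMA descriptors, while
struct demo_desc, DESC_DONE and the demo_* helpers below are hypothetical
names introduced for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_DONE 0x1u

/* Hypothetical descriptor: the "device" fills msg1, then sets DONE in ctrl. */
struct demo_desc {
	uint32_t msg1;
	_Atomic uint32_t ctrl;
};

/* Producer side, standing in for the DMA engine. */
static void demo_desc_publish(struct demo_desc *d, uint32_t msg1)
{
	d->msg1 = msg1;
	/* Release: the msg1 write is visible before DONE can be observed. */
	atomic_store_explicit(&d->ctrl, DESC_DONE, memory_order_release);
}

/* Consumer side: check DONE first and read the payload only afterwards. */
static bool demo_desc_consume(struct demo_desc *d, uint32_t *msg1)
{
	uint32_t ctrl;

	assert(d && msg1);
	/* Acquire: plays the role of the read barrier in this model. */
	ctrl = atomic_load_explicit(&d->ctrl, memory_order_acquire);
	if (!(ctrl & DESC_DONE))
		return false;

	*msg1 = d->msg1;
	return true;
}
```

The point of the sketch is the order of operations in demo_desc_consume():
the payload is read only after the ownership bit has been observed, which is
the opposite of reading msg1 up front as in the quoted pre-fix code.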

> 
> [ ... ]
> > @@ -1764,9 +1765,7 @@ static int airoha_set_gdm2_loopback(struct airoha_gdm_dev *dev)
> >  	airoha_fe_clear(eth, REG_FE_VIP_PORT_EN, BIT(AIROHA_GDM2_IDX));
> >  	airoha_fe_clear(eth, REG_FE_IFC_PORT_EN, BIT(AIROHA_GDM2_IDX));
> >  
> > -	/* XXX: handle XSI_USB_PORT and XSI_PCE1_PORT */
> > -	nbq = port->id == AIROHA_GDM3_IDX && airoha_is_7581(eth) ? 4 : 0;
> > -	src_port = eth->soc->ops.get_src_port_id(port, nbq);
> > +	src_port = eth->soc->ops.get_src_port_id(port, dev->nbq);
> >  	if (src_port < 0)
> >  		return src_port;
> 
> Can configuring the loopback for one device break connectivity for other
> devices sharing the same port?
> 
> When multiple net_devices (like PCIE0 and PCIE1 on GDM3) open and call
> airoha_set_gdm2_loopback(), they modify global hardware registers such as
> REG_FE_WAN_PORT.
> 
> Since WAN0_MASK is unconditionally updated to track the current device's
> src_port:
> 
> 	airoha_fe_rmw(eth, REG_FE_WAN_PORT,
> 		      WAN1_EN_MASK | WAN1_MASK | WAN0_MASK,
> 		      FIELD_PREP(WAN0_MASK, src_port));
> 
> won't the last configured device clobber the hardware state for the others?

This issue is fixed with the following patch in the series:

commit 008995bd8f39c5fdf37a933fa3c962e7ef52e8c6
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Fri Apr 3 12:07:27 2026 +0200

    net: airoha: Introduce WAN device flag
    
    Introduce WAN flag to specify if a given device is used to transmit/receive
    WAN or LAN traffic. Current codebase supports specifying LAN/WAN device
    configuration in ndo_init() callback during device bootstrap.
    Please note it is possible to specify multiple LAN devices but just a
    single WAN one.

> 
> [ ... ]
> > @@ -1987,7 +1986,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
> >  	}
> >  
> >  	fport = airoha_get_fe_port(dev);
> > -	msg1 = FIELD_PREP(QDMA_ETH_TXMSG_FPORT_MASK, fport) |
> > +	msg1 = FIELD_PREP(QDMA_ETH_TXMSG_NBOQ_MASK, dev->nbq) |
> > +	       FIELD_PREP(QDMA_ETH_TXMSG_FPORT_MASK, fport) |
> >  	       FIELD_PREP(QDMA_ETH_TXMSG_METER_MASK, 0x7f);
> >  
> >  	q = &qdma->q_tx[qid];
> 
> Does sharing the hardware QDMA TX queues among multiple net_device instances
> introduce a permanent transmit stall?
> 
> In airoha_dev_xmit(), if a shared hardware queue is full, the driver stops the
> netdev_queue of the specific netdev making the call:
> 
> 	if (q->queued + nr_frags >= q->ndesc) {
> 		/* not enough space in the queue */
> 		netif_tx_stop_queue(txq);
> 		spin_unlock_bh(&q->lock);
> 		return NETDEV_TX_BUSY;
> 	}
> 
> However, the TX completion handler (airoha_tx_irq_process()) only wakes the
> queue of the netdev attached to the completed packet. If an interface's TX
> queue is stopped but it has no pending packets in the hardware queue (because
> another interface filled it), will it ever get woken up?

This issue is orthogonal to this series and it will be fixed with a dedicated
patch.
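
For reference, the stall scenario raised in the review can be reproduced with
a small userspace model. Everything here is hypothetical (struct shared_ring,
ring_xmit(), ring_complete(), the ring size); it is not the driver code and
the wake_all variant is just one possible approach, not necessarily what the
dedicated patch will do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4
#define NUM_DEVS  2

/* Hypothetical model of one hardware TX ring shared by several netdevs. */
struct shared_ring {
	int owner[RING_SIZE];   /* device that queued each slot */
	size_t head, queued;
	bool stopped[NUM_DEVS]; /* per-device "queue stopped" flag */
};

/* Queue one packet; on a full ring only the caller's queue is stopped. */
static bool ring_xmit(struct shared_ring *r, int dev)
{
	assert(dev >= 0 && dev < NUM_DEVS);
	if (r->queued == RING_SIZE) {
		r->stopped[dev] = true;
		return false;
	}
	r->owner[(r->head + r->queued) % RING_SIZE] = dev;
	r->queued++;
	return true;
}

/* Complete the oldest packet. With wake_all == false only the owner of the
 * completed packet is woken, as described in the review comment.
 */
static void ring_complete(struct shared_ring *r, bool wake_all)
{
	int dev;

	if (!r->queued)
		return;

	dev = r->owner[r->head];
	r->head = (r->head + 1) % RING_SIZE;
	r->queued--;

	if (!wake_all) {
		r->stopped[dev] = false;
		return;
	}
	for (dev = 0; dev < NUM_DEVS; dev++)
		r->stopped[dev] = false;
}
```

If device 0 fills the ring and device 1 then fails to transmit, completions
with wake_all == false never clear stopped[1], since device 1 owns no slot.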

> 
> [ ... ]
> > @@ -2955,11 +2958,25 @@ static int airoha_alloc_gdm_device(struct airoha_eth *eth,
> >  			 netdev->dev_addr);
> >  	}
> >  
> > +	/* Allowed nbq for EN7581 on GDM3 port are 4 and 5 for PCIE0
> > +	 * and PCIE1 respectively.
> > +	 */
> > +	index = nbq;
> > +	if (index && airoha_is_7581(eth) && port->id == AIROHA_GDM3_IDX)
> > +		index -= 4;
> > +
> > +	if (index >= ARRAY_SIZE(port->devs) || port->devs[index]) {
> > +		dev_err(eth->dev, "invalid nbq id: %d\n", nbq);
> > +		return -EINVAL;
> > +	}
> > +
> > +	netdev->dev.of_node = of_node_get(np);
> >  	dev = netdev_priv(netdev);
> 
> Does this code leak the of_node reference?
> 
> The node reference is explicitly incremented via of_node_get(np). The networking
> core does not automatically drop explicit of_node references on free_netdev().
> Without a matching of_node_put() in the error paths or the driver's teardown
> function (airoha_remove()), will this reference permanently leak when the device
> is destroyed?

ack, I will fix it in v4.

> 
> [ ... ]
> > @@ -3003,7 +3021,43 @@ static int airoha_alloc_gdm_port(struct airoha_eth *eth,
> >  	if (err)
> >  		return err;
> >  
> > -	return airoha_alloc_gdm_device(eth, port, np);
> > +	for_each_child_of_node(np, node) {
> > +		/* Multiple external serdes connected to the FE GDM port via an
> > +		 * external arbiter.
> > +		 */
> > +		const __be32 *nbq_ptr;
> > +		int nbq;
> > +
> > +		if (!of_device_is_compatible(node, "airoha,eth-port"))
> > +			continue;
> > +
> > +		d++;
> > +		if (!of_device_is_available(node))
> > +			continue;
> > +
> > +		nbq_ptr = of_get_property(node, "reg", NULL);
> > +		if (!nbq_ptr) {
> > +			dev_err(eth->dev, "missing nbq id\n");
> > +			of_node_put(node);
> > +			return -EINVAL;
> > +		}
> > +
> > +		/* Verify the provided nbq parameter is valid */
> > +		nbq = be32_to_cpup(nbq_ptr);
> > +		err = eth->soc->ops.get_src_port_id(port, nbq);
> > +		if (err < 0) {
> > +			of_node_put(node);
> > +			return err;
> > +		}
> > +
> > +		err = airoha_alloc_gdm_device(eth, port, nbq, node);
> > +		if (err) {
> > +			of_node_put(node);
> > +			return err;
> > +		}
> > +	}
> > +
> > +	return !d ? airoha_alloc_gdm_device(eth, port, 0, np) : 0;
> >  }
> 
> Does passing 0 for nbq in the fallback path break older device trees on the
> EN7581 SoC?
> 
> Later, when the interface is brought up, airoha_set_gdm2_loopback() calls
> airoha_en7581_get_src_port_id(), which returns -EINVAL for GDM3 if nbq is
> not 4 or 5:
> 
> 	switch (port->id) {
> 	case AIROHA_GDM3_IDX:
> 		/* 7581 SoC supports PCIe serdes on GDM3 port */
> 		if (nbq == 4)
> 			return HSGMII_LAN_7581_PCIE0_SRCPORT;
> 		if (nbq == 5)
> 			return HSGMII_LAN_7581_PCIE1_SRCPORT;
> 		break;
> 
> If nbq is hardcoded to 0 for legacy DTs, won't this cause interface
> initialization to fail unconditionally for GDM3?

ack, I will fix it in v4.
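
To make the failing case concrete, here is a standalone sketch of the two
checks involved, based only on the hunks quoted above. The constants are
assumptions (DEMO_GDM3_IDX, DEMO_MAX_DEVS as the size of port->devs[], and
the placeholder source-port values), and the demo_* helpers are hypothetical
names, not driver functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_GDM3_IDX 3 /* assumed value of AIROHA_GDM3_IDX */
#define DEMO_MAX_DEVS 2 /* assumed size of port->devs[] */

/* Mirrors the slot computation quoted from airoha_alloc_gdm_device(). */
static int demo_nbq_to_index(int port_id, bool is_7581, int nbq)
{
	int index = nbq;

	assert(nbq >= 0);
	if (index && is_7581 && port_id == DEMO_GDM3_IDX)
		index -= 4;

	if (index < 0 || index >= DEMO_MAX_DEVS)
		return -EINVAL;

	return index;
}

/* Mirrors the GDM3 case of the quoted EN7581 source-port lookup: only
 * nbq 4 and 5 are accepted. Return values are placeholders.
 */
static int demo_gdm3_src_port(int nbq)
{
	if (nbq == 4)
		return 0; /* placeholder for the PCIE0 source port */
	if (nbq == 5)
		return 1; /* placeholder for the PCIE1 source port */

	return -EINVAL;
}
```

With nbq == 0 the allocation step succeeds and picks slot 0, but the later
source-port lookup rejects the same value, which is the mismatch reported for
legacy device trees.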

Regards,
Lorenzo

* Re: [PATCH net] netrom: do some basic forms of validation on incoming frames
From: Simon Horman @ 2026-04-10 10:28 UTC
  To: Greg Kroah-Hartman
  Cc: Jakub Kicinski, netdev, linux-kernel, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-hams, Yizhe Zhuang, stable
In-Reply-To: <2026041026-excuse-slashing-c4ee@gregkh>

On Fri, Apr 10, 2026 at 07:24:36AM +0200, Greg Kroah-Hartman wrote:
> On Thu, Apr 09, 2026 at 08:32:35PM -0700, Jakub Kicinski wrote:
> > On Thu, 9 Apr 2026 20:03:28 +0100 Simon Horman wrote:
> > > I expect that checking skb->len isn't sufficient here
> > > and pskb_may_pull needs to be used to ensure that
> > > the data is also available in the linear section of the skb.
> > 
> > Or for simplicity we could also be testing against skb_headlen()
> > since we don't expect any legit non-linear frames here? Dunno.

Sure, that's fine by me if it leads to simpler code than
using pskb_may_pull(). Else I'd lean towards pskb_may_pull()
as it is a more general approach that feels worth proliferating.

> I'll be glad to change this either way, your call.  Given that this is
> an obsolete protocol that seems to only be a target for drive-by fuzzers
> to attack, whatever the simplest thing to do to quiet them up I'll be
> glad to implement.
> 
> Or can we just delete this stuff entirely?  :)

Deleting sounds good to me.
But we likely need a deprecation process.
In which case fixing these bugs still makes sense for the short term.

* Re: [PATCH nf] netfilter: nf_tables: use RCU-safe list primitives for basechain hook list
From: Florian Westphal @ 2026-04-10 10:31 UTC
  To: Weiming Shi
  Cc: Pablo Neira Ayuso, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
	netdev, linux-kernel, Xiang Mei
In-Reply-To: <20260410101321.915190-2-bestswngs@gmail.com>

Weiming Shi <bestswngs@gmail.com> wrote:
> NFT_MSG_GETCHAIN runs as an NFNL_CB_RCU callback, so chain dumps
> traverse basechain->hook_list under rcu_read_lock() without holding
> commit_mutex. Meanwhile, nft_delchain_hook() mutates that same live
> hook_list with plain list_move() and list_splice(), and the commit/abort
> paths splice hooks back with plain list_splice(). None of these are
> RCU-safe list operations.
> 
> A concurrent GETCHAIN dump can observe partially updated list pointers,
> follow them into stack-local or transaction-private list heads, and
> crash when container_of() produces a bogus struct nft_hook pointer.

Right, but this is broken by design.

> Replace list_move() in nft_delchain_hook() with list_del_rcu() plus an
> intermediate pointer array, followed by synchronize_rcu() before the
> deleted hooks' list pointers are reused to link them into the
> transaction's private list. In the error paths, put hooks back with
> list_add_tail_rcu() which is safe for concurrent RCU readers (they
> either continue to the original successor or see the list head and
> terminate the walk).

I don't understand the existing code.

I don't even understand why
we have a difference between the 'update delete' and chain delete cases.

I think it's wrong to unlink and then relink on abort.
What prevents nft_delchain_hook() from using the normal approach done
by nft_delchain()...?

This existing code appears to be way too complex.

* Re: [PATCH net-next v3 10/12] net: airoha: Do not stop GDM port if it is shared
From: Lorenzo Bianconi @ 2026-04-10 10:35 UTC
  To: Jakub Kicinski
  Cc: andrew+netdev, davem, edumazet, pabeni, robh, krzk+dt, conor+dt,
	ansuelsmth, benjamin.larsson, linux-arm-kernel, linux-mediatek,
	netdev, devicetree, Xuegang Lu
In-Reply-To: <20260410025726.1810041-1-kuba@kernel.org>

> This is an AI-generated review of your patch. The human sending this
> email says "For your consideration."
> ---
> net: airoha: Do not stop GDM port if it is shared
> 
> Theoretically, in the current codebase, two independent net_devices can
> be connected to the same GDM port so we need to check the GDM port is not
> used by any other running net_device before setting the forward
> configuration to FE_PSE_PORT_DROP.
> 
> Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
> 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index fd27ba13029bd..5b0cd37b155e4 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> 
> [ ... ]
> 
> > @@ -1688,18 +1693,18 @@ static int airoha_dev_stop(struct net_device *netdev)
> >  	struct airoha_gdm_dev *dev = netdev_priv(netdev);
> >  	struct airoha_gdm_port *port = dev->port;
> >  	struct airoha_qdma *qdma = dev->qdma;
> > -	int i, err;
> > +	int i;
> >  
> >  	netif_tx_disable(netdev);
> > -	err = airoha_set_vip_for_gdm_port(dev, false);
> > -	if (err)
> > -		return err;
> > -
> >  	for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++)
> >  		netdev_tx_reset_subqueue(netdev, i);
> >  
> > -	airoha_set_gdm_port_fwd_cfg(qdma->eth, REG_GDM_FWD_CFG(port->id),
> > -				    FE_PSE_PORT_DROP);
> > +	if (atomic_dec_and_test(&port->users)) {
> > +		airoha_set_vip_for_gdm_port(dev, false);
> > +		airoha_set_gdm_port_fwd_cfg(qdma->eth,
> > +					    REG_GDM_FWD_CFG(port->id),
> > +					    FE_PSE_PORT_DROP);
> > +	}
> >  
> >  	if (atomic_dec_and_test(&qdma->users)) {
> >  		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> > 				  GLOBAL_CFG_TX_DMA_EN_MASK |
> > 				  GLOBAL_CFG_RX_DMA_EN_MASK);
> > 
> > 		for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
> > 			if (!qdma->q_tx[i].ndesc)
> > 				continue;
> > 
> > 			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
> > 		}
> > 	}
> 
> This problem wasn't introduced by this patch, but does the QDMA teardown
> path have a DMA corruption vulnerability here?
> 
> When GLOBAL_CFG_TX_DMA_EN_MASK is cleared, it fails to poll
> GLOBAL_CFG_TX_DMA_BUSY_MASK, proceeding to unmap memory while the hardware
> DMA engine may still be processing in-flight descriptors.
> 
> Furthermore, airoha_qdma_cleanup_tx_queue() unmaps the DMA addresses and
> frees the SKBs, but it does not appear to clear the physical hardware
> descriptors or reset the hardware ring pointers (REG_TX_CPU_IDX /
> REG_TX_DMA_IDX).
> 
> When the device is later reopened in airoha_dev_open(),
> GLOBAL_CFG_TX_DMA_EN_MASK is set again without re-initializing the rings.
> Will the hardware DMA engine resume fetching from its previous index,
> processing stale descriptors whose backing memory was already unmapped
> and freed?

ack, I will work on it with a dedicated patch.

> 
> [ ... ]
> 
> > @@ -1851,10 +1856,22 @@ static void airoha_dev_get_stats64(struct net_device *netdev,
> >  static int airoha_dev_change_mtu(struct net_device *netdev, int mtu)
> >  {
> >  	struct airoha_gdm_dev *dev = netdev_priv(netdev);
> > +	u32 cur_len, len = ETH_HLEN + mtu + ETH_FCS_LEN;
> >  	struct airoha_gdm_port *port = dev->port;
> > -	u32 len = ETH_HLEN + mtu + ETH_FCS_LEN;
> >  	struct airoha_eth *eth = dev->eth;
> >  
> > +	cur_len = FIELD_GET(GDM_LONG_LEN_MASK,
> > +			    airoha_fe_rr(eth, REG_GDM_LEN_CFG(port->id)));
> > +	if (len < cur_len) {
> > +		u8 port_refcnt = atomic_read(&port->users);
> > +
> > +		/* We can decrease the device MTU just if the GDM port is
> > +		 * not shared or if the other device is not running.
> > +		 */
> > +		if (port_refcnt > 1 || (port_refcnt && !netif_running(netdev)))
> > +			return -EBUSY;
> > +	}
> > +
> 
> Does this MTU decrease validation create a state trap on shared ports?
> 
> If both eth0 and eth1 (sharing a port) are up, and a user increases eth0 MTU
> to 9000, the hardware MTU updates to 9000. If the user then tries to lower
> eth0 MTU back to 1500, len < cur_len evaluates to true. Because eth1 is also
> up (port_refcnt > 1), this returns -EBUSY.
> 
> This would prevent either interface from decreasing its MTU without fully
> bringing down the sibling interface.
> 
> Should the software netdev->mtu be decoupled from the hardware validation,
> allowing valid MTU changes in software while dynamically programming the
> hardware MTU to the maximum of all currently up interfaces on the shared
> port?

It seems more correct to me to avoid receiving in the stack packets bigger than
the configured 'SW MTU'.
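
For clarity, the check added to airoha_dev_change_mtu() in this patch can be
expressed as a pure function. This is a userspace restatement of the quoted
hunk only; demo_check_mtu() and the DEMO_* constants are hypothetical names,
with cur_len standing for the GDM_LONG_LEN value currently programmed in
hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ETH_HLEN    14 /* Ethernet header length */
#define DEMO_ETH_FCS_LEN 4  /* frame check sequence length */

/* Return 0 if the MTU change is allowed, -EBUSY if a decrease is refused
 * because the GDM port is shared or the calling device is not running.
 */
static int demo_check_mtu(uint32_t cur_len, int new_mtu,
			  unsigned int port_users, bool dev_running)
{
	uint32_t len;

	assert(new_mtu >= 0);
	len = DEMO_ETH_HLEN + new_mtu + DEMO_ETH_FCS_LEN;

	if (len < cur_len) {
		/* Decreasing is allowed only if the port is not shared or
		 * the other device is not running.
		 */
		if (port_users > 1 || (port_users && !dev_running))
			return -EBUSY;
	}

	return 0;
}
```

In the scenario from the review (two running devices, hardware length set for
MTU 9000), lowering the MTU to 1500 on either device returns -EBUSY, while
increasing it is never refused by this check.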

Regards,
Lorenzo


* Re: [PATCH net-next v3 00/12] net: airoha: Support multiple net_devices connected to the same GDM port
From: Lorenzo Bianconi @ 2026-04-10 10:39 UTC
  To: Jakub Kicinski
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Christian Marangi,
	Benjamin Larsson, linux-arm-kernel, linux-mediatek, netdev,
	devicetree, Xuegang Lu
In-Reply-To: <20260409195645.16c68979@kernel.org>

On Apr 09, Jakub Kicinski wrote:
> On Mon, 06 Apr 2026 12:34:05 +0200 Lorenzo Bianconi wrote:
> > EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
> > Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
> > manages the traffic in a TDM manner. As a result multiple net_devices can
> > connect to the same GDM{3,4} port and there is a theoretical "1:n"
> > relation between GDM ports and net_devices.
> 
> Still waiting for the device tree review. I'm going to blindly send out
> the Sashiko review, please comment if any of it makes sense?

ack, I will do.

Regards,
Lorenzo

* Re: [bug report] ipv4: icmp: fix null-ptr-deref in icmp_build_probe()
From: Fernando Fernandez Mancera @ 2026-04-10 10:51 UTC
  To: Dan Carpenter, Yiqi Sun; +Cc: Simon Horman, netdev
In-Reply-To: <adjOCdCW1EpPl8lf@stanley.mountain>

On 4/10/26 12:16 PM, Dan Carpenter wrote:
> Hello Yiqi Sun,
> 
> Commit fde29fd93493 ("ipv4: icmp: fix null-ptr-deref in
> icmp_build_probe()") from Apr 2, 2026 (linux-next), leads to the
> following Smatch static checker warning:
> 
> 	net/ipv4/icmp.c:1351 icmp_build_probe()
> 	warn: 'dev' is not an error pointer
> 
> net/ipv4/icmp.c
>      1341 #if IS_ENABLED(CONFIG_IPV6)
>      1342                 case ICMP_AFI_IP6:
>      1343                         if (iio->ident.addr.ctype3_hdr.addrlen != sizeof(struct in6_addr))
>      1344                                 goto send_mal_query;
>      1345                         dev = ipv6_dev_find(net, &iio->ident.addr.ip_addr.ipv6_addr, dev);
>      1346
>      1347                         /*
>      1348                          * If IPv6 identifier lookup is unavailable, silently
>      1349                          * discard the request instead of misreporting NO_IF.
>      1350                          */
> --> 1351                         if (IS_ERR(dev))
>      1352                                 return false;
> 
> It looks like there were two patches that went in around the same
> time.  Commit fde29fd93493 ("ipv4: icmp: fix null-ptr-deref in
> icmp_build_probe()") updated the checking for
> ipv6_stub->ipv6_dev_find() but d98adfbdd5c0 ("ipv4: drop ipv6_stub usage
> and use direct function calls") changed it to not return error pointers.
> 
> This IS_ERR() check can be removed.
> 

Yes, I thought it was going to happen during merging but I guess it 
makes sense to do it on a separate patch.

I am sending a patch to net-next addressing this.

Thanks!


* Re: [PATCH net-next v11 03/14] net: Add lease info to queue-get response
From: Daniel Borkmann @ 2026-04-10 11:10 UTC
  To: Jakub Kicinski
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6
In-Reply-To: <20260409185105.721a6465@kernel.org>

On 4/10/26 3:51 AM, Jakub Kicinski wrote:
> On Thu, 9 Apr 2026 17:32:31 +0200 Daniel Borkmann wrote:
>>> I think the test has to be reworked but of the available options seems
>>> like merging it as is and following up quickly is the best. I've only
>>> set up the container testing in our CI yesterday anyway so there may
>>> be more things that need changing in the test as we gain experience :S
>>
>> No objections obviously if you want to land as-is with your refactor on
>> top.
> 
> Done, please double check my work, there were some conflicts with net.

Looks good to me, thanks a lot for everything!

* Re: [Intel-wired-lan] [PATCH net v2 3/4] iavf: send MAC change request synchronously
From: Jose Ignacio Tornos Martinez @ 2026-04-10 11:12 UTC
  To: przemyslaw.kitszel
  Cc: anthony.l.nguyen, davem, edumazet, intel-wired-lan,
	jacob.e.keller, jtornosm, kohei.enju, kuba, netdev, pabeni, poros,
	stable
In-Reply-To: <89bfd605-1877-4d40-95e1-bfeae6624168@intel.com>

Hello Przemek,

Thank you for your comments.
I will try to include them in the next version.

Best regards
Jose Ignacio


* Re: [syzbot] [mptcp?] possible deadlock in mptcp_pm_mp_prio_send_ack
From: Matthieu Baerts @ 2026-04-10 11:13 UTC
  To: syzbot
  Cc: davem, edumazet, geliang, horms, kuba, linux-kernel, martineau,
	mptcp, netdev, pabeni, syzkaller-bugs
In-Reply-To: <69d7de34.050a0220.3030df.0019.GAE@google.com>

Hello,

On 09/04/2026 19:13, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    1caa871bb061 Merge branch 'net-stmmac-fix-tegra234-mgbe-cl..
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=11d74e06580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6754c86e8d9e4c91
> dashboard link: https://syzkaller.appspot.com/bug?extid=2204dbe6a049b3218db9
> compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/014aae23b990/disk-1caa871b.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/c574a710638c/vmlinux-1caa871b.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/b29909f4efc4/bzImage-1caa871b.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+2204dbe6a049b3218db9@syzkaller.appspotmail.com
> 
> netlink: 8 bytes leftover after parsing attributes in process `syz.2.2034'.
> netlink: 8 bytes leftover after parsing attributes in process `syz.2.2034'.
> ======================================================
> WARNING: possible circular locking dependency detected
> syzkaller #0 Not tainted
> ------------------------------------------------------
> syz.2.2034/13659 is trying to acquire lock:
> ffff888031173560 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_pm_mp_prio_send_ack+0xaf8/0xba0 net/mptcp/pm.c:296
> 
> but task is already holding lock:
> ffff88807e300ea0 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
> ffff88807e300ea0 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_pm_nl_set_flags_all net/mptcp/pm_kernel.c:1482 [inline]
> ffff88807e300ea0 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_pm_nl_set_flags+0x795/0xc90 net/mptcp/pm_kernel.c:1551
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #7 (sk_lock-AF_INET){+.+.}-{0:0}:
>        lock_sock_nested+0x48/0x100 net/core/sock.c:3780
>        lock_sock include/net/sock.h:1709 [inline]
>        inet_shutdown+0x6a/0x390 net/ipv4/af_inet.c:919
>        nbd_mark_nsock_dead+0x2e9/0x560 drivers/block/nbd.c:318

If I'm not mistaken, it looks like this issue is also due to nbd
introducing a lockdep dependency between reclaim and af_socket, and this
is similar to a previous report:

#syz dup: [syzbot] [mptcp?] possible deadlock in mptcp_subflow_create_socket (2)

If that's not correct, please unduplicate it.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [PATCH nf] netfilter: nf_tables: use RCU-safe list primitives for basechain hook list
From: Pablo Neira Ayuso @ 2026-04-10 11:14 UTC
  To: Florian Westphal
  Cc: Weiming Shi, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
	netdev, linux-kernel, Xiang Mei
In-Reply-To: <adjRiG_Bp3WpRYOz@strlen.de>

On Fri, Apr 10, 2026 at 12:31:36PM +0200, Florian Westphal wrote:
> Weiming Shi <bestswngs@gmail.com> wrote:
[...]
> > Replace list_move() in nft_delchain_hook() with list_del_rcu() plus an
> > intermediate pointer array, followed by synchronize_rcu() before the
> > deleted hooks' list pointers are reused to link them into the
> > transaction's private list. In the error paths, put hooks back with
> > list_add_tail_rcu() which is safe for concurrent RCU readers (they
> > either continue to the original successor or see the list head and
> > terminate the walk).
> 
> I don't understand the existing code.

I am working on an alternative fix.

* [PATCH net 1/1] net/sched: act_ct: Only release RCU read lock after ct_ft
From: Jamal Hadi Salim @ 2026-04-10 11:16 UTC
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, jiri, zdi-disclosures,
	security, Jamal Hadi Salim, Victor Nogueira

When looking up a flow table in act_ct in tcf_ct_flow_table_get(),
rhashtable_lookup_fast() internally opens and closes an RCU read critical
section before returning ct_ft.
The tcf_ct_flow_table_cleanup_work() can complete before refcount_inc_not_zero()
is invoked on the returned ct_ft resulting in a UAF on the already freed ct_ft
object. This vulnerability can lead to privilege escalation.

Analysis from zdi-disclosures@trendmicro.com:
When initializing act_ct, tcf_ct_init() is called, which internally triggers
tcf_ct_flow_table_get().

static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params)

{
                struct zones_ht_key key = { .net = net, .zone = params->zone };
                struct tcf_ct_flow_table *ct_ft;
                int err = -ENOMEM;

                mutex_lock(&zones_mutex);
                ct_ft = rhashtable_lookup_fast(&zones_ht, &key, zones_params); // [1]
                if (ct_ft && refcount_inc_not_zero(&ct_ft->ref)) // [2]
                                goto out_unlock;
                ...
}

static __always_inline void *rhashtable_lookup_fast(
                struct rhashtable *ht, const void *key,
                const struct rhashtable_params params)
{
                void *obj;

                rcu_read_lock();
                obj = rhashtable_lookup(ht, key, params);
                rcu_read_unlock();

                return obj;
}

At [1], rhashtable_lookup_fast() looks up and returns the corresponding ct_ft
from zones_ht. The lookup is performed within an RCU read critical section
through rcu_read_lock() / rcu_read_unlock(), which prevents the object from
being freed. However, at the point of function return, rcu_read_unlock() has
already been called, and there is nothing preventing ct_ft from being freed
before reaching refcount_inc_not_zero(&ct_ft->ref) at [2]. This interval becomes
the race window, during which ct_ft can be freed.

Free Process:

tcf_ct_flow_table_put() is executed through the path tcf_ct_cleanup() ->
call_rcu() -> tcf_ct_params_free_rcu() -> tcf_ct_params_free() ->
tcf_ct_flow_table_put().

static void tcf_ct_flow_table_put(struct tcf_ct_flow_table *ct_ft)
{
                if (refcount_dec_and_test(&ct_ft->ref)) {
                                rhashtable_remove_fast(&zones_ht, &ct_ft->node, zones_params);
                                INIT_RCU_WORK(&ct_ft->rwork, tcf_ct_flow_table_cleanup_work); // [3]
                                queue_rcu_work(act_ct_wq, &ct_ft->rwork);
                }
}

At [3], tcf_ct_flow_table_cleanup_work() is scheduled as RCU work:

static void tcf_ct_flow_table_cleanup_work(struct work_struct *work)

{
                struct tcf_ct_flow_table *ct_ft;
                struct flow_block *block;

                ct_ft = container_of(to_rcu_work(work), struct tcf_ct_flow_table,
                                                                rwork);
                nf_flow_table_free(&ct_ft->nf_ft);
                block = &ct_ft->nf_ft.flow_block;
                down_write(&ct_ft->nf_ft.flow_block_lock);
                WARN_ON(!list_empty(&block->cb_list));
                up_write(&ct_ft->nf_ft.flow_block_lock);
                kfree(ct_ft); // [4]

                module_put(THIS_MODULE);
}

tcf_ct_flow_table_cleanup_work() frees ct_ft at [4]. When this function executes
between [1] and [2], UAF occurs.

This race condition has a very short race window, making it generally
difficult to trigger. Therefore, to trigger the vulnerability, an msleep(100)
was inserted after [1].
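
The "take a reference only if the object is still live" step that the fix
relies on can be sketched in userspace with C11 atomics. This is a stand-in
for the kernel's refcount_inc_not_zero() and is not the act_ct code; the
demo_ref_* helpers are hypothetical. The sketch does not model RCU: in the
real code the counter's memory must also stay valid between the lookup and
the increment, which is what holding rcu_read_lock() across both provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference unless the count already dropped to zero. */
static bool demo_ref_get_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}

	return false;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool demo_ref_put(atomic_uint *ref)
{
	unsigned int old = atomic_fetch_sub(ref, 1);

	assert(old > 0);
	return old == 1;
}
```

Once demo_ref_put() has returned true, demo_ref_get_not_zero() fails, so a
lookup that races with teardown falls through to the allocation path instead
of reviving a dying object.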

Fixes: 138470a9b2cc2 ("net/sched: act_ct: fix lockdep splat in tcf_ct_flow_table_get")
Reported-by: zdi-disclosures@trendmicro.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_ct.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 7d5e50c921a0..6158e13c98d3 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -328,9 +328,13 @@ static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params)
 	int err = -ENOMEM;

 	mutex_lock(&zones_mutex);
-	ct_ft = rhashtable_lookup_fast(&zones_ht, &key, zones_params);
-	if (ct_ft && refcount_inc_not_zero(&ct_ft->ref))
+	rcu_read_lock();
+	ct_ft = rhashtable_lookup(&zones_ht, &key, zones_params);
+	if (ct_ft && refcount_inc_not_zero(&ct_ft->ref)) {
+		rcu_read_unlock();
 		goto out_unlock;
+	}
+	rcu_read_unlock();

 	ct_ft = kzalloc_obj(*ct_ft);
 	if (!ct_ft)
-- 
2.34.1

* Re: [bug report] ipv4: icmp: fix null-ptr-deref in icmp_build_probe()
From: Fernando Fernandez Mancera @ 2026-04-10 11:19 UTC
  To: Dan Carpenter, Yiqi Sun; +Cc: Simon Horman, netdev
In-Reply-To: <00cba68f-2e37-4ad9-872b-cc41a113de00@suse.de>

On 4/10/26 12:51 PM, Fernando Fernandez Mancera wrote:
> On 4/10/26 12:16 PM, Dan Carpenter wrote:
>> Hello Yiqi Sun,
>>
>> Commit fde29fd93493 ("ipv4: icmp: fix null-ptr-deref in
>> icmp_build_probe()") from Apr 2, 2026 (linux-next), leads to the
>> following Smatch static checker warning:
>>
>>     net/ipv4/icmp.c:1351 icmp_build_probe()
>>     warn: 'dev' is not an error pointer
>>
>> net/ipv4/icmp.c
>>      1341 #if IS_ENABLED(CONFIG_IPV6)
>>      1342                 case ICMP_AFI_IP6:
>>      1343                         if (iio->ident.addr.ctype3_hdr.addrlen != sizeof(struct in6_addr))
>>      1344                                 goto send_mal_query;
>>      1345                         dev = ipv6_dev_find(net, &iio->ident.addr.ip_addr.ipv6_addr, dev);
>>      1346
>>      1347                         /*
>>      1348                          * If IPv6 identifier lookup is unavailable, silently
>>      1349                          * discard the request instead of misreporting NO_IF.
>>      1350                          */
>>      1350                          */
>> --> 1351                         if (IS_ERR(dev))
>>      1352                                 return false;
>>
>> It looks like there were two patches that went in around the same
>> time.  Commit fde29fd93493 ("ipv4: icmp: fix null-ptr-deref in
>> icmp_build_probe()") updated the checking for
>> ipv6_stub->ipv6_dev_find() but d98adfbdd5c0 ("ipv4: drop ipv6_stub usage
>> and use direct function calls") changed it to not return error pointers.
>>
>> This IS_ERR() check can be removed.
>>
> 
> Yes, I thought it was going to happen during merging but I guess it 
> makes sense to do it on a separate patch.
> 

Actually, I believe this has been handled during the net merge with 
net-next.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b6e39e48469e37057fce27a1b87cf6d3e456aa42

It should reach linux-next, so all good.

Thanks,
Fernando.

^ permalink raw reply

* [PATCH net-next 00/11] netfilter: updates for net-next
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo

Hi,

The following patchset contains Netfilter updates for *net-next*:

1-3) IPVS updates from Julian Anastasov to enhance visibility into
     IPVS internal state by exposing hash size, load factor etc. and
     to allow userspace to tune the load factor used for resizing hash
     tables.

4) Reject empty or non-NUL-terminated device names in xt_physdev.
   This isn't a bug fix; existing code doesn't require a c-string.
   But clean this up anyway because conceptually the interface name
   definitely should be a c-string.

5) Switch nfnetlink to skb_mac_header helpers that didn't exist back
   when this code was written.  This gives us additional debug checks
   but is not intended to change functionality.

6) Let the xt ttl/hoplimit match reject unknown operator modes.
   This is a cleanup, the evaluation function simply returns false when
   the mode is out of range.  From Marino Dzalto.

7) xt_socket match should enable defrag after all other checks. This
   bug is harmless: historically, defrag could not be disabled anyway,
   except by rmmod.

8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera.

9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old
   xtables 32bit compat code, from Gustavo A. R. Silva.

10) nftables fwd expression should drop packets when their ttl/hl has
    expired.  This is a deferred bug fix; it's not deemed important
    enough for -rc8.

11) Add additional checks before assuming the mac header is an ethernet
    header, from Zhengchuan Liang.


Please, pull these changes from:
The following changes since commit 42f9b4c6ef19e71d2c7d9bfd3c5037d4fe434ad7:

  tools: ynl: tests: fix leading space on Makefile target (2026-04-09 20:41:40 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next.git tags/nf-next-26-04-10

for you to fetch changes up to 62443dc21114c0bbc476fa62973db89743f2f137:

  netfilter: require Ethernet MAC header before using eth_hdr() (2026-04-10 12:16:27 +0200)

----------------------------------------------------------------
netfilter pull request nf-next-26-04-10

----------------------------------------------------------------

Fernando Fernandez Mancera (1):
  netfilter: conntrack: remove UDP-Lite conntrack support

Florian Westphal (4):
  netfilter: x_physdev: reject empty or not-nul terminated device names
  netfilter: nfnetlink: prefer skb_mac_header helpers
  netfilter: xt_socket: enable defrag after all other checks
  netfilter: nft_fwd_netdev: check ttl/hl before forwarding

Gustavo A. R. Silva (1):
  netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings

Julian Anastasov (3):
  ipvs: show the current conn_tab size to users
  ipvs: add ip_vs_status info
  ipvs: add conn_lfactor and svc_lfactor sysctl vars

Marino Dzalto (1):
  netfilter: xt_HL: add pr_fmt and checkentry validation

Zhengchuan Liang (1):
  netfilter: require Ethernet MAC header before using eth_hdr()

 Documentation/networking/ipvs-sysctl.rst      |  37 +++
 .../net/netfilter/ipv4/nf_conntrack_ipv4.h    |   3 -
 include/net/netfilter/nf_conntrack_l4proto.h  |   7 -
 net/ipv6/netfilter/ip6t_eui64.c               |   7 +-
 net/netfilter/Kconfig                         |  11 -
 net/netfilter/ipset/ip_set_bitmap_ipmac.c     |   5 +-
 net/netfilter/ipset/ip_set_hash_ipmac.c       |   9 +-
 net/netfilter/ipset/ip_set_hash_mac.c         |   5 +-
 net/netfilter/ipvs/ip_vs_ctl.c                | 247 +++++++++++++++++-
 net/netfilter/nf_conntrack_core.c             |   8 -
 net/netfilter/nf_conntrack_proto.c            |   3 -
 net/netfilter/nf_conntrack_proto_udp.c        | 108 --------
 net/netfilter/nf_conntrack_standalone.c       |   2 -
 net/netfilter/nf_log_syslog.c                 |   8 +-
 net/netfilter/nf_nat_core.c                   |   6 -
 net/netfilter/nf_nat_proto.c                  |  20 --
 net/netfilter/nfnetlink_cttimeout.c           |   1 -
 net/netfilter/nfnetlink_log.c                 |  19 +-
 net/netfilter/nfnetlink_queue.c               |  25 +-
 net/netfilter/nft_ct.c                        |   1 -
 net/netfilter/nft_fwd_netdev.c                |  10 +
 net/netfilter/x_tables.c                      |  12 +-
 net/netfilter/xt_hl.c                         |  27 ++
 net/netfilter/xt_mac.c                        |   4 +-
 net/netfilter/xt_physdev.c                    |  22 ++
 net/netfilter/xt_socket.c                     |  23 +-
 26 files changed, 399 insertions(+), 231 deletions(-)

-- 
2.52.0


^ permalink raw reply

* [PATCH net-next 01/11] ipvs: show the current conn_tab size to users
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Julian Anastasov <ja@ssi.bg>

As conn_tab is per-net, it is better to show the current hash table
size to users instead of ip_vs_conn_tab_size (the maximum).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/ipvs/ip_vs_ctl.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index a1f070cb76c3..1322dd54ed7c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -281,6 +281,20 @@ static void est_reload_work_handler(struct work_struct *work)
 	mutex_unlock(&ipvs->est_mutex);
 }
 
+static int get_conn_tab_size(struct netns_ipvs *ipvs)
+{
+	const struct ip_vs_rht *t;
+	int size = 0;
+
+	rcu_read_lock();
+	t = rcu_dereference(ipvs->conn_tab);
+	if (t)
+		size = t->size;
+	rcu_read_unlock();
+
+	return size;
+}
+
 int
 ip_vs_use_count_inc(void)
 {
@@ -2741,10 +2755,13 @@ static void ip_vs_info_seq_stop(struct seq_file *seq, void *v)
 
 static int ip_vs_info_seq_show(struct seq_file *seq, void *v)
 {
+	struct net *net = seq_file_net(seq);
+	struct netns_ipvs *ipvs = net_ipvs(net);
+
 	if (v == SEQ_START_TOKEN) {
 		seq_printf(seq,
 			"IP Virtual Server version %d.%d.%d (size=%d)\n",
-			NVERSION(IP_VS_VERSION_CODE), ip_vs_conn_tab_size);
+			NVERSION(IP_VS_VERSION_CODE), get_conn_tab_size(ipvs));
 		seq_puts(seq,
 			 "Prot LocalAddress:Port Scheduler Flags\n");
 		seq_puts(seq,
@@ -3425,7 +3442,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 		char buf[64];
 
 		sprintf(buf, "IP Virtual Server version %d.%d.%d (size=%d)",
-			NVERSION(IP_VS_VERSION_CODE), ip_vs_conn_tab_size);
+			NVERSION(IP_VS_VERSION_CODE), get_conn_tab_size(ipvs));
 		if (copy_to_user(user, buf, strlen(buf)+1) != 0) {
 			ret = -EFAULT;
 			goto out;
@@ -3437,8 +3454,9 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	case IP_VS_SO_GET_INFO:
 	{
 		struct ip_vs_getinfo info;
+
 		info.version = IP_VS_VERSION_CODE;
-		info.size = ip_vs_conn_tab_size;
+		info.size = get_conn_tab_size(ipvs);
 		info.num_services =
 			atomic_read(&ipvs->num_services[IP_VS_AF_INET]);
 		if (copy_to_user(user, &info, sizeof(info)) != 0)
@@ -4447,7 +4465,7 @@ static int ip_vs_genl_get_cmd(struct sk_buff *skb, struct genl_info *info)
 		if (nla_put_u32(msg, IPVS_INFO_ATTR_VERSION,
 				IP_VS_VERSION_CODE) ||
 		    nla_put_u32(msg, IPVS_INFO_ATTR_CONN_TAB_SIZE,
-				ip_vs_conn_tab_size))
+				get_conn_tab_size(ipvs)))
 			goto nla_put_failure;
 		break;
 	}
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 02/11] ipvs: add ip_vs_status info
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Julian Anastasov <ja@ssi.bg>

Add /proc/net/ip_vs_status to show the current state of IPVS.

The motivation for this new /proc interface is to give users the
output they need to decide when to tune the load factor for hash
tables, which is possible with the new sysctl knobs coming in a
follow-up patch.

The output also includes information for the kthreads used for stats.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/ipvs/ip_vs_ctl.c | 145 +++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
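
For readers skimming the diff below: the bucket statistics printed by
ip_vs_status_show() boil down to a small histogram. Each bucket's chain
length is clamped into one of eight slots, empty buckets are reported as
a share of all buckets, and non-empty lengths as a share of the
non-empty buckets. A minimal userspace sketch of that arithmetic,
assuming plain arrays instead of the RCU-protected hash tables;
NSLOTS, bucket_slot(), percent() and bucket_histogram() are
illustrative names and not part of the patch:

```c
#include <stddef.h>

#define NSLOTS 8u	/* mirrors ARRAY_SIZE(counts) in the patch */

/* Clamp a chain length so that all long chains share the last slot. */
static unsigned int bucket_slot(unsigned int chain_len)
{
	return chain_len < NSLOTS - 1 ? chain_len : NSLOTS - 1;
}

/* Integer percentage; a zero total is treated as 1 to avoid dividing by 0. */
static unsigned long percent(unsigned int part, unsigned int total)
{
	return (unsigned long)part * 100 / (total ? total : 1);
}

/* Fill counts[] from an array holding one chain length per bucket. */
static void bucket_histogram(const unsigned int *chain_len, size_t nbuckets,
			     unsigned int counts[NSLOTS])
{
	size_t i;

	for (i = 0; i < NSLOTS; i++)
		counts[i] = 0;
	for (i = 0; i < nbuckets; i++)
		counts[bucket_slot(chain_len[i])]++;
}
```

With chain lengths {0, 0, 1, 9}, two buckets land in slot 0, one in
slot 1 and one in the last slot; percent(2, 4) gives the share of empty
buckets and percent(1, 2) the share of len-1 buckets among the
non-empty ones.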

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 1322dd54ed7c..fb1df61edfdd 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2924,6 +2924,144 @@ static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
 
 	return 0;
 }
+
+static int ip_vs_status_show(struct seq_file *seq, void *v)
+{
+	struct net *net = seq_file_single_net(seq);
+	struct netns_ipvs *ipvs = net_ipvs(net);
+	unsigned int resched_score = 0;
+	struct ip_vs_conn_hnode *hn;
+	struct hlist_bl_head *head;
+	struct ip_vs_service *svc;
+	struct ip_vs_rht *t, *pt;
+	struct hlist_bl_node *e;
+	int old_gen, new_gen;
+	u32 counts[8];
+	u32 bucket;
+	int count;
+	u32 sum1;
+	u32 sum;
+	int i;
+
+	rcu_read_lock();
+
+	t = rcu_dereference(ipvs->conn_tab);
+
+	seq_printf(seq, "Conns:\t%d\n", atomic_read(&ipvs->conn_count));
+	seq_printf(seq, "Conn buckets:\t%d (%d bits, lfactor %d)\n",
+		   t ? t->size : 0, t ? t->bits : 0, t ? t->lfactor : 0);
+
+	if (!atomic_read(&ipvs->conn_count))
+		goto after_conns;
+	old_gen = atomic_read(&ipvs->conn_tab_changes);
+
+repeat_conn:
+	smp_rmb(); /* ipvs->conn_tab and conn_tab_changes */
+	memset(counts, 0, sizeof(counts));
+	ip_vs_rht_for_each_table_rcu(ipvs->conn_tab, t, pt) {
+		for (bucket = 0; bucket < t->size; bucket++) {
+			DECLARE_IP_VS_RHT_WALK_BUCKET_RCU();
+
+			count = 0;
+			resched_score++;
+			ip_vs_rht_walk_bucket_rcu(t, bucket, head) {
+				count = 0;
+				hlist_bl_for_each_entry_rcu(hn, e, head, node)
+					count++;
+			}
+			resched_score += count;
+			if (resched_score >= 100) {
+				resched_score = 0;
+				cond_resched_rcu();
+				new_gen = atomic_read(&ipvs->conn_tab_changes);
+				/* New table installed ? */
+				if (old_gen != new_gen) {
+					old_gen = new_gen;
+					goto repeat_conn;
+				}
+			}
+			counts[min(count, (int)ARRAY_SIZE(counts) - 1)]++;
+		}
+	}
+	for (sum = 0, i = 0; i < ARRAY_SIZE(counts); i++)
+		sum += counts[i];
+	sum1 = sum - counts[0];
+	seq_printf(seq, "Conn buckets empty:\t%u (%lu%%)\n",
+		   counts[0], (unsigned long)counts[0] * 100 / max(sum, 1U));
+	for (i = 1; i < ARRAY_SIZE(counts); i++) {
+		if (!counts[i])
+			continue;
+		seq_printf(seq, "Conn buckets len-%d:\t%u (%lu%%)\n",
+			   i, counts[i],
+			   (unsigned long)counts[i] * 100 / max(sum1, 1U));
+	}
+
+after_conns:
+	t = rcu_dereference(ipvs->svc_table);
+
+	count = ip_vs_get_num_services(ipvs);
+	seq_printf(seq, "Services:\t%d\n", count);
+	seq_printf(seq, "Service buckets:\t%d (%d bits, lfactor %d)\n",
+		   t ? t->size : 0, t ? t->bits : 0, t ? t->lfactor : 0);
+
+	if (!count)
+		goto after_svc;
+	old_gen = atomic_read(&ipvs->svc_table_changes);
+
+repeat_svc:
+	smp_rmb(); /* ipvs->svc_table and svc_table_changes */
+	memset(counts, 0, sizeof(counts));
+	ip_vs_rht_for_each_table_rcu(ipvs->svc_table, t, pt) {
+		for (bucket = 0; bucket < t->size; bucket++) {
+			DECLARE_IP_VS_RHT_WALK_BUCKET_RCU();
+
+			count = 0;
+			resched_score++;
+			ip_vs_rht_walk_bucket_rcu(t, bucket, head) {
+				count = 0;
+				hlist_bl_for_each_entry_rcu(svc, e, head,
+							    s_list)
+					count++;
+			}
+			resched_score += count;
+			if (resched_score >= 100) {
+				resched_score = 0;
+				cond_resched_rcu();
+				new_gen = atomic_read(&ipvs->svc_table_changes);
+				/* New table installed ? */
+				if (old_gen != new_gen) {
+					old_gen = new_gen;
+					goto repeat_svc;
+				}
+			}
+			counts[min(count, (int)ARRAY_SIZE(counts) - 1)]++;
+		}
+	}
+	for (sum = 0, i = 0; i < ARRAY_SIZE(counts); i++)
+		sum += counts[i];
+	sum1 = sum - counts[0];
+	seq_printf(seq, "Service buckets empty:\t%u (%lu%%)\n",
+		   counts[0], (unsigned long)counts[0] * 100 / max(sum, 1U));
+	for (i = 1; i < ARRAY_SIZE(counts); i++) {
+		if (!counts[i])
+			continue;
+		seq_printf(seq, "Service buckets len-%d:\t%u (%lu%%)\n",
+			   i, counts[i],
+			   (unsigned long)counts[i] * 100 / max(sum1, 1U));
+	}
+
+after_svc:
+	seq_printf(seq, "Stats thread slots:\t%d (max %lu)\n",
+		   ipvs->est_kt_count, ipvs->est_max_threads);
+	seq_printf(seq, "Stats chain max len:\t%d\n", ipvs->est_chain_max);
+	seq_printf(seq, "Stats thread ests:\t%d\n",
+		   ipvs->est_chain_max * IPVS_EST_CHAIN_FACTOR *
+		   IPVS_EST_NTICKS);
+
+	rcu_read_unlock();
+	return 0;
+}
+
 #endif
 
 /*
@@ -4825,6 +4963,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 				    ipvs->net->proc_net,
 				    ip_vs_stats_percpu_show, NULL))
 		goto err_percpu;
+	if (!proc_create_net_single("ip_vs_status", 0, ipvs->net->proc_net,
+				    ip_vs_status_show, NULL))
+		goto err_status;
 #endif
 
 	ret = ip_vs_control_net_init_sysctl(ipvs);
@@ -4835,6 +4976,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 
 err:
 #ifdef CONFIG_PROC_FS
+	remove_proc_entry("ip_vs_status", ipvs->net->proc_net);
+
+err_status:
 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
 
 err_percpu:
@@ -4860,6 +5004,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
 	ip_vs_control_net_cleanup_sysctl(ipvs);
 	cancel_delayed_work_sync(&ipvs->est_reload_work);
 #ifdef CONFIG_PROC_FS
+	remove_proc_entry("ip_vs_status", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
 	remove_proc_entry("ip_vs", ipvs->net->proc_net);
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 03/11] ipvs: add conn_lfactor and svc_lfactor sysctl vars
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Julian Anastasov <ja@ssi.bg>

Allow the default load factor for the connection and service tables
to be configured.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Documentation/networking/ipvs-sysctl.rst | 37 ++++++++++++
 net/netfilter/ipvs/ip_vs_ctl.c           | 76 ++++++++++++++++++++++++
 2 files changed, 113 insertions(+)
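
The shift-count semantics documented in the ipvs-sysctl.rst hunk below
can be checked with a few lines of userspace C. This is only a sketch
of the arithmetic described there; lfactor_valid() and
lfactor_buckets() are hypothetical helper names, and the lower and
upper limits on the table size mentioned in the documentation are
deliberately left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* The sysctl handlers in this patch accept values from -8 to 8. */
static bool lfactor_valid(int lfactor)
{
	return lfactor >= -8 && lfactor <= 8;
}

/*
 * Bucket count implied by a load factor for a given number of hash
 * nodes: negative values enlarge the table, positive values shrink it.
 */
static uint64_t lfactor_buckets(uint64_t nodes, int lfactor)
{
	if (lfactor < 0)
		return nodes << -lfactor;
	return nodes >> lfactor;
}
```

For 1000 nodes, a value of -4 asks for 16000 buckets and a value of 2
for 250 buckets, which matches the two examples in the documentation.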

diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 3fb5fa142eef..a556439f8be7 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -29,6 +29,33 @@ backup_only - BOOLEAN
 	If set, disable the director function while the server is
 	in backup mode to avoid packet loops for DR/TUN methods.
 
+conn_lfactor - INTEGER
+	Possible values: -8 (larger table) .. 8 (smaller table)
+
+	Default: -4
+
+	Controls the sizing of the connection hash table based on the
+	load factor (number of connections per table buckets):
+
+		2^conn_lfactor = nodes / buckets
+
+	As result, the table grows if load increases and shrinks when
+	load decreases in the range of 2^8 - 2^conn_tab_bits (module
+	parameter).
+	The value is a shift count where negative values select
+	buckets = (connection hash nodes << -value) while positive
+	values select buckets = (connection hash nodes >> value). The
+	negative values reduce the collisions and reduce the time for
+	lookups but increase the table size. Positive values will
+	tolerate load above 100% when using smaller table is
+	preferred with the cost of more collisions. If using NAT
+	connections consider decreasing the value with one because
+	they add two nodes in the hash table.
+
+	Example:
+	-4: grow if load goes above 6% (buckets = nodes * 16)
+	2: grow if load goes above 400% (buckets = nodes / 4)
+
 conn_reuse_mode - INTEGER
 	1 - default
 
@@ -219,6 +246,16 @@ secure_tcp - INTEGER
 	The value definition is the same as that of drop_entry and
 	drop_packet.
 
+svc_lfactor - INTEGER
+	Possible values: -8 (larger table) .. 8 (smaller table)
+
+	Default: -3
+
+	Controls the sizing of the service hash table based on the
+	load factor (number of services per table buckets). The table
+	will grow and shrink in the range of 2^4 - 2^20.
+	See conn_lfactor for explanation.
+
 sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period
 	default 3 50
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index fb1df61edfdd..6632daa87ded 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2445,6 +2445,60 @@ static int ipvs_proc_run_estimation(const struct ctl_table *table, int write,
 	return ret;
 }
 
+static int ipvs_proc_conn_lfactor(const struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		if (val < -8 || val > 8) {
+			ret = -EINVAL;
+		} else {
+			*valp = val;
+			if (rcu_access_pointer(ipvs->conn_tab))
+				mod_delayed_work(system_unbound_wq,
+						 &ipvs->conn_resize_work, 0);
+		}
+	}
+	return ret;
+}
+
+static int ipvs_proc_svc_lfactor(const struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct netns_ipvs *ipvs = table->extra2;
+	int *valp = table->data;
+	int val = *valp;
+	int ret;
+
+	struct ctl_table tmp_table = {
+		.data = &val,
+		.maxlen = sizeof(int),
+	};
+
+	ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+	if (write && ret >= 0) {
+		if (val < -8 || val > 8) {
+			ret = -EINVAL;
+		} else {
+			*valp = val;
+			if (rcu_access_pointer(ipvs->svc_table))
+				mod_delayed_work(system_unbound_wq,
+						 &ipvs->svc_resize_work, 0);
+		}
+	}
+	return ret;
+}
+
 /*
  *	IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
  *	Do not change order or insert new entries without
@@ -2633,6 +2687,18 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= ipvs_proc_est_nice,
 	},
+	{
+		.procname	= "conn_lfactor",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_conn_lfactor,
+	},
+	{
+		.procname	= "svc_lfactor",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= ipvs_proc_svc_lfactor,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -4853,6 +4919,16 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx].extra2 = ipvs;
 	tbl[idx++].data = &ipvs->sysctl_est_nice;
 
+	if (unpriv)
+		tbl[idx].mode = 0444;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_conn_lfactor;
+
+	if (unpriv)
+		tbl[idx].mode = 0444;
+	tbl[idx].extra2 = ipvs;
+	tbl[idx++].data = &ipvs->sysctl_svc_lfactor;
+
 #ifdef CONFIG_IP_VS_DEBUG
 	/* Global sysctls must be ro in non-init netns */
 	if (!net_eq(net, &init_net))
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 04/11] netfilter: x_physdev: reject empty or not-nul terminated device names
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

Reject names that lack a \0 character and reject the empty string as
well. iptables allows this but then fails to re-parse iptables-save
output that contains such rules.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/xt_physdev.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
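
The two conditions can be tried outside the kernel with a short
sketch. It assumes a fixed-size name buffer as in the xt_physdev info
structure and uses memchr() where the patch uses strnlen(); both
answer the same question, namely whether a NUL byte occurs within the
buffer. check_ifname() is a hypothetical name used only for
illustration:

```c
#include <errno.h>
#include <string.h>

/*
 * Validate a fixed-size name buffer.
 * Returns -EINVAL for an empty name, -ENAMETOOLONG when no NUL byte
 * is found within the buffer, and 0 otherwise.
 */
static int check_ifname(const char *buf, size_t size)
{
	if (size == 0 || buf[0] == '\0')
		return -EINVAL;
	if (memchr(buf, '\0', size) == NULL)
		return -ENAMETOOLONG;
	return 0;
}
```

Note that the patch applies the empty-string check only to names that
are selected in the bitmask, while the mask buffers are only checked
for termination.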

diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 343e65f377d4..53997771013f 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -107,6 +107,28 @@ static int physdev_mt_check(const struct xt_mtchk_param *par)
 		return -EINVAL;
 	}
 
+#define X(memb) strnlen(info->memb, sizeof(info->memb)) >= sizeof(info->memb)
+	if (info->bitmask & XT_PHYSDEV_OP_IN) {
+		if (info->physindev[0] == '\0')
+			return -EINVAL;
+		if (X(physindev))
+			return -ENAMETOOLONG;
+	}
+
+	if (info->bitmask & XT_PHYSDEV_OP_OUT) {
+		if (info->physoutdev[0] == '\0')
+			return -EINVAL;
+
+		if (X(physoutdev))
+			return -ENAMETOOLONG;
+	}
+
+	if (X(in_mask))
+		return -ENAMETOOLONG;
+	if (X(out_mask))
+		return -ENAMETOOLONG;
+#undef X
+
 	if (!brnf_probed) {
 		brnf_probed = true;
 		request_module("br_netfilter");
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 05/11] netfilter: nfnetlink: prefer skb_mac_header helpers
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

This adds implicit DEBUG_NET_WARN_ON_ONCE checks for debug configurations.
No other changes intended.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nfnetlink_log.c   | 19 ++++++++++---------
 net/netfilter/nfnetlink_queue.c | 25 ++++++++++++-------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index b2c24cb919d4..2439cbbd5b26 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -401,7 +401,7 @@ nfulnl_timer(struct timer_list *t)
 
 static u32 nfulnl_get_bridge_size(const struct sk_buff *skb)
 {
-	u32 size = 0;
+	u32 mac_len, size = 0;
 
 	if (!skb_mac_header_was_set(skb))
 		return 0;
@@ -412,14 +412,17 @@ static u32 nfulnl_get_bridge_size(const struct sk_buff *skb)
 		size += nla_total_size(sizeof(u16)); /* tag */
 	}
 
-	if (skb->network_header > skb->mac_header)
-		size += nla_total_size(skb->network_header - skb->mac_header);
+	mac_len = skb_mac_header_len(skb);
+	if (mac_len > 0)
+		size += nla_total_size(mac_len);
 
 	return size;
 }
 
 static int nfulnl_put_bridge(struct nfulnl_instance *inst, const struct sk_buff *skb)
 {
+	u32 mac_len;
+
 	if (!skb_mac_header_was_set(skb))
 		return 0;
 
@@ -437,12 +440,10 @@ static int nfulnl_put_bridge(struct nfulnl_instance *inst, const struct sk_buff
 		nla_nest_end(inst->skb, nest);
 	}
 
-	if (skb->mac_header < skb->network_header) {
-		int len = (int)(skb->network_header - skb->mac_header);
-
-		if (nla_put(inst->skb, NFULA_L2HDR, len, skb_mac_header(skb)))
-			goto nla_put_failure;
-	}
+	mac_len = skb_mac_header_len(skb);
+	if (mac_len > 0 &&
+	    nla_put(inst->skb, NFULA_L2HDR, mac_len, skb_mac_header(skb)))
+		goto nla_put_failure;
 
 	return 0;
 
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index c7ee6f6ff725..58304fd1f70f 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -579,6 +579,7 @@ static u32 nfqnl_get_bridge_size(struct nf_queue_entry *entry)
 {
 	struct sk_buff *entskb = entry->skb;
 	u32 nlalen = 0;
+	u32 mac_len;
 
 	if (entry->state.pf != PF_BRIDGE || !skb_mac_header_was_set(entskb))
 		return 0;
@@ -587,9 +588,9 @@ static u32 nfqnl_get_bridge_size(struct nf_queue_entry *entry)
 		nlalen += nla_total_size(nla_total_size(sizeof(__be16)) +
 					 nla_total_size(sizeof(__be16)));
 
-	if (entskb->network_header > entskb->mac_header)
-		nlalen += nla_total_size((entskb->network_header -
-					  entskb->mac_header));
+	mac_len = skb_mac_header_len(entskb);
+	if (mac_len > 0)
+		nlalen += nla_total_size(mac_len);
 
 	return nlalen;
 }
@@ -597,6 +598,7 @@ static u32 nfqnl_get_bridge_size(struct nf_queue_entry *entry)
 static int nfqnl_put_bridge(struct nf_queue_entry *entry, struct sk_buff *skb)
 {
 	struct sk_buff *entskb = entry->skb;
+	u32 mac_len;
 
 	if (entry->state.pf != PF_BRIDGE || !skb_mac_header_was_set(entskb))
 		return 0;
@@ -615,12 +617,10 @@ static int nfqnl_put_bridge(struct nf_queue_entry *entry, struct sk_buff *skb)
 		nla_nest_end(skb, nest);
 	}
 
-	if (entskb->mac_header < entskb->network_header) {
-		int len = (int)(entskb->network_header - entskb->mac_header);
-
-		if (nla_put(skb, NFQA_L2HDR, len, skb_mac_header(entskb)))
-			goto nla_put_failure;
-	}
+	mac_len = skb_mac_header_len(entskb);
+	if (mac_len > 0 &&
+	    nla_put(skb, NFQA_L2HDR, mac_len, skb_mac_header(entskb)))
+		goto nla_put_failure;
 
 	return 0;
 
@@ -1004,13 +1004,13 @@ nf_queue_entry_dup(struct nf_queue_entry *e)
 static void nf_bridge_adjust_skb_data(struct sk_buff *skb)
 {
 	if (nf_bridge_info_get(skb))
-		__skb_push(skb, skb->network_header - skb->mac_header);
+		__skb_push(skb, skb_mac_header_len(skb));
 }
 
 static void nf_bridge_adjust_segmented_data(struct sk_buff *skb)
 {
 	if (nf_bridge_info_get(skb))
-		__skb_pull(skb, skb->network_header - skb->mac_header);
+		__skb_pull(skb, skb_mac_header_len(skb));
 }
 #else
 #define nf_bridge_adjust_skb_data(s) do {} while (0)
@@ -1469,8 +1469,7 @@ static int nfqa_parse_bridge(struct nf_queue_entry *entry,
 	}
 
 	if (nfqa[NFQA_L2HDR]) {
-		int mac_header_len = entry->skb->network_header -
-			entry->skb->mac_header;
+		u32 mac_header_len = skb_mac_header_len(entry->skb);
 
 		if (mac_header_len != nla_len(nfqa[NFQA_L2HDR]))
 			return -EINVAL;
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 06/11] netfilter: xt_HL: add pr_fmt and checkentry validation
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Marino Dzalto <marino.dzalto@gmail.com>

Add pr_fmt to prefix log messages with the module name for
easier debugging in dmesg.

Add checkentry functions for IPv4 (ttl_mt_check) and IPv6
(hl_mt6_check) to validate the match mode at rule registration
time, rejecting invalid modes with -EINVAL.

The evaluation function returns false in case the mode is
unknown, so this is a cleanup, not a bug fix.

Signed-off-by: Marino Dzalto <marino.dzalto@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/xt_hl.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/net/netfilter/xt_hl.c b/net/netfilter/xt_hl.c
index c1a70f8f0441..4a12a757ecbf 100644
--- a/net/netfilter/xt_hl.c
+++ b/net/netfilter/xt_hl.c
@@ -6,6 +6,7 @@
  * Hop Limit matching module
  * (C) 2001-2002 Maciej Soltysiak <solt@dns.toxicfilms.tv>
  */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/ip.h>
 #include <linux/ipv6.h>
@@ -22,6 +23,18 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS("ipt_ttl");
 MODULE_ALIAS("ip6t_hl");
 
+static int ttl_mt_check(const struct xt_mtchk_param *par)
+{
+	const struct ipt_ttl_info *info = par->matchinfo;
+
+	if (info->mode > IPT_TTL_GT) {
+		pr_err("Unknown TTL match mode: %d\n", info->mode);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static bool ttl_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct ipt_ttl_info *info = par->matchinfo;
@@ -41,6 +54,18 @@ static bool ttl_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	return false;
 }
 
+static int hl_mt6_check(const struct xt_mtchk_param *par)
+{
+	const struct ip6t_hl_info *info = par->matchinfo;
+
+	if (info->mode > IP6T_HL_GT) {
+		pr_err("Unknown Hop Limit match mode: %d\n", info->mode);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static bool hl_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct ip6t_hl_info *info = par->matchinfo;
@@ -65,6 +90,7 @@ static struct xt_match hl_mt_reg[] __read_mostly = {
 		.name       = "ttl",
 		.revision   = 0,
 		.family     = NFPROTO_IPV4,
+		.checkentry = ttl_mt_check,
 		.match      = ttl_mt,
 		.matchsize  = sizeof(struct ipt_ttl_info),
 		.me         = THIS_MODULE,
@@ -73,6 +99,7 @@ static struct xt_match hl_mt_reg[] __read_mostly = {
 		.name       = "hl",
 		.revision   = 0,
 		.family     = NFPROTO_IPV6,
+		.checkentry = hl_mt6_check,
 		.match      = hl_mt6,
 		.matchsize  = sizeof(struct ip6t_hl_info),
 		.me         = THIS_MODULE,
-- 
2.52.0


^ permalink raw reply related

* [PATCH net-next 07/11] netfilter: xt_socket: enable defrag after all other checks
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

Originally this did not matter because defrag was enabled once per netns
and only disabled again on netns dismantle.  When this got changed I should
have adjusted checkentry to not leave defrag enabled on error.

Fixes: de8c12110a13 ("netfilter: disable defrag once its no longer needed")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/xt_socket.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)
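
The reordering follows a common pattern: validate first, take the
side effect last, so that no error path has to undo anything. A
minimal userspace sketch of the idea, assuming a simple counter in
place of the per-netns defrag users; KNOWN_FLAGS, defrag_users,
enable_defrag() and check_rule() are hypothetical names and do not
exist in xt_socket.c:

```c
#include <errno.h>

#define KNOWN_FLAGS 0x7u	/* hypothetical set of supported flags */

static int defrag_users;	/* stands in for the defrag user count */

/* Side effect that would need an undo step if a later check failed. */
static int enable_defrag(void)
{
	defrag_users++;
	return 0;
}

/* Validate the rule first; only a valid rule reaches enable_defrag(). */
static int check_rule(unsigned int flags)
{
	if (flags & ~KNOWN_FLAGS)
		return -EINVAL;

	return enable_defrag();
}
```

A rule with an unknown flag is rejected with the counter still at
zero, whereas calling enable_defrag() before the flag check would
leave it incremented on the error path.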

diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 76e01f292aaf..811e53bee408 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -168,52 +168,41 @@ static int socket_mt_enable_defrag(struct net *net, int family)
 static int socket_mt_v1_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo1 *info = (struct xt_socket_mtinfo1 *) par->matchinfo;
-	int err;
-
-	err = socket_mt_enable_defrag(par->net, par->family);
-	if (err)
-		return err;
 
 	if (info->flags & ~XT_SOCKET_FLAGS_V1) {
 		pr_info_ratelimited("unknown flags 0x%x\n",
 				    info->flags & ~XT_SOCKET_FLAGS_V1);
 		return -EINVAL;
 	}
-	return 0;
+
+	return socket_mt_enable_defrag(par->net, par->family);
 }
 
 static int socket_mt_v2_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo2 *info = (struct xt_socket_mtinfo2 *) par->matchinfo;
-	int err;
-
-	err = socket_mt_enable_defrag(par->net, par->family);
-	if (err)
-		return err;
 
 	if (info->flags & ~XT_SOCKET_FLAGS_V2) {
 		pr_info_ratelimited("unknown flags 0x%x\n",
 				    info->flags & ~XT_SOCKET_FLAGS_V2);
 		return -EINVAL;
 	}
-	return 0;
+
+	return socket_mt_enable_defrag(par->net, par->family);
 }
 
 static int socket_mt_v3_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo3 *info =
 				    (struct xt_socket_mtinfo3 *)par->matchinfo;
-	int err;
 
-	err = socket_mt_enable_defrag(par->net, par->family);
-	if (err)
-		return err;
 	if (info->flags & ~XT_SOCKET_FLAGS_V3) {
 		pr_info_ratelimited("unknown flags 0x%x\n",
 				    info->flags & ~XT_SOCKET_FLAGS_V3);
 		return -EINVAL;
 	}
-	return 0;
+
+	return socket_mt_enable_defrag(par->net, par->family);
 }
 
 static void socket_mt_destroy(const struct xt_mtdtor_param *par)
-- 
2.52.0
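
For readers skimming the diff, the effect of the reordering can be modeled in userspace. This is only a sketch: `defrag_users`, `enable_defrag()`, `check_old()`, `check_new()` and `KNOWN_FLAGS` are hypothetical stand-ins, not the kernel symbols, and it assumes (as the commit message implies) that nothing undoes the enable when checkentry fails.

```c
#include <errno.h>

/* Hypothetical stand-in for the per-netns defrag user count. */
static int defrag_users;

/* Illustrative mask only, not the real XT_SOCKET_FLAGS_* values. */
#define KNOWN_FLAGS 0x7u

static int enable_defrag(void)
{
	defrag_users++;
	return 0;
}

/* Old ordering: defrag is enabled before the flags are validated. */
static int check_old(unsigned int flags)
{
	int err = enable_defrag();

	if (err)
		return err;
	if (flags & ~KNOWN_FLAGS)
		return -EINVAL;	/* defrag_users stays incremented */
	return 0;
}

/* New ordering: validate first, enable defrag as the last step. */
static int check_new(unsigned int flags)
{
	if (flags & ~KNOWN_FLAGS)
		return -EINVAL;
	return enable_defrag();
}
```

With an unknown flag, check_old() fails but leaves the counter raised, while check_new() fails without touching it.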



* [PATCH net-next 08/11] netfilter: conntrack: remove UDP-Lite conntrack support
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Fernando Fernandez Mancera <fmancera@suse.de>

UDP-Lite (RFC 3828) socket support was recently retired from the core
networking stack. As a follow-up, drop the connection tracker
and NAT support for UDP-Lite in Netfilter.

This patch removes CONFIG_NF_CT_PROTO_UDPLITE and scrubs UDP-Lite
awareness from the conntrack core, NAT core, nft_ct, and ctnetlink.
Please note that stateless packet inspection, matching, ipsets or
logging support for IPPROTO_UDPLITE is preserved.

As conntrack no longer extracts UDP-Lite ports or tracks its L4 state,
the UDP-Lite checksum can no longer be updated when performing NAT.
That is an expected and acceptable consequence of removing the UDP-Lite
conntrack module.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../net/netfilter/ipv4/nf_conntrack_ipv4.h    |   3 -
 include/net/netfilter/nf_conntrack_l4proto.h  |   7 --
 net/netfilter/Kconfig                         |  11 --
 net/netfilter/nf_conntrack_core.c             |   8 --
 net/netfilter/nf_conntrack_proto.c            |   3 -
 net/netfilter/nf_conntrack_proto_udp.c        | 108 ------------------
 net/netfilter/nf_conntrack_standalone.c       |   2 -
 net/netfilter/nf_nat_core.c                   |   6 -
 net/netfilter/nf_nat_proto.c                  |  20 ----
 net/netfilter/nfnetlink_cttimeout.c           |   1 -
 net/netfilter/nft_ct.c                        |   1 -
 11 files changed, 170 deletions(-)

diff --git a/include/net/netfilter/ipv4/nf_conntrack_ipv4.h b/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
index 8d65ffbf57de..b39417ad955e 100644
--- a/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
+++ b/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
@@ -16,9 +16,6 @@ extern const struct nf_conntrack_l4proto nf_conntrack_l4proto_icmp;
 #ifdef CONFIG_NF_CT_PROTO_SCTP
 extern const struct nf_conntrack_l4proto nf_conntrack_l4proto_sctp;
 #endif
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-extern const struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite;
-#endif
 #ifdef CONFIG_NF_CT_PROTO_GRE
 extern const struct nf_conntrack_l4proto nf_conntrack_l4proto_gre;
 #endif
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index cd5020835a6d..fde2427ceb8f 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -107,11 +107,6 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 			    unsigned int dataoff,
 			    enum ip_conntrack_info ctinfo,
 			    const struct nf_hook_state *state);
-int nf_conntrack_udplite_packet(struct nf_conn *ct,
-				struct sk_buff *skb,
-				unsigned int dataoff,
-				enum ip_conntrack_info ctinfo,
-				const struct nf_hook_state *state);
 int nf_conntrack_tcp_packet(struct nf_conn *ct,
 			    struct sk_buff *skb,
 			    unsigned int dataoff,
@@ -139,8 +134,6 @@ void nf_conntrack_icmpv6_init_net(struct net *net);
 /* Existing built-in generic protocol */
 extern const struct nf_conntrack_l4proto nf_conntrack_l4proto_generic;
 
-#define MAX_NF_CT_PROTO IPPROTO_UDPLITE
-
 const struct nf_conntrack_l4proto *nf_ct_l4proto_find(u8 l4proto);
 
 /* Generic netlink helpers */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index f3ea0cb26f36..682c675125fc 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -209,17 +209,6 @@ config NF_CT_PROTO_SCTP
 
 	  If unsure, say Y.
 
-config NF_CT_PROTO_UDPLITE
-	bool 'UDP-Lite protocol connection tracking support'
-	depends on NETFILTER_ADVANCED
-	default y
-	help
-	  With this option enabled, the layer 3 independent connection
-	  tracking code will be able to do state tracking on UDP-Lite
-	  connections.
-
-	  If unsure, say Y.
-
 config NF_CONNTRACK_AMANDA
 	tristate "Amanda backup protocol support"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 27ce5fda8993..b08189226320 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -323,9 +323,6 @@ nf_ct_get_tuple(const struct sk_buff *skb,
 #endif
 	case IPPROTO_TCP:
 	case IPPROTO_UDP:
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-	case IPPROTO_UDPLITE:
-#endif
 #ifdef CONFIG_NF_CT_PROTO_SCTP
 	case IPPROTO_SCTP:
 #endif
@@ -1987,11 +1984,6 @@ static int nf_conntrack_handle_packet(struct nf_conn *ct,
 	case IPPROTO_ICMPV6:
 		return nf_conntrack_icmpv6_packet(ct, skb, ctinfo, state);
 #endif
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-	case IPPROTO_UDPLITE:
-		return nf_conntrack_udplite_packet(ct, skb, dataoff,
-						   ctinfo, state);
-#endif
 #ifdef CONFIG_NF_CT_PROTO_SCTP
 	case IPPROTO_SCTP:
 		return nf_conntrack_sctp_packet(ct, skb, dataoff,
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index bc1d96686b9c..50ddd3d613e1 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -103,9 +103,6 @@ const struct nf_conntrack_l4proto *nf_ct_l4proto_find(u8 l4proto)
 #ifdef CONFIG_NF_CT_PROTO_SCTP
 	case IPPROTO_SCTP: return &nf_conntrack_l4proto_sctp;
 #endif
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-	case IPPROTO_UDPLITE: return &nf_conntrack_l4proto_udplite;
-#endif
 #ifdef CONFIG_NF_CT_PROTO_GRE
 	case IPPROTO_GRE: return &nf_conntrack_l4proto_gre;
 #endif
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 0030fbe8885c..cc9b7e5e1935 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -129,91 +129,6 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 	return NF_ACCEPT;
 }
 
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-static void udplite_error_log(const struct sk_buff *skb,
-			      const struct nf_hook_state *state,
-			      const char *msg)
-{
-	nf_l4proto_log_invalid(skb, state, IPPROTO_UDPLITE, "%s", msg);
-}
-
-static bool udplite_error(struct sk_buff *skb,
-			  unsigned int dataoff,
-			  const struct nf_hook_state *state)
-{
-	unsigned int udplen = skb->len - dataoff;
-	const struct udphdr *hdr;
-	struct udphdr _hdr;
-	unsigned int cscov;
-
-	/* Header is too small? */
-	hdr = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr);
-	if (!hdr) {
-		udplite_error_log(skb, state, "short packet");
-		return true;
-	}
-
-	cscov = ntohs(hdr->len);
-	if (cscov == 0) {
-		cscov = udplen;
-	} else if (cscov < sizeof(*hdr) || cscov > udplen) {
-		udplite_error_log(skb, state, "invalid checksum coverage");
-		return true;
-	}
-
-	/* UDPLITE mandates checksums */
-	if (!hdr->check) {
-		udplite_error_log(skb, state, "checksum missing");
-		return true;
-	}
-
-	/* Checksum invalid? Ignore. */
-	if (state->hook == NF_INET_PRE_ROUTING &&
-	    state->net->ct.sysctl_checksum &&
-	    nf_checksum_partial(skb, state->hook, dataoff, cscov, IPPROTO_UDP,
-				state->pf)) {
-		udplite_error_log(skb, state, "bad checksum");
-		return true;
-	}
-
-	return false;
-}
-
-/* Returns verdict for packet, and may modify conntracktype */
-int nf_conntrack_udplite_packet(struct nf_conn *ct,
-				struct sk_buff *skb,
-				unsigned int dataoff,
-				enum ip_conntrack_info ctinfo,
-				const struct nf_hook_state *state)
-{
-	unsigned int *timeouts;
-
-	if (udplite_error(skb, dataoff, state))
-		return -NF_ACCEPT;
-
-	timeouts = nf_ct_timeout_lookup(ct);
-	if (!timeouts)
-		timeouts = udp_get_timeouts(nf_ct_net(ct));
-
-	/* If we've seen traffic both ways, this is some kind of UDP
-	   stream.  Extend timeout. */
-	if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
-		nf_ct_refresh_acct(ct, ctinfo, skb,
-				   timeouts[UDP_CT_REPLIED]);
-
-		if (unlikely((ct->status & IPS_NAT_CLASH)))
-			return NF_ACCEPT;
-
-		/* Also, more likely to be important, and not a probe */
-		if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
-			nf_conntrack_event_cache(IPCT_ASSURED, ct);
-	} else {
-		nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
-	}
-	return NF_ACCEPT;
-}
-#endif
-
 #ifdef CONFIG_NF_CONNTRACK_TIMEOUT
 
 #include <linux/netfilter/nfnetlink.h>
@@ -299,26 +214,3 @@ const struct nf_conntrack_l4proto nf_conntrack_l4proto_udp =
 	},
 #endif /* CONFIG_NF_CONNTRACK_TIMEOUT */
 };
-
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-const struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite =
-{
-	.l4proto		= IPPROTO_UDPLITE,
-	.allow_clash		= true,
-#if IS_ENABLED(CONFIG_NF_CT_NETLINK)
-	.tuple_to_nlattr	= nf_ct_port_tuple_to_nlattr,
-	.nlattr_to_tuple	= nf_ct_port_nlattr_to_tuple,
-	.nlattr_tuple_size	= nf_ct_port_nlattr_tuple_size,
-	.nla_policy		= nf_ct_port_nla_policy,
-#endif
-#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
-	.ctnl_timeout		= {
-		.nlattr_to_obj	= udp_timeout_nlattr_to_obj,
-		.obj_to_nlattr	= udp_timeout_obj_to_nlattr,
-		.nlattr_max	= CTA_TIMEOUT_UDP_MAX,
-		.obj_size	= sizeof(unsigned int) * CTA_TIMEOUT_UDP_MAX,
-		.nla_policy	= udp_timeout_nla_policy,
-	},
-#endif /* CONFIG_NF_CONNTRACK_TIMEOUT */
-};
-#endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 207b240b14e5..be2953c7d702 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -61,7 +61,6 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
 			   ntohs(tuple->src.u.tcp.port),
 			   ntohs(tuple->dst.u.tcp.port));
 		break;
-	case IPPROTO_UDPLITE:
 	case IPPROTO_UDP:
 		seq_printf(s, "sport=%hu dport=%hu ",
 			   ntohs(tuple->src.u.udp.port),
@@ -277,7 +276,6 @@ static const char* l4proto_name(u16 proto)
 	case IPPROTO_UDP: return "udp";
 	case IPPROTO_GRE: return "gre";
 	case IPPROTO_SCTP: return "sctp";
-	case IPPROTO_UDPLITE: return "udplite";
 	case IPPROTO_ICMPV6: return "icmpv6";
 	}
 
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 3b5434e4ec9c..83b2b5e9759a 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -68,7 +68,6 @@ static void nf_nat_ipv4_decode_session(struct sk_buff *skb,
 		fl4->daddr = t->dst.u3.ip;
 		if (t->dst.protonum == IPPROTO_TCP ||
 		    t->dst.protonum == IPPROTO_UDP ||
-		    t->dst.protonum == IPPROTO_UDPLITE ||
 		    t->dst.protonum == IPPROTO_SCTP)
 			fl4->fl4_dport = t->dst.u.all;
 	}
@@ -79,7 +78,6 @@ static void nf_nat_ipv4_decode_session(struct sk_buff *skb,
 		fl4->saddr = t->src.u3.ip;
 		if (t->dst.protonum == IPPROTO_TCP ||
 		    t->dst.protonum == IPPROTO_UDP ||
-		    t->dst.protonum == IPPROTO_UDPLITE ||
 		    t->dst.protonum == IPPROTO_SCTP)
 			fl4->fl4_sport = t->src.u.all;
 	}
@@ -99,7 +97,6 @@ static void nf_nat_ipv6_decode_session(struct sk_buff *skb,
 		fl6->daddr = t->dst.u3.in6;
 		if (t->dst.protonum == IPPROTO_TCP ||
 		    t->dst.protonum == IPPROTO_UDP ||
-		    t->dst.protonum == IPPROTO_UDPLITE ||
 		    t->dst.protonum == IPPROTO_SCTP)
 			fl6->fl6_dport = t->dst.u.all;
 	}
@@ -110,7 +107,6 @@ static void nf_nat_ipv6_decode_session(struct sk_buff *skb,
 		fl6->saddr = t->src.u3.in6;
 		if (t->dst.protonum == IPPROTO_TCP ||
 		    t->dst.protonum == IPPROTO_UDP ||
-		    t->dst.protonum == IPPROTO_UDPLITE ||
 		    t->dst.protonum == IPPROTO_SCTP)
 			fl6->fl6_sport = t->src.u.all;
 	}
@@ -415,7 +411,6 @@ static bool l4proto_in_range(const struct nf_conntrack_tuple *tuple,
 	case IPPROTO_GRE: /* all fall though */
 	case IPPROTO_TCP:
 	case IPPROTO_UDP:
-	case IPPROTO_UDPLITE:
 	case IPPROTO_SCTP:
 		if (maniptype == NF_NAT_MANIP_SRC)
 			port = tuple->src.u.all;
@@ -612,7 +607,6 @@ static void nf_nat_l4proto_unique_tuple(struct nf_conntrack_tuple *tuple,
 		goto find_free_id;
 #endif
 	case IPPROTO_UDP:
-	case IPPROTO_UDPLITE:
 	case IPPROTO_TCP:
 	case IPPROTO_SCTP:
 		if (maniptype == NF_NAT_MANIP_SRC)
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 97c0f841fc96..07f51fe75fbe 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -79,23 +79,6 @@ static bool udp_manip_pkt(struct sk_buff *skb,
 	return true;
 }
 
-static bool udplite_manip_pkt(struct sk_buff *skb,
-			      unsigned int iphdroff, unsigned int hdroff,
-			      const struct nf_conntrack_tuple *tuple,
-			      enum nf_nat_manip_type maniptype)
-{
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-	struct udphdr *hdr;
-
-	if (skb_ensure_writable(skb, hdroff + sizeof(*hdr)))
-		return false;
-
-	hdr = (struct udphdr *)(skb->data + hdroff);
-	__udp_manip_pkt(skb, iphdroff, hdr, tuple, maniptype, true);
-#endif
-	return true;
-}
-
 static bool
 sctp_manip_pkt(struct sk_buff *skb,
 	       unsigned int iphdroff, unsigned int hdroff,
@@ -287,9 +270,6 @@ static bool l4proto_manip_pkt(struct sk_buff *skb,
 	case IPPROTO_UDP:
 		return udp_manip_pkt(skb, iphdroff, hdroff,
 				     tuple, maniptype);
-	case IPPROTO_UDPLITE:
-		return udplite_manip_pkt(skb, iphdroff, hdroff,
-					 tuple, maniptype);
 	case IPPROTO_SCTP:
 		return sctp_manip_pkt(skb, iphdroff, hdroff,
 				      tuple, maniptype);
diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index fd8652aa7e88..dca6826af7de 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -457,7 +457,6 @@ static int cttimeout_default_get(struct sk_buff *skb,
 		timeouts = nf_tcp_pernet(info->net)->timeouts;
 		break;
 	case IPPROTO_UDP:
-	case IPPROTO_UDPLITE:
 		timeouts = nf_udp_pernet(info->net)->timeouts;
 		break;
 	case IPPROTO_ICMPV6:
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 425525b90ac9..60ee8d932fcb 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -1252,7 +1252,6 @@ static int nft_ct_expect_obj_init(const struct nft_ctx *ctx,
 	switch (priv->l4proto) {
 	case IPPROTO_TCP:
 	case IPPROTO_UDP:
-	case IPPROTO_UDPLITE:
 	case IPPROTO_DCCP:
 	case IPPROTO_SCTP:
 		break;
-- 
2.52.0
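
As a reading aid, the checksum-coverage validation that the removed udplite_error() performed can be modeled in userspace. This is a sketch under assumptions: `udplite_coverage()` and `UDP_HDR_LEN` are hypothetical names, the length field is taken in host byte order, and the caller is assumed to have already verified that a full 8-byte header is present.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_HDR_LEN 8u	/* size of the UDP/UDP-Lite header in bytes */

/*
 * len_field: the UDP-Lite coverage field, already in host byte order.
 * udplen:    bytes from the start of the UDP-Lite header to the packet end.
 * Returns true and stores the effective coverage in *cscov when valid.
 */
static bool udplite_coverage(uint16_t len_field, size_t udplen, size_t *cscov)
{
	size_t cov = len_field;

	if (cov == 0)
		cov = udplen;		/* zero means "cover the whole datagram" */
	else if (cov < UDP_HDR_LEN || cov > udplen)
		return false;		/* invalid checksum coverage */

	*cscov = cov;
	return true;
}
```

The sketch covers only the coverage rule; the removed code additionally rejected a zero checksum and, on PRE_ROUTING, verified the partial checksum.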



* [PATCH net-next 09/11] netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: "Gustavo A. R. Silva" <gustavoars@kernel.org>

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it globally.

Use the TRAILING_OVERLAP() helper to fix the following warnings:

1 net/netfilter/x_tables.c:816:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/netfilter/x_tables.c:811:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

This helper creates a union between a flexible-array member (FAM)
and a set of members that would otherwise follow it. This overlays
the trailing members onto the FAM while preserving the original
memory layout.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/x_tables.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index b39017c80548..9f837fb5ceb4 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -819,13 +819,17 @@ EXPORT_SYMBOL_GPL(xt_compat_match_to_user);
 
 /* non-compat version may have padding after verdict */
 struct compat_xt_standard_target {
-	struct compat_xt_entry_target t;
-	compat_uint_t verdict;
+	/* Must be last as it ends in a flexible-array member. */
+	TRAILING_OVERLAP(struct compat_xt_entry_target, t, data,
+		compat_uint_t verdict;
+	);
 };
 
 struct compat_xt_error_target {
-	struct compat_xt_entry_target t;
-	char errorname[XT_FUNCTION_MAXNAMELEN];
+	/* Must be last as it ends in a flexible-array member. */
+	TRAILING_OVERLAP(struct compat_xt_entry_target, t, data,
+		char errorname[XT_FUNCTION_MAXNAMELEN];
+	);
 };
 
 int xt_compat_check_entry_offsets(const void *base, const char *elems,
-- 
2.52.0
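
The union overlay described in the commit message can be illustrated with a hand-written userspace equivalent. This is a sketch, not the kernel macro: `struct entry_target`, `struct standard_target` and the `pad` member are hypothetical, the layout is simplified, and it relies on the GNU C extension that allows a structure ending in a flexible-array member to be embedded in another type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header type ending in a flexible-array member (FAM). */
struct entry_target {
	uint16_t target_size;
	char name[30];
	unsigned char data[];
};

/*
 * Hand-written overlay: `verdict` is placed in a union with `t`, padded
 * so that it starts exactly where t.data starts. The struct with the FAM
 * is now the last (and only) top-level member of the outer type.
 */
struct standard_target {
	union {
		struct entry_target t;
		struct {
			unsigned char pad[offsetof(struct entry_target, data)];
			uint32_t verdict;
		};
	};
};

static size_t data_offset(void)
{
	return offsetof(struct entry_target, data);
}

static size_t verdict_offset(void)
{
	return offsetof(struct standard_target, verdict);
}
```

In this model `verdict` lands at the same offset it would have had as a plain trailing member, which is the "preserving the original memory layout" property the commit message refers to.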



* [PATCH net-next 10/11] netfilter: nft_fwd_netdev: check ttl/hl before forwarding
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

Drop packets if their ttl/hl is too small for forwarding.

Fixes: d32de98ea70f ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nft_fwd_netdev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c
index ad48dcd45abe..4bce36c3a6a0 100644
--- a/net/netfilter/nft_fwd_netdev.c
+++ b/net/netfilter/nft_fwd_netdev.c
@@ -116,6 +116,11 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
 			goto out;
 		}
 		iph = ip_hdr(skb);
+		if (iph->ttl <= 1) {
+			verdict = NF_DROP;
+			goto out;
+		}
+
 		ip_decrease_ttl(iph);
 		neigh_table = NEIGH_ARP_TABLE;
 		break;
@@ -132,6 +137,11 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
 			goto out;
 		}
 		ip6h = ipv6_hdr(skb);
+		if (ip6h->hop_limit <= 1) {
+			verdict = NF_DROP;
+			goto out;
+		}
+
 		ip6h->hop_limit--;
 		neigh_table = NEIGH_ND_TABLE;
 		break;
-- 
2.52.0
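
The added check can be modeled in a few lines of userspace C. This is a sketch with a hypothetical helper name, `forward_ttl()`; it leaves out the IPv4 header checksum update that the kernel's ip_decrease_ttl() also performs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns false if the packet must be dropped because its TTL / hop limit
 * is too small to forward; otherwise decrements it and returns true.
 */
static bool forward_ttl(uint8_t *ttl)
{
	if (*ttl <= 1)
		return false;	/* would reach 0, or wrap around from 0 */

	(*ttl)--;
	return true;
}
```

Without the guard, an 8-bit field holding 0 would wrap to 255 on decrement, and a value of 1 would be forwarded with a TTL of 0.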



* [PATCH net-next 11/11] netfilter: require Ethernet MAC header before using eth_hdr()
From: Florian Westphal @ 2026-04-10 11:23 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netfilter-devel, pablo
In-Reply-To: <20260410112352.23599-1-fw@strlen.de>

From: Zhengchuan Liang <zcliangcn@gmail.com>

`ip6t_eui64`, `xt_mac`, the `bitmap:ip,mac`, `hash:ip,mac`, and
`hash:mac` ipset types, and `nf_log_syslog` access `eth_hdr(skb)`
after either assuming that the skb is associated with an Ethernet
device or checking only that the `ETH_HLEN` bytes at
`skb_mac_header(skb)` lie between `skb->head` and `skb->data`.

Make these paths first verify that the skb is associated with an
Ethernet device, that the MAC header was set, and that it spans at
least a full Ethernet header before accessing `eth_hdr(skb)`.

Suggested-by: Florian Westphal <fw@strlen.de>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv6/netfilter/ip6t_eui64.c           | 7 +++++--
 net/netfilter/ipset/ip_set_bitmap_ipmac.c | 5 +++--
 net/netfilter/ipset/ip_set_hash_ipmac.c   | 9 +++++----
 net/netfilter/ipset/ip_set_hash_mac.c     | 5 +++--
 net/netfilter/nf_log_syslog.c             | 8 +++++++-
 net/netfilter/xt_mac.c                    | 4 +---
 6 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/netfilter/ip6t_eui64.c b/net/ipv6/netfilter/ip6t_eui64.c
index da69a27e8332..bbb684f9964c 100644
--- a/net/ipv6/netfilter/ip6t_eui64.c
+++ b/net/ipv6/netfilter/ip6t_eui64.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/ipv6.h>
+#include <linux/if_arp.h>
 #include <linux/if_ether.h>
 
 #include <linux/netfilter/x_tables.h>
@@ -21,8 +22,10 @@ eui64_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	unsigned char eui64[8];
 
-	if (!(skb_mac_header(skb) >= skb->head &&
-	      skb_mac_header(skb) + ETH_HLEN <= skb->data)) {
+	if (!skb->dev || skb->dev->type != ARPHRD_ETHER)
+		return false;
+
+	if (!skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN) {
 		par->hotdrop = true;
 		return false;
 	}
diff --git a/net/netfilter/ipset/ip_set_bitmap_ipmac.c b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
index 2c625e0f49ec..752f59ef8744 100644
--- a/net/netfilter/ipset/ip_set_bitmap_ipmac.c
+++ b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
@@ -11,6 +11,7 @@
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
 #include <linux/errno.h>
+#include <linux/if_arp.h>
 #include <linux/if_ether.h>
 #include <linux/netlink.h>
 #include <linux/jiffies.h>
@@ -220,8 +221,8 @@ bitmap_ipmac_kadt(struct ip_set *set, const struct sk_buff *skb,
 		return -IPSET_ERR_BITMAP_RANGE;
 
 	/* Backward compatibility: we don't check the second flag */
-	if (skb_mac_header(skb) < skb->head ||
-	    (skb_mac_header(skb) + ETH_HLEN) > skb->data)
+	if (!skb->dev || skb->dev->type != ARPHRD_ETHER ||
+	    !skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
 		return -EINVAL;
 
 	e.id = ip_to_id(map, ip);
diff --git a/net/netfilter/ipset/ip_set_hash_ipmac.c b/net/netfilter/ipset/ip_set_hash_ipmac.c
index 467c59a83c0a..b9a2681e2488 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmac.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmac.c
@@ -11,6 +11,7 @@
 #include <linux/skbuff.h>
 #include <linux/errno.h>
 #include <linux/random.h>
+#include <linux/if_arp.h>
 #include <linux/if_ether.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
@@ -89,8 +90,8 @@ hash_ipmac4_kadt(struct ip_set *set, const struct sk_buff *skb,
 	struct hash_ipmac4_elem e = { .ip = 0, { .foo[0] = 0, .foo[1] = 0 } };
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
-	if (skb_mac_header(skb) < skb->head ||
-	    (skb_mac_header(skb) + ETH_HLEN) > skb->data)
+	if (!skb->dev || skb->dev->type != ARPHRD_ETHER ||
+	    !skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
 		return -EINVAL;
 
 	if (opt->flags & IPSET_DIM_TWO_SRC)
@@ -205,8 +206,8 @@ hash_ipmac6_kadt(struct ip_set *set, const struct sk_buff *skb,
 	};
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
-	if (skb_mac_header(skb) < skb->head ||
-	    (skb_mac_header(skb) + ETH_HLEN) > skb->data)
+	if (!skb->dev || skb->dev->type != ARPHRD_ETHER ||
+	    !skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
 		return -EINVAL;
 
 	if (opt->flags & IPSET_DIM_TWO_SRC)
diff --git a/net/netfilter/ipset/ip_set_hash_mac.c b/net/netfilter/ipset/ip_set_hash_mac.c
index 718814730acf..41a122591fe2 100644
--- a/net/netfilter/ipset/ip_set_hash_mac.c
+++ b/net/netfilter/ipset/ip_set_hash_mac.c
@@ -8,6 +8,7 @@
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
 #include <linux/errno.h>
+#include <linux/if_arp.h>
 #include <linux/if_ether.h>
 #include <net/netlink.h>
 
@@ -77,8 +78,8 @@ hash_mac4_kadt(struct ip_set *set, const struct sk_buff *skb,
 	struct hash_mac4_elem e = { { .foo[0] = 0, .foo[1] = 0 } };
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
-	if (skb_mac_header(skb) < skb->head ||
-	    (skb_mac_header(skb) + ETH_HLEN) > skb->data)
+	if (!skb->dev || skb->dev->type != ARPHRD_ETHER ||
+	    !skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
 		return -EINVAL;
 
 	if (opt->flags & IPSET_DIM_ONE_SRC)
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 0507d67cad27..7a8952b049d1 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -78,7 +78,10 @@ dump_arp_packet(struct nf_log_buf *m,
 	else
 		logflags = NF_LOG_DEFAULT_MASK;
 
-	if (logflags & NF_LOG_MACDECODE) {
+	if ((logflags & NF_LOG_MACDECODE) &&
+	    skb->dev && skb->dev->type == ARPHRD_ETHER &&
+	    skb_mac_header_was_set(skb) &&
+	    skb_mac_header_len(skb) >= ETH_HLEN) {
 		nf_log_buf_add(m, "MACSRC=%pM MACDST=%pM ",
 			       eth_hdr(skb)->h_source, eth_hdr(skb)->h_dest);
 		nf_log_dump_vlan(m, skb);
@@ -797,6 +800,9 @@ static void dump_mac_header(struct nf_log_buf *m,
 
 	switch (dev->type) {
 	case ARPHRD_ETHER:
+		if (!skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
+			return;
+
 		nf_log_buf_add(m, "MACSRC=%pM MACDST=%pM ",
 			       eth_hdr(skb)->h_source, eth_hdr(skb)->h_dest);
 		nf_log_dump_vlan(m, skb);
diff --git a/net/netfilter/xt_mac.c b/net/netfilter/xt_mac.c
index 81649da57ba5..4798cd2ca26e 100644
--- a/net/netfilter/xt_mac.c
+++ b/net/netfilter/xt_mac.c
@@ -29,9 +29,7 @@ static bool mac_mt(const struct sk_buff *skb, struct xt_action_param *par)
 
 	if (skb->dev == NULL || skb->dev->type != ARPHRD_ETHER)
 		return false;
-	if (skb_mac_header(skb) < skb->head)
-		return false;
-	if (skb_mac_header(skb) + ETH_HLEN > skb->data)
+	if (!skb_mac_header_was_set(skb) || skb_mac_header_len(skb) < ETH_HLEN)
 		return false;
 	ret  = ether_addr_equal(eth_hdr(skb)->h_source, info->srcaddr);
 	ret ^= info->invert;
-- 
2.52.0
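
The three conditions the patch now requires before eth_hdr() is used can be sketched as a userspace model. Everything here is illustrative: `struct model_skb`, `struct model_dev` and `has_eth_hdr()` are hypothetical, and the two small helpers only approximate what skb_mac_header_was_set() and skb_mac_header_len() are understood to compute (an "unset" sentinel offset, and network header offset minus MAC header offset).

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_HLEN		14	/* bytes in an Ethernet header */
#define ARPHRD_ETHER		1	/* Ethernet hardware type */
#define MAC_HEADER_UNSET	((uint16_t)~0U)	/* model of "not set" */

struct model_dev {
	unsigned short type;
};

/* Minimal stand-in for the sk_buff fields the checks look at. */
struct model_skb {
	const struct model_dev *dev;
	uint16_t mac_header;		/* offset from the buffer head */
	uint16_t network_header;	/* offset from the buffer head */
};

static bool mac_header_was_set(const struct model_skb *skb)
{
	return skb->mac_header != MAC_HEADER_UNSET;
}

static unsigned int mac_header_len(const struct model_skb *skb)
{
	return skb->network_header - skb->mac_header;
}

/* True only when reading an Ethernet header from the skb is safe. */
static bool has_eth_hdr(const struct model_skb *skb)
{
	if (!skb->dev || skb->dev->type != ARPHRD_ETHER)
		return false;

	return mac_header_was_set(skb) && mac_header_len(skb) >= ETH_HLEN;
}
```

The older pointer-range test only established that ETH_HLEN bytes lie between skb->head and skb->data; it said nothing about whether those bytes are an Ethernet header, which is what the device-type and header-length checks add.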



* Re: [PATCH net-next v3 04/12] net: airoha: Rely on net_device pointer in HTB callbacks
From: Benjamin Larsson @ 2026-04-10 11:29 UTC (permalink / raw)
  To: Jakub Kicinski, lorenzo
  Cc: andrew+netdev, davem, edumazet, pabeni, robh, krzk+dt, conor+dt,
	ansuelsmth, linux-arm-kernel, linux-mediatek, netdev, devicetree
In-Reply-To: <20260410025722.1809973-1-kuba@kernel.org>

On 10/04/2026 04:57, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email says "For your consideration."
> ---
> net: airoha: Introduce WAN device flag
> 
> This commit introduces a WAN flag to specify whether a device is used to
> transmit or receive WAN or LAN traffic. It enables specifying multiple LAN
> devices but limits the configuration to a single WAN device.
> 

[...]

>> +		fallthrough;
>> +	}
>> +	case AIROHA_GDM2_IDX:
>> +		/* GDM2 is always used as wan */
>> +		dev->flags |= PRIV_FLAG_WAN;
>>   		break;
>>   	default:
>>   		break;
>>   	}
>>   
>> -	for (i = 0; i < eth->soc->num_ppe; i++)
>> -		airoha_ppe_set_cpu_port(dev, i, airoha_get_fe_port(dev));
>> +	airoha_dev_set_qdma(dev);
>> +	airoha_set_macaddr(dev, netdev->dev_addr);
> 
> Since the commit message notes it is possible to specify multiple LAN
> devices, does unconditionally calling airoha_set_macaddr() for each
> device overwrite the single shared hardware LAN MAC register?
> 
> If so, could this cause only the last initialized LAN device to have its MAC
> correctly programmed into the hardware, leading to dropped unicast
> packets for the other LAN interfaces?
> 
>> +
>> +	if (!airoha_is_lan_gdm_dev(dev) &&
>> +	    (port->id == AIROHA_GDM3_IDX || port->id == AIROHA_GDM4_IDX)) {
>> +		int err;
> 
> [ ... ]

Hi, the frame engine can be configured with a range of wan mac addresses 
and a range of lan mac addresses via registers that set the top 24 bits 
and then 2 registers that form a range of the low 24 bits.

Documentation says that packets whose MAC address falls into this range, 
on either the LAN side or the WAN side, are treated as layer 3 packets; 
a packet outside the range is handled as a layer 2 packet.
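
The range match described above can be sketched in userspace C. This is only an illustration: `struct mac_range`, its field names and `mac_in_range()` are hypothetical, the real register layout is not shown in this thread, and treating both bounds as inclusive is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the three registers that define one range. */
struct mac_range {
	uint32_t high24;	/* top 24 bits shared by every address */
	uint32_t low_min;	/* lowest accepted value of the low 24 bits */
	uint32_t low_max;	/* highest accepted value of the low 24 bits */
};

/* True if the address falls inside the configured range. */
static bool mac_in_range(const uint8_t mac[6], const struct mac_range *r)
{
	uint32_t hi = (uint32_t)mac[0] << 16 | (uint32_t)mac[1] << 8 | mac[2];
	uint32_t lo = (uint32_t)mac[3] << 16 | (uint32_t)mac[4] << 8 | mac[5];

	return hi == r->high24 && lo >= r->low_min && lo <= r->low_max;
}
```

Under this model several LAN devices could all match if their addresses share the top 24 bits and fall inside one low-24-bit window.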

The exact implications of this, and whether it actually matters, are 
unknown. But traffic that comes in on an interface and is not matched by 
an acceleration flow is usually forwarded to the CPU for further processing.

MvH
Benjamin Larsson

