Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net 1/1] 8021q: free cleared egress QoS mappings safely
From: Eric Dumazet @ 2026-04-15 16:25 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, andrew+netdev, davem, kuba, pabeni, horms, kees,
	yifanwucs, tomapufckgml, yuantan098, bird, ylong030
In-Reply-To: <b877895cd02d35254b5c05d3c40abbf130cd87eb.1776039122.git.ylong030@ucr.edu>

On Mon, Apr 13, 2026 at 2:08 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Longxuan Yu <ylong030@ucr.edu>
>
> vlan_dev_set_egress_priority() leaves cleared egress priority mapping
> nodes in the hash until device teardown. Repeated set/clear cycles with
> distinct skb priorities therefore allocate an unbounded number of
> vlan_priority_tci_mapping objects and leak memory.
>
> Delete mappings when vlan_prio is cleared instead of keeping
> tombstones. The TX fast path and reporting paths walk the lists without
> RTNL, so convert the egress mapping lists to RCU-protected pointers and
> defer freeing removed nodes until after a grace period.
>
> Cc: stable@kernel.org
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
> Suggested-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
>  include/linux/if_vlan.h  | 23 +++++++++++--------
>  net/8021q/vlan_dev.c     | 48 +++++++++++++++++++++++-----------------
>  net/8021q/vlan_netlink.c |  9 +++-----
>  net/8021q/vlanproc.c     | 12 ++++++----

>
> @@ -604,11 +606,17 @@ void vlan_dev_free_egress_priority(const struct net_device *dev)
>         int i;
>
>         for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) {
> -               while ((pm = vlan->egress_priority_map[i]) != NULL) {
> -                       vlan->egress_priority_map[i] = pm->next;
> -                       kfree(pm);
> +               pm = rtnl_dereference(vlan->egress_priority_map[i]);
> +               RCU_INIT_POINTER(vlan->egress_priority_map[i], NULL);
> +               while (pm) {
> +                       struct vlan_priority_tci_mapping *next;
> +
> +                       next = rtnl_dereference(pm->next);
> +                       kfree_rcu_mightsleep(pm);

Please avoid kfree_rcu_mightsleep().

Embed  instead one rcu_head in the object.

> +                       pm = next;
>                 }
>         }
> +       vlan->nr_egress_mappings = 0;

^ permalink raw reply

* Re: [PATCH v4] nfc: hci: fix out-of-bounds read in HCP header parsing
From: Simon Horman @ 2026-04-15 16:26 UTC (permalink / raw)
  To: Ashutosh Desai; +Cc: netdev, kuba, edumazet, davem, pabeni, linux-kernel
In-Reply-To: <177614425081.3600288.2536320552978506086@gmail.com>

On Tue, Apr 14, 2026 at 05:24:10AM -0000, Ashutosh Desai wrote:
> nfc_hci_recv_from_llc() and nci_hci_data_received_cb() cast skb->data
> to struct hcp_packet and read the message header byte without checking
> that enough data is present in the linear sk_buff area. A malicious NFC
> peer can send a 1-byte HCP frame that passes through the SHDLC layer
> and reaches these functions, causing an out-of-bounds heap read.
> 
> Fix this by adding pskb_may_pull() before each cast to ensure the full
> 2-byte HCP header is pulled into the linear area before it is accessed.
> 
> Fixes: 8b8d2e08bf0d ("NFC: HCI support")
> Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>

Unfortunately this patch seems to be whitespace-damaged
and does not apply. Please address that and repost.

-- 
pw-bot: changes-requested

^ permalink raw reply

* Re: [PATCH net v5] net: stmmac: Prevent NULL deref when RX memory exhausted
From: Russell King (Oracle) @ 2026-04-15 16:28 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue, Maxime Chevallier,
	Ovidiu Panait, Vladimir Oltean, Baruch Siach, Serge Semin,
	Giuseppe Cavallaro, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, stable
In-Reply-To: <ad-LAB08-_rpmMzK@shell.armlinux.org.uk>

On Wed, Apr 15, 2026 at 01:56:32PM +0100, Russell King (Oracle) wrote:
> On Tue, Apr 14, 2026 at 07:39:47PM -0700, Sam Edwards wrote:
> > The CPU receives frames from the MAC through conventional DMA: the CPU
> > allocates buffers for the MAC, then the MAC fills them and returns
> > ownership to the CPU. For each hardware RX queue, the CPU and MAC
> > coordinate through a shared ring array of DMA descriptors: one
> > descriptor per DMA buffer. Each descriptor includes the buffer's
> > physical address and a status flag ("OWN") indicating which side owns
> > the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
> > the flag and the MAC is only allowed to clear it, and both must move
> > through the ring in sequence: thus the ring is used for both
> > "submissions" and "completions."
> > 
> > In the stmmac driver, stmmac_rx() bookmarks its position in the ring
> > with the `cur_rx` index. The main receive loop in that function checks
> > for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
> > network stack (NULLing the pointer), and increments `cur_rx` modulo the
> > ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
> > position with `dirty_rx`, allocates fresh buffers and rearms the
> > descriptors (setting OWN=1). If it fails any allocation, it simply stops
> > early (leaving OWN=0) and will retry where it left off when next called.
> > 
> > This means descriptors have a three-stage lifecycle (terms my own):
> > - `empty` (OWN=1, buffer valid)
> > - `full` (OWN=0, buffer valid and populated)
> > - `dirty` (OWN=0, buffer NULL)
> > 
> > But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
> > the past (see 'Fixes:'), there was a bug where the loop could cycle
> > `cur_rx` all the way back to the first descriptor it dirtied, resulting
> > in a NULL dereference when mistaken for `full`. The aforementioned
> > commit resolved that *specific* failure by capping the loop's iteration
> > limit at `dma_rx_size - 1`, but this is only a partial fix: if the
> > previous stmmac_rx_refill() didn't complete, then there are leftover
> > `dirty` descriptors that the loop might encounter without needing to
> > cycle fully around. The current code therefore panics (see 'Closes:')
> > when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
> > catch up to `dirty_rx`.
> > 
> > Fix this by further tightening the clamp from `dma_rx_size - 1` to
> > `dma_rx_size - stmmac_rx_dirty() - 1`, subtracting any remnant dirty
> > entries and limiting the loop so that `cur_rx` cannot catch back up to
> > `dirty_rx`. This carries no risk of arithmetic underflow: since the
> > maximum possible return value of stmmac_rx_dirty() is `dma_rx_size - 1`,
> > the worst the clamp can do is prevent the loop from running at all.
> > 
> > Fixes: b6cb4541853c7 ("net: stmmac: avoid rx queue overrun")
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> 
> Locally, while debugging my issues, I used this to prevent cur_rx
> catching up with dirty_rx:
> 
>                 status = stmmac_rx_status(priv, &priv->xstats, p);
>                 /* check if managed by the DMA otherwise go ahead */
>                 if (unlikely(status & dma_own))
>                         break;
> 
>                 next_entry = STMMAC_NEXT_ENTRY(rx_q->cur_rx,
>                                                priv->dma_conf.dma_rx_size);
>                 if (unlikely(next_entry == rx_q->dirty_rx))
>                         break;
> 
>                 rx_q->cur_rx = next_entry;
> 
> If we care about the cost of reloading rx_q->dirty_rx on every
> iteration, then I'd suggest that the cost we already incur reading and
> writing rx_q->cur_rx is something that should be addressed, and
> eliminating that would counter the cost of reading rx_q->dirty_rx. I
> suspect, however, that the cost is minimal, as cur_tx and dirty_rx are
> likely in the same cache line.
> 
> It looks like any fix to stmmac_rx() will also need a corresponding
> fix for stmmac_rx_zc().

I have some further information, but a new curveball has just been
chucked... and I've no idea what this will mean at this stage. Just
take it that I won't be responding for a while.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH iwl-net] ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
From: Simon Horman @ 2026-04-15 16:30 UTC (permalink / raw)
  To: Petr Oros
  Cc: netdev, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Aleksandr Loktionov, Nikolay Aleksandrov, Daniel Zahka,
	Paul Greenwalt, Dave Ertman, Michal Swiatkowski, jacob.e.keller,
	intel-wired-lan, linux-kernel
In-Reply-To: <20260413191420.3524013-1-poros@redhat.com>

On Mon, Apr 13, 2026 at 09:14:20PM +0200, Petr Oros wrote:
> On certain E810 configurations where firmware supports Tx scheduler
> topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
> may need to apply a new 5-layer or 9-layer topology from the DDP
> package. If the AQ command to set the topology fails (e.g. due to
> invalid DDP data or firmware limitations), the global configuration
> lock must still be cleared via a CORER reset.
> 
> Commit 86aae43f21cf ("ice: don't leave device non-functional if Tx
> scheduler config fails") correctly fixed this by refactoring
> ice_cfg_tx_topo() to always trigger CORER after acquiring the global
> lock and re-initialize hardware via ice_init_hw() afterwards.
> 
> However, commit 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end
> of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
> breaking the reinit path introduced by 86aae43f21cf. This creates an
> infinite recursive call chain:
> 
>   ice_init_hw()
>     ice_init_dev_hw()
>       ice_cfg_tx_topo()         # topology change needed
>         ice_deinit_hw()
>         ice_init_hw()           # reinit after CORER
>           ice_init_dev_hw()     # recurse
>             ice_cfg_tx_topo()
>               ...               # stack overflow
> 
> Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
> it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
> caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
> during its reinit, it only needs the core HW reinitialization. This
> breaks the recursion cleanly without adding flags or guards.
> 
> The deinit ordering changes from commit 8a37f9e2ff40 ("ice: move
> ice_deinit_dev() to the end of deinit paths") which fixed slow rmmod
> are preserved, only the init-side placement of ice_init_dev_hw() is
> reverted.
> 
> Fixes: 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end of deinit paths")
> Signed-off-by: Petr Oros <poros@redhat.com>

Hi Petr,

I don't intended to delay this patch.
But could you follow-up by looking over the AI generated
review of this patch on sashiko.dev?

Thanks!

^ permalink raw reply

* Re: [PATCH net] ixgbevf: fix use-after-free in VEPA multicast source pruning
From: Michael Bommarito @ 2026-04-15 16:30 UTC (permalink / raw)
  To: Simon Horman
  Cc: intel-wired-lan, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, stable, linux-kernel
In-Reply-To: <20260415161720.GN772670@horms.kernel.org>

On Wed, Apr 15, 2026 at 12:17 PM Simon Horman <horms@kernel.org> wrote:
> Sashiko flags a number of issues in the same function that
> do not seem related to your patch.
>
> I'd suggest looking over them if you are interested in
> follow-up work in this area.

Sure, I'd be happy to keep going here if you're open to more hardening
patches.

Two Qs for you:

1. Do you want smaller patches for each or bigger method-level patches?

2. Anything on my list below that you would *not* want me touching?
I'll combine with anything I can find from your Sashiko items

    1. line 104
       rule:   semgrep bug-on-in-net-code (CWE-617)
       match:  BUG_ON(!test_bit(__IXGBEVF_SERVICE_SCHED,
&adapter->state))
       where:  ixgbevf_service_event_schedule()
       status: untriaged

    2. lines 1219-1225
       rule:   net-drop-continue-in-loop + scan_drop_continue_loops.py
       match:  VEPA multicast pruning kfree_skb + continue (UAF)
       where:  ixgbevf_clean_rx_irq()
       status: SHIPPED as commit ca62ac02b30d (this patch)

    3. line 2769
       rule:   semgrep signed-int-as-size-param-kmalloc
       match:  q_vector = kzalloc(size, GFP_KERNEL)  (signed size)
       status: untriaged

    4. line 3452
       rule:   semgrep signed-int-as-size-param-kmalloc
       match:  tx_ring->tx_buffer_info = vmalloc(size)  (signed size)
       status: untriaged

    5. line 3530
       rule:   semgrep signed-int-as-size-param-kmalloc
       match:  rx_ring->rx_buffer_info = vmalloc(size)  (signed size)
       status: untriaged

    6. line 4114
       rule:   semgrep narrow-accumulator-overflow
       match:  i += tx_ring->count;
       status: untriaged

    7. line 4189
       rule:   semgrep narrow-accumulator-overflow
       match:  count += TXD_USE_COUNT(skb_frag_size(frag));
       status: untriaged

    8. line 4192
       rule:   semgrep narrow-accumulator-overflow
       match:  count += skb_shinfo(skb)->nr_frags;
       status: untriaged

    9. line 4695
       rule:   coccinelle cancel_work.cocci
       match:  INIT_WORK(&adapter->service_task, ixgbevf_service_task)
               with no matching cancel_work_sync on teardown path
       status: untriaged

   10. line 4752
       rule:   coccinelle null_after_free.cocci
       where:  ixgbevf_probe() err_dma path
       status: untriaged

   11. line 4795
       rule:   coccinelle null_after_free.cocci
       where:  ixgbevf_remove()
       status: untriaged

^ permalink raw reply

* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Simon Horman @ 2026-04-15 16:46 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org>

On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> 
> Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> Changes in v2:
> - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
>   order to avoid any possible NULL pointer dereference in
>   airoha_qdma_cleanup_tx_queue()

This seems to be a separate issue.
If so, I think it should be split out into a separate patch.

> - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org

I think it was covered in the review Jakub forwarded for v1.  But FTR,
Sashiko has some feedback on this patch in the form of an existing bug
(that should almost certainly be handled separately from this patch).

> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
>  1 file changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 9e995094c32a..3c1a2bc68c42 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
>  	dma_addr_t dma_addr;
>  
>  	spin_lock_init(&q->lock);
> -	q->ndesc = size;
>  	q->qdma = qdma;
>  	q->free_thr = 1 + MAX_SKB_FRAGS;
>  	INIT_LIST_HEAD(&q->tx_list);
>  
> -	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> +	q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
>  				GFP_KERNEL);
>  	if (!q->entry)
>  		return -ENOMEM;
>  
> -	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> +	q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
>  				      &dma_addr, GFP_KERNEL);
>  	if (!q->desc)
>  		return -ENOMEM;
>  
> -	for (i = 0; i < q->ndesc; i++) {
> +	for (i = 0; i < size; i++) {
>  		u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
>  
>  		list_add_tail(&q->entry[i].list, &q->tx_list);
>  		WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
>  	}
> +	q->ndesc = size;
>  
>  	/* xmit ring drop default setting */
>  	airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
>  
>  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  {
> -	struct airoha_eth *eth = q->qdma->eth;
> -	int i;
> +	struct airoha_qdma *qdma = q->qdma;
> +	struct airoha_eth *eth = qdma->eth;
> +	int i, qid = q - &qdma->q_tx[0];
> +	struct airoha_queue_entry *e;
> +	u16 index = 0;
>  
>  	spin_lock_bh(&q->lock);
>  	for (i = 0; i < q->ndesc; i++) {
> -		struct airoha_queue_entry *e = &q->entry[i];

super nit: In v2 e is always used within a block (here and in the hunk below).
           So I would lean towards declaring e in the blocks where it is
	   used.

	   No need to repost just for this!

> +		struct airoha_qdma_desc *desc = &q->desc[i];
>  
> +		e = &q->entry[i];
>  		if (!e->dma_addr)
>  			continue;
>  
> @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  		e->dma_addr = 0;
>  		e->skb = NULL;
>  		list_add_tail(&e->list, &q->tx_list);
> +
> +		/* Reset DMA descriptor */
> +		WRITE_ONCE(desc->ctrl, 0);
> +		WRITE_ONCE(desc->addr, 0);
> +		WRITE_ONCE(desc->data, 0);
> +		WRITE_ONCE(desc->msg0, 0);
> +		WRITE_ONCE(desc->msg1, 0);
> +		WRITE_ONCE(desc->msg2, 0);
> +
>  		q->queued--;
>  	}
> +
> +	if (!list_empty(&q->tx_list)) {
> +		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> +				     list);
> +		index = e - q->entry;
> +	}
> +	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> +	 * empty.
> +	 */
> +	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> +			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> +	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> +			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> +
>  	spin_unlock_bh(&q->lock);
>  }
>  
> 
> ---
> base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> 
> Best regards,
> -- 
> Lorenzo Bianconi <lorenzo@kernel.org>
> 

^ permalink raw reply

* Re: [PATCH nf] netfilter: nf_tables: use RCU-safe list primitives for basechain hook list
From: Pablo Neira Ayuso @ 2026-04-15 16:54 UTC (permalink / raw)
  To: Weiming Shi
  Cc: Florian Westphal, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
	netdev, linux-kernel, Xiang Mei
In-Reply-To: <20260410101321.915190-2-bestswngs@gmail.com>

On Fri, Apr 10, 2026 at 06:13:22PM +0800, Weiming Shi wrote:
> NFT_MSG_GETCHAIN runs as an NFNL_CB_RCU callback, so chain dumps
> traverse basechain->hook_list under rcu_read_lock() without holding
> commit_mutex. Meanwhile, nft_delchain_hook() mutates that same live
> hook_list with plain list_move() and list_splice(), and the commit/abort
> paths splice hooks back with plain list_splice(). None of these are
> RCU-safe list operations.
> 
> A concurrent GETCHAIN dump can observe partially updated list pointers,
> follow them into stack-local or transaction-private list heads, and
> crash when container_of() produces a bogus struct nft_hook pointer.

For the record, v1 of proposed series to fix this is here:

https://patchwork.ozlabs.org/project/netfilter-devel/list/?series=499757

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH net] hv_sock: Report EOF instead of -EIO for FIN
From: Dexuan Cui @ 2026-04-15 16:55 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, niuxuewei.nxw@antgroup.com,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, Ben Hillis, Mitchell Levy
In-Reply-To: <ad9pPrji1uYSgNir@sgarzare-redhat>

> From: Stefano Garzarella <sgarzare@redhat.com>
> Sent: Wednesday, April 15, 2026 3:38 AM
> >@@ -703,8 +703,22 @@ static s64 hvs_stream_has_data(struct vsock_sock
> *vsk)
> > 	switch (hvs_channel_readable_payload(hvs->chan)) {
> > 	case 1:
> > 		need_refill = !hvs->recv_desc;
> >-		if (!need_refill)
> >-			return -EIO;
> >+		if (!need_refill) {
> 
> Can we drop `need_refill` entirly and just check `hvs->recv_desc` here?

OK. Will post v2 later today.

> Mainly because now the comment we are adding is confusing me about what
> `need_refill` means.
> 
> The rest LGTM.
> 
> Thanks,
> Stefano

Thanks for the review!


^ permalink raw reply

* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-15 16:58 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260415164643.GQ772670@horms.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]

On Apr 15, Simon Horman wrote:
> On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> > 
> > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > Changes in v2:
> > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> >   order to avoid any possible NULL pointer dereference in
> >   airoha_qdma_cleanup_tx_queue()
> 
> This seems to be a separate issue.
> If so, I think it should be split out into a separate patch.
> 
> > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
> 
> I think it was covered in the review Jakub forwarded for v1.  But FTR,
> Sashiko has some feedback on this patch in the form of an existing bug
> (that should almost certainly be handled separately from this patch).

Hi Simon,

I took a look to the Sashiko's report [0] but this issue is not introduced by
this patch and, even if it would be a better approach, I guess the hw is
capable of managing out-of-order TX descriptors. So I guess this patch is fine
in this way, agree?

[0] https://sashiko.dev/#/patchset/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022%40kernel.org

> 
> > ---
> >  drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> >  1 file changed, 34 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9e995094c32a..3c1a2bc68c42 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> >  	dma_addr_t dma_addr;
> >  
> >  	spin_lock_init(&q->lock);
> > -	q->ndesc = size;
> >  	q->qdma = qdma;
> >  	q->free_thr = 1 + MAX_SKB_FRAGS;
> >  	INIT_LIST_HEAD(&q->tx_list);
> >  
> > -	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> > +	q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> >  				GFP_KERNEL);
> >  	if (!q->entry)
> >  		return -ENOMEM;
> >  
> > -	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> > +	q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> >  				      &dma_addr, GFP_KERNEL);
> >  	if (!q->desc)
> >  		return -ENOMEM;
> >  
> > -	for (i = 0; i < q->ndesc; i++) {
> > +	for (i = 0; i < size; i++) {
> >  		u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
> >  
> >  		list_add_tail(&q->entry[i].list, &q->tx_list);
> >  		WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> >  	}
> > +	q->ndesc = size;
> >  
> >  	/* xmit ring drop default setting */
> >  	airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> > @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> >  
> >  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  {
> > -	struct airoha_eth *eth = q->qdma->eth;
> > -	int i;
> > +	struct airoha_qdma *qdma = q->qdma;
> > +	struct airoha_eth *eth = qdma->eth;
> > +	int i, qid = q - &qdma->q_tx[0];
> > +	struct airoha_queue_entry *e;
> > +	u16 index = 0;
> >  
> >  	spin_lock_bh(&q->lock);
> >  	for (i = 0; i < q->ndesc; i++) {
> > -		struct airoha_queue_entry *e = &q->entry[i];
> 
> super nit: In v2 e is always used within a block (here and in the hunk below).
>            So I would lean towards declaring e in the blocks where it is
> 	   used.
> 
> 	   No need to repost just for this!

I can fix it if I need to repost.

Regards,
Lorenzo

> 
> > +		struct airoha_qdma_desc *desc = &q->desc[i];
> >  
> > +		e = &q->entry[i];
> >  		if (!e->dma_addr)
> >  			continue;
> >  
> > @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  		e->dma_addr = 0;
> >  		e->skb = NULL;
> >  		list_add_tail(&e->list, &q->tx_list);
> > +
> > +		/* Reset DMA descriptor */
> > +		WRITE_ONCE(desc->ctrl, 0);
> > +		WRITE_ONCE(desc->addr, 0);
> > +		WRITE_ONCE(desc->data, 0);
> > +		WRITE_ONCE(desc->msg0, 0);
> > +		WRITE_ONCE(desc->msg1, 0);
> > +		WRITE_ONCE(desc->msg2, 0);
> > +
> >  		q->queued--;
> >  	}
> > +
> > +	if (!list_empty(&q->tx_list)) {
> > +		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > +				     list);
> > +		index = e - q->entry;
> > +	}
> > +	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > +	 * empty.
> > +	 */
> > +	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > +			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > +	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > +			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> > +
> >  	spin_unlock_bh(&q->lock);
> >  }
> >  
> > 
> > ---
> > base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> > change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> > 
> > Best regards,
> > -- 
> > Lorenzo Bianconi <lorenzo@kernel.org>
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH nf,v2 1/3] rculist: add list_splice_rcu() for private lists
From: Pablo Neira Ayuso @ 2026-04-15 17:08 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms, joelagnelf,
	josh, boqun, urezki, rostedt, mathieu.desnoyers, jiangshanlai,
	qiang.zhang, rcu

This patch adds a helper function, list_splice_rcu(), to safely splice
a private (non-RCU-protected) list into an RCU-protected list.

The function ensures that only the pointer visible to RCU readers
(prev->next) is updated using rcu_assign_pointer(), while the rest of
the list manipulations are performed with regular assignments, as the
source list is private and not visible to concurrent RCU readers.

This is useful for moving elements from a private list into a global
RCU-protected list, ensuring safe publication for RCU readers.
Subsystems with some sort of batching mechanism from userspace can
benefit from this new function.

The function __list_splice_rcu() has been added for clarity and to
follow the same pattern as in the existing list_splice*() interfaces,
where there is a check to ensure that that the list to splice is not
empty. Note that __list_splice_rcu() has no documentation for this
reason.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v2: including comments by Paul McKenney.

    Except, I have deliberately keep back the suggestion to squash
    __list_splice_rcu() into list_splice_rcu(), I instead removed
    the documentation for __list_splice_rcu(). I am looking
    at other existing list_splice*() function in list.h and rculist.h
    to get this aligned with __list_splice(), which also has no users
    in the tree and no documentation. I find it easier to read with
    __list_splice(), but if this explaination is not sound so...

    @Paul: I can post v3 squashing __list_splice_rcu(), just let me
           know.

    Thanks!

 include/linux/rculist.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 2abba7552605..e3bc44225692 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -261,6 +261,35 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }
 
+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	struct list_head *first = list->next;
+	struct list_head *last = list->prev;
+
+	last->next = next;
+	first->prev = prev;
+	next->prev = last;
+	rcu_assign_pointer(list_next_rcu(prev), first);
+}
+
+/**
+ * list_splice_rcu - splice a non-RCU list into an RCU-protected list,
+ *                   designed for stacks.
+ * @list:	the non RCU-protected list to splice
+ * @head:	the place in the existing RCU-protected list to splice
+ *
+ * The list pointed to by @head can be RCU-read traversed concurrently with
+ * this function.
+ */
+static inline void list_splice_rcu(struct list_head *list,
+				   struct list_head *head)
+{
+	if (!list_empty(list))
+		__list_splice_rcu(list, head, head->next);
+}
+
 /**
  * __list_splice_init_rcu - join an RCU-protected list into an existing list.
  * @list:	the RCU-protected list to splice
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH 1/1] xskmap: reject TX-only AF_XDP sockets
From: Yuan Tan @ 2026-04-15 17:22 UTC (permalink / raw)
  To: Jason Xing, Linpu Yu
  Cc: yuantan098, magnus.karlsson, maciej.fijalkowski, netdev, bpf, sdf,
	davem, edumazet, kuba, pabeni, horms, ast, daniel, hawk,
	john.fastabend, bjorn, linux-kernel, yifanwucs
In-Reply-To: <CAL+tcoAy5WQRsL6z=YinPaiBNvd_=WB7qsR4amf1x4=qVw7AAg@mail.gmail.com>


On 4/15/2026 1:43 AM, Jason Xing wrote:
> On Mon, Mar 30, 2026 at 3:33 AM Linpu Yu <linpu5433@gmail.com> wrote:
>> Reject TX-only AF_XDP sockets from XSKMAP updates. Redirected
>> packets always enter the Rx path, where the kernel expects the
>> selected socket to have an Rx ring. A TX-only socket can
>> currently be inserted into an XSKMAP, and redirecting a packet
>> to it crashes the kernel in xsk_generic_rcv().
>>
>> Keep TX-only AF_XDP sockets valid for pure Tx use, but prevent
>> them from being published through XSKMAP.
>>
>> Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Reported-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Xin Liu <bird@lzu.edu.cn>
>> Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
>> Signed-off-by: Linpu Yu <linpu5433@gmail.com>
> Hi Linpu,
>
> Any plan to post a v2 with our questions resolved?
>
> Thanks,
> Jason

Hi Jason, Linpu has an exam to take this week. He told he can try
preparing the v2 patch this weekend. Best,

Yuan


^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: tcp: Reject TCP_NODELAY from BPF hdr opt callbacks
From: Martin KaFai Lau @ 2026-04-15 17:31 UTC (permalink / raw)
  To: KaFai Wan
  Cc: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	ast, daniel, andrii, eddyz87, memxor, song, yonghong.song, jolsa,
	shuah, sdf, netdev, linux-kernel, bpf, linux-kselftest, Quan Sun,
	Yinhao Hu, Kaiyan Mei
In-Reply-To: <20260414112310.1285783-2-kafai.wan@linux.dev>

On Tue, Apr 14, 2026 at 07:23:09PM +0800, KaFai Wan wrote:
> A BPF_SOCK_OPS program can enable
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG and then call
> bpf_setsockopt(TCP_NODELAY) from BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> 
> That reaches __tcp_sock_set_nodelay(), which may call
> tcp_push_pending_frames(). The transmit path then computes TCP
> options again, re-enters bpf_skops_hdr_opt_len(), and invokes the
> same BPF callback recursively. This can loop until the kernel
> stack overflows.
> 
> TCP_NODELAY is not safe from the header option callback context.
> Reject it with -EOPNOTSUPP when TCP header option callbacks are
> enabled on the socket, so the callback cannot recurse back into
> tcp_push_pending_frames() through do_tcp_setsockopt().
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 7e41df5dbba2 ("bpf: Add a few optnames to bpf_setsockopt")
> Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
> ---
>  net/ipv4/tcp.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 202a4e57a218..7ac4c98be19d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4004,7 +4004,10 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
>  
>  	switch (optname) {
>  	case TCP_NODELAY:
> -		__tcp_sock_set_nodelay(sk, val);
> +		if (val && BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))

It will break the syscall setsockopt and also break the existing bpf prog
that calls bpf_setsockopt(TCP_NODELAY) in CB other than the
BPF_SOCK_OPS_HDR_OPT_LEN_CB/BPF_SOCK_OPS_WRITE_HDR_OPT_CB.

Lets brainstorm other options suggested on the list that have smaller
blast radius.

pw-bot: cr

> +			err = -EOPNOTSUPP;
> +		else
> +			__tcp_sock_set_nodelay(sk, val);
>  		break;
>  
>  	case TCP_THIN_LINEAR_TIMEOUTS:
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Sam Edwards @ 2026-04-15 17:38 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Jakub Kicinski, Andrew Lunn, Alexandre Torgue, Andrew Lunn,
	David S. Miller, Eric Dumazet,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	linux-stm32, Linux Network Development Mailing List, Paolo Abeni
In-Reply-To: <ad-ID2WaPgPJqdsa@shell.armlinux.org.uk>

On Wed, Apr 15, 2026 at 5:44 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Apr 14, 2026 at 07:12:34PM -0700, Sam Edwards wrote:
> > On Tue, Apr 14, 2026 at 6:19 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > > Okay, just a quick note to say that nvidia's 5.10.216-tegra kernel
> > > survives iperf3 -c -R to the imx6.
> >
> > Hi Russell,
> >
> > Aw, you beat me to it! I was about to report that 5.10.104-tegra is
> > unaffected. And my iperf3 server is a multi-GbE amd64 machine.
> >
> > > Dumping the registers and comparing, and then forcing the RQS and TQS
> > > values to 0x23 (+1 = 36, *256 = 9216 bytes) and 0x8f (+1 = 144,
> > > *256 = 36864 ytes) respectively seems to solve the problem. Under
> > > net-next, these both end up being 0xff (+1 = 256, *256 = 65536 bytes.)
> > > Suspiciously, 36 * 4 = 144, and I also see that this kernel programs
> > > all four of the MTL receive operation mode registers, but only the
> > > first MTL transmit operation mode register. However, DMA channels 1-3
> > > aren't initialised.
> >
> > Wow, great! I wonder if the problem is that the MTL FIFOs are smaller
> > than that, so when the DMA suffers a momentary hiccup, the FIFOs are
> > allowed to overflow, putting the hardware in a bad state.
> >
> > Though I suspect this is only half of the problem: do you still see
> > RBUs? Everything you've shared so far suggests the DMA failures are
> > _not_ because the rx ring is drying up.
>
> Yes. Note that RBUs will happen not because of DMA failures, but if
> the kernel fails to keep up with the packet rate. RBU means "we read
> the next descriptor, and it wasn't owned by hardware".

Are you speaking from observation, documentation, or understanding?
I'd define RBU the same way, but you reported:

```
[   55.766199] dwc-eth-dwmac 2490000.ethernet eth0: q0: receive buffer
unavailable: cur_rx=309 dirty_rx=309 last_cur_rx=245
last_cur_rx_post=309 last_dirty_rx=245 count=64 budget=64

cur_rx == dirty_rx _should_ mean that we fully refilled the ring. [...]
[...]
Every ring entry contains the same RDES3 value, so it really is
completely full at the point RBU fires (bit 31 clear means software
owns the descriptor, and it's basically saying first/last segment,
RDES1 valid, buffer 1 length of 1518.
```

It would seem* that the kernel isn't really failing to keep up with
the packet rate. If RBU is firing with a ring that's not even close to
empty, that tells me there's another way for it to fire. So I suspect
the hardware designers implemented it to mean:
"We couldn't read the next descriptor, _or_ it wasn't owned by hardware."

(* However, if bit 31 is clear everywhere, wouldn't that mean the ring
is actually completely depleted, not full? If count==budget, wouldn't
that mean the whole ring hasn't been visited, so we only refilled 64
entries and not necessarily the entire ring? Maybe the kernel isn't
keeping up after all.)

> That has:
>
>         const nveu32_t rx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
>                 { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
>                   FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
>                 { FIFO_SZ(36U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U),
>                   FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(16U) },
>         };
>         const nveu32_t tx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
>                 { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
>                   FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
>                 { FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U),
>                   FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U) },
>         };
>
> where each of those values is the RQS/TQS value to use in KiB:
>
> #define FIFO_SZ(x)              ((((x) * 1024U) / 256U) - 1U)
>
> This doesn't correspond with the values I'm seeing programmed into
> the hardware under the 5.10.216-tegra kernel. I'm seeing TQS = 143
> (36KiB), and RQS = 35 (9KiB). Yes, these values exist in the tables
> above from a quick look, but they're not in the right place!

True, but:
a) I doubt 5.10.216-tegra includes exactly the same version of the
driver found in this random GitHub mirror. (My intent was only to
point out that they don't use 5.10's stmmac; I should have been more
clear that I wasn't trying to link the same version, sorry!)
b) This is vendor code; I don't know how good their testing/review
process is. It might not run the way it looks. The intent seems to be
for RQS > TQS (which makes intuitive sense), but as you're seeing the
registers programmed the other way 'round, they might have gotten them
subtly mixed up.

> Now, as for FIFO sizes, if we sum up all the entries, then we
> get:
>
> SUM(rx_fifo_size[0][]) = 60KiB
> SUM(rx_fifo_size[1][]) = 64KiB
> SUM(tx_fifo_size[0][]) = 60KiB
> SUM(tx_fifo_size[1][]) = 64KiB

I follow the math with 64KiB, but surely the 60KiB should be
9+9+9+9+1+1+1+1=40KiB? This seems to me that the "legacy EQOS" simply
shifts with smaller FIFOs. Since dwmac is licensed as a soft IP core,
perhaps the FIFO size is an elaboration parameter? That would mean
this isn't an issue with dwmac 5.0 broadly, but with Nvidia's specific
instantiation of it.

^ permalink raw reply

* Re: [PATCH nf,v2 1/3] rculist: add list_splice_rcu() for private lists
From: Paul E. McKenney @ 2026-04-15 17:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, kuba, pabeni, edumazet, fw, horms,
	joelagnelf, josh, boqun, urezki, rostedt, mathieu.desnoyers,
	jiangshanlai, qiang.zhang, rcu
In-Reply-To: <20260415170844.41355-1-pablo@netfilter.org>

On Wed, Apr 15, 2026 at 07:08:44PM +0200, Pablo Neira Ayuso wrote:
> This patch adds a helper function, list_splice_rcu(), to safely splice
> a private (non-RCU-protected) list into an RCU-protected list.
> 
> The function ensures that only the pointer visible to RCU readers
> (prev->next) is updated using rcu_assign_pointer(), while the rest of
> the list manipulations are performed with regular assignments, as the
> source list is private and not visible to concurrent RCU readers.
> 
> This is useful for moving elements from a private list into a global
> RCU-protected list, ensuring safe publication for RCU readers.
> Subsystems with some sort of batching mechanism from userspace can
> benefit from this new function.
> 
> The function __list_splice_rcu() has been added for clarity and to
> follow the same pattern as in the existing list_splice*() interfaces,
> where there is a check to ensure that that the list to splice is not
> empty. Note that __list_splice_rcu() has no documentation for this
> reason.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> v2: including comments by Paul McKenney.
> 
>     Except, I have deliberately keep back the suggestion to squash
>     __list_splice_rcu() into list_splice_rcu(), I instead removed
>     the documentation for __list_splice_rcu(). I am looking
>     at other existing list_splice*() function in list.h and rculist.h
>     to get this aligned with __list_splice(), which also has no users
>     in the tree and no documentation. I find it easier to read with
>     __list_splice(), but if this explaination is not sound so...
> 
>     @Paul: I can post v3 squashing __list_splice_rcu(), just let me
>            know.

Removing the comment addresses most of my concerns.  I do have a slight
but not overwhelming preference for the squashed version, but either way:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Or if you want this to go in via RCU, please let us know.  My guess is
that it would be easier for you to take it in with the code using it.

							Thanx, Paul

>     Thanks!
> 
>  include/linux/rculist.h | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 2abba7552605..e3bc44225692 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -261,6 +261,35 @@ static inline void list_replace_rcu(struct list_head *old,
>  	old->prev = LIST_POISON2;
>  }
>  
> +static inline void __list_splice_rcu(struct list_head *list,
> +				     struct list_head *prev,
> +				     struct list_head *next)
> +{
> +	struct list_head *first = list->next;
> +	struct list_head *last = list->prev;
> +
> +	last->next = next;
> +	first->prev = prev;
> +	next->prev = last;
> +	rcu_assign_pointer(list_next_rcu(prev), first);
> +}
> +
> +/**
> + * list_splice_rcu - splice a non-RCU list into an RCU-protected list,
> + *                   designed for stacks.
> + * @list:	the non RCU-protected list to splice
> + * @head:	the place in the existing RCU-protected list to splice
> + *
> + * The list pointed to by @head can be RCU-read traversed concurrently with
> + * this function.
> + */
> +static inline void list_splice_rcu(struct list_head *list,
> +				   struct list_head *head)
> +{
> +	if (!list_empty(list))
> +		__list_splice_rcu(list, head, head->next);
> +}
> +
>  /**
>   * __list_splice_init_rcu - join an RCU-protected list into an existing list.
>   * @list:	the RCU-protected list to splice
> -- 
> 2.47.3
> 
> 

^ permalink raw reply

* Re: [PATCH net v5] net: stmmac: Prevent NULL deref when RX memory exhausted
From: Sam Edwards @ 2026-04-15 17:53 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue, Maxime Chevallier,
	Ovidiu Panait, Vladimir Oltean, Baruch Siach, Serge Semin,
	Giuseppe Cavallaro, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, stable
In-Reply-To: <ad-8q4OrOm-VtGrO@shell.armlinux.org.uk>

On Wed, Apr 15, 2026 at 9:28 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Wed, Apr 15, 2026 at 01:56:32PM +0100, Russell King (Oracle) wrote:
> > Locally, while debugging my issues, I used this to prevent cur_rx
> > catching up with dirty_rx:
> >
> >                 status = stmmac_rx_status(priv, &priv->xstats, p);
> >                 /* check if managed by the DMA otherwise go ahead */
> >                 if (unlikely(status & dma_own))
> >                         break;
> >
> >                 next_entry = STMMAC_NEXT_ENTRY(rx_q->cur_rx,
> >                                                priv->dma_conf.dma_rx_size);
> >                 if (unlikely(next_entry == rx_q->dirty_rx))
> >                         break;
> >
> >                 rx_q->cur_rx = next_entry;
> >
> > If we care about the cost of reloading rx_q->dirty_rx on every
> > iteration, then I'd suggest that the cost we already incur reading and
> > writing rx_q->cur_rx is something that should be addressed, and
> > eliminating that would counter the cost of reading rx_q->dirty_rx. I
> > suspect, however, that the cost is minimal, as cur_tx and dirty_rx are
> > likely in the same cache line.

No, no, I like your approach better. :) It also removes the need for
the `limit` clamp at the top of the function, so later code can assume
limit==budget.

> > It looks like any fix to stmmac_rx() will also need a corresponding
> > fix for stmmac_rx_zc().

I agree that stmmac_rx_zc() is likely also broken (in a similar way,
but not similar enough to permit a "corresponding" fix), but I don't
agree that there's a dependency relationship here. This patch is
addressing #221010, which affects the generic/non-ZC codepath; I'm
afraid the ZC codepath warrants its own investigation.

> I have some further information, but a new curveball has just been
> chucked... and I've no idea what this will mean at this stage. Just
> take it that I won't be responding for a while.

I think I follow your meaning. Good luck getting it straightened out!

^ permalink raw reply

* Re: [PATCH net v3 2/5] bonding: 3ad: fix carrier when no valid slaves
From: Louis Scalbert @ 2026-04-15 17:53 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb
In-Reply-To: <707939.1776099707@famine>

Hello Jay,

Thank you very much for this detailed review.

Le lun. 13 avr. 2026 à 19:01, Jay Vosburgh <jv@jvosburgh.net> a écrit :
>
> Louis Scalbert <louis.scalbert@6wind.com> wrote:
>
> >Apply the "lacp_fallback" configuration from the previous commit.
> >
> >"lacp_fallback" mode "strict" asserts that the bonding master carrier
> >only when at least 'min_links' slaves are in the collecting/distributing
> >state (or collecting only if the coupled_control default behavior is
> >disabled).
> >
> >Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
> >Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
> >---
> > drivers/net/bonding/bond_3ad.c     | 26 ++++++++++++++++++++++++--
> > drivers/net/bonding/bond_options.c |  1 +
> > 2 files changed, 25 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> >index af7f74cfdc08..b79a76296966 100644
> >--- a/drivers/net/bonding/bond_3ad.c
> >+++ b/drivers/net/bonding/bond_3ad.c
> >@@ -745,6 +745,22 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
> >       }
> > }
> >
> >+static int __agg_valid_ports(struct aggregator *agg)
> >+{
> >+      struct port *port;
> >+      int valid = 0;
> >+
> >+      for (port = agg->lag_ports; port;
> >+           port = port->next_port_in_aggregator) {
> >+              if (port->actor_oper_port_state & LACP_STATE_COLLECTING &&
> >+                  (!port->slave->bond->params.coupled_control ||
> >+                   port->actor_oper_port_state & LACP_STATE_DISTRIBUTING))
> >+                      valid++;
>
>         Do we need to test coupled_control?  I.e., can the test be

With coupled_control enabled (default), the actor allows traffic from
the partner only when it reaches both the COLLECTING and DISTRIBUTING
states, i.e., in the AD_MUX_COLLECTING_DISTRIBUTING Mux state.

With coupled_control disabled, the actor allows traffic from the
partner as soon as it reaches the COLLECTING state, regardless of the
DISTRIBUTING flag. In this case, COLLECTING is set in the
AD_MUX_COLLECTING state, while DISTRIBUTING is only set later in the
AD_MUX_DISTRIBUTING state.

From the perspective of upper-layer processes, a carrier up state
indicates that the link is fully operational and capable of both
collecting and distributing traffic. These processes are not aware of
the distinction between COLLECTING and COLLECTING & DISTRIBUTING
states.

>
>                 if ((port->actor_oper_port_state & LACP_STATE_COLLECTING) &&
>                     (port->actor_oper_port_state & LACP_STATE_DISTRIBUTING))
>

If we want to allow collection to start without waiting for
distribution to be enabled, then the carrier must be asserted as soon
as the COLLECTING state is reached.
In that case, this test would not be valid.

In practice, I can only test for LACP_STATE_COLLECTING, because
LACP_STATE_DISTRIBUTING is always set together with
LACP_STATE_COLLECTING.

>         To my reading, ad_mux_machine will set _COLLECTING and
> _DISTRIBUTING appropriately regardless of the coupled_control selection.

I don't agree. See:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/bonding/bond_3ad.c?id=v7.0#n1090
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/bonding/bond_3ad.c?id=v7.0#n1202
>
> >+      }
> >+
> >+      return valid;
> >+}
> >+
> > static int __agg_active_ports(struct aggregator *agg)
> > {
> >       struct port *port;
> >@@ -2120,6 +2136,7 @@ static void ad_enable_collecting_distributing(struct port *port,
> >                         port->actor_port_number,
> >                         port->aggregator->aggregator_identifier);
> >               __enable_port(port);
> >+              bond_3ad_set_carrier(port->slave->bond);
> >               /* Slave array needs update */
> >               *update_slave_arr = true;
> >               /* Should notify peers if possible */
> >@@ -2141,6 +2158,7 @@ static void ad_disable_collecting_distributing(struct port *port,
> >                         port->actor_port_number,
> >                         port->aggregator->aggregator_identifier);
> >               __disable_port(port);
> >+              bond_3ad_set_carrier(port->slave->bond);
> >               /* Slave array needs an update */
> >               *update_slave_arr = true;
> >       }
> >@@ -2819,8 +2837,12 @@ int bond_3ad_set_carrier(struct bonding *bond)
> >       }
> >       active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
> >       if (active) {
> >-              /* are enough slaves available to consider link up? */
> >-              if (__agg_active_ports(active) < bond->params.min_links) {
> >+              /* are enough slaves in collecting (and distributing) state to consider
> >+               * link up?
> >+               */
> >+              if ((bond->params.lacp_fallback ? __agg_valid_ports(active)
> >+                                      : __agg_active_ports(active)) <
> >+                  bond->params.min_links) {
>
>         I think the original comment is better; if the new option is
> off, it doesn't require collecting / distributing state.
>
>         -J
>
> >                       if (netif_carrier_ok(bond->dev)) {
> >                               netif_carrier_off(bond->dev);
> >                               goto out;
> >diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> >index b672b8a881bb..d64a5d2f80b6 100644
> >--- a/drivers/net/bonding/bond_options.c
> >+++ b/drivers/net/bonding/bond_options.c
> >@@ -1706,6 +1706,7 @@ static int bond_option_lacp_fallback_set(struct bonding *bond,
> >       netdev_dbg(bond->dev, "Setting LACP fallback to %s (%llu)\n",
> >                  newval->string, newval->value);
> >       bond->params.lacp_fallback = newval->value;
> >+      bond_3ad_set_carrier(bond);
> >
> >       return 0;
> > }
> >--
> >2.39.2
> >
>
> ---
>         -Jay Vosburgh, jv@jvosburgh.net

^ permalink raw reply

* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-15 17:58 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260415164643.GQ772670@horms.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5012 bytes --]

> On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> > 
> > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > Changes in v2:
> > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> >   order to avoid any possible NULL pointer dereference in
> >   airoha_qdma_cleanup_tx_queue()
> 
> This seems to be a separate issue.
> If so, I think it should be split out into a separate patch.

Sorry, I missed this comment.
Ack. I will repost splitting this in two separated patches.

Regards,
Lorenzo

> 
> > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
> 
> I think it was covered in the review Jakub forwarded for v1.  But FTR,
> Sashiko has some feedback on this patch in the form of an existing bug
> (that should almost certainly be handled separately from this patch).
> 
> > ---
> >  drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> >  1 file changed, 34 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9e995094c32a..3c1a2bc68c42 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> >  	dma_addr_t dma_addr;
> >  
> >  	spin_lock_init(&q->lock);
> > -	q->ndesc = size;
> >  	q->qdma = qdma;
> >  	q->free_thr = 1 + MAX_SKB_FRAGS;
> >  	INIT_LIST_HEAD(&q->tx_list);
> >  
> > -	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> > +	q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> >  				GFP_KERNEL);
> >  	if (!q->entry)
> >  		return -ENOMEM;
> >  
> > -	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> > +	q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> >  				      &dma_addr, GFP_KERNEL);
> >  	if (!q->desc)
> >  		return -ENOMEM;
> >  
> > -	for (i = 0; i < q->ndesc; i++) {
> > +	for (i = 0; i < size; i++) {
> >  		u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
> >  
> >  		list_add_tail(&q->entry[i].list, &q->tx_list);
> >  		WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> >  	}
> > +	q->ndesc = size;
> >  
> >  	/* xmit ring drop default setting */
> >  	airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> > @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> >  
> >  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  {
> > -	struct airoha_eth *eth = q->qdma->eth;
> > -	int i;
> > +	struct airoha_qdma *qdma = q->qdma;
> > +	struct airoha_eth *eth = qdma->eth;
> > +	int i, qid = q - &qdma->q_tx[0];
> > +	struct airoha_queue_entry *e;
> > +	u16 index = 0;
> >  
> >  	spin_lock_bh(&q->lock);
> >  	for (i = 0; i < q->ndesc; i++) {
> > -		struct airoha_queue_entry *e = &q->entry[i];
> 
> super nit: In v2 e is always used within a block (here and in the hunk below).
>            So I would lean towards declaring e in the blocks where it is
> 	   used.
> 
> 	   No need to repost just for this!
> 
> > +		struct airoha_qdma_desc *desc = &q->desc[i];
> >  
> > +		e = &q->entry[i];
> >  		if (!e->dma_addr)
> >  			continue;
> >  
> > @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  		e->dma_addr = 0;
> >  		e->skb = NULL;
> >  		list_add_tail(&e->list, &q->tx_list);
> > +
> > +		/* Reset DMA descriptor */
> > +		WRITE_ONCE(desc->ctrl, 0);
> > +		WRITE_ONCE(desc->addr, 0);
> > +		WRITE_ONCE(desc->data, 0);
> > +		WRITE_ONCE(desc->msg0, 0);
> > +		WRITE_ONCE(desc->msg1, 0);
> > +		WRITE_ONCE(desc->msg2, 0);
> > +
> >  		q->queued--;
> >  	}
> > +
> > +	if (!list_empty(&q->tx_list)) {
> > +		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > +				     list);
> > +		index = e - q->entry;
> > +	}
> > +	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > +	 * empty.
> > +	 */
> > +	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > +			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > +	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > +			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> > +
> >  	spin_unlock_bh(&q->lock);
> >  }
> >  
> > 
> > ---
> > base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> > change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> > 
> > Best regards,
> > -- 
> > Lorenzo Bianconi <lorenzo@kernel.org>
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH net v3 3/5] bonding: 3ad: fix mux port state on oper down
From: Louis Scalbert @ 2026-04-15 18:03 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb
In-Reply-To: <707642.1776098994@famine>

Le lun. 13 avr. 2026 à 18:49, Jay Vosburgh <jv@jvosburgh.net> a écrit :
>
> Louis Scalbert <louis.scalbert@6wind.com> wrote:
>
> >When the bonding interface has carrier down due to the absence of
> >valid slaves and a slave transitions from down to up, the bonding
> >interface briefly goes carrier up, then down again, and finally up
> >once LACP negotiates collecting and distributing on the port.
>
>         Instead of "valid," I would suggest "usable."
>
> >The interface should not transition to carrier up until LACP
> >negotiation is complete.

I agree to take lacp_strict as the new parameter name.
I should write instead:

*With lacp_strict enabled,* the interface should not transition...

>
>         If the new option is off, i.e., does not require successful LACP
> negotiation, it should wait some time before asserting carrier up.  If
> negotiation fails, however, how long should it wait?
>

I am not sure I understand your point.

When lacp_strict is disabled, the existing behavior should remain
unchanged. The purpose of this patch is only to correct the
Collecting/Distributing state when the port is not operational.

This state is not used to determine carrier assertion when lacp_strict
is disabled.

I will double-check that the timing remains unchanged before posting
the next version of the patch.

> >This happens because the actor and partner port states remain in
> >collecting (and distributing) when the port goes down. When the port
> >comes back up, it temporarily remains in this state until LACP
> >renegotiation occurs.
> >
> >Previously this was mostly cosmetic, but since the bonding carrier
> >state now depends on the LACP negotiation state, it causes the
> >interface to flap.
>
>         "now depends" -> "may depend"
>
> >Fix this by unsetting the SELECTED flag when a port goes down so that
> >the mux state machine transitions through ATTACHED and DETACHED,
> >which clears the actor collecting and distributing flags. Do not
> >attempt to set the SELECTED flag if the port is still disabled.
> >
> >Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
> >Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
> >---
> > drivers/net/bonding/bond_3ad.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> >index b79a76296966..3a94fbcbf721 100644
> >--- a/drivers/net/bonding/bond_3ad.c
> >+++ b/drivers/net/bonding/bond_3ad.c
> >@@ -1570,6 +1570,12 @@ static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
> >       struct slave *slave;
> >       int found = 0;
> >
> >+      /* Disabled ports cannot be SELECTED.
> >+       * Do not attempt to set the SELECTED flag if the port is still disabled.
> >+       */
> >+      if (!port->is_enabled)
> >+              return;
> >+
>
>         I think the change is fine, but the comment seems redundant to
> me.
>
>         -J
>
> >       /* if the port is already Selected, do nothing */
> >       if (port->sm_vars & AD_PORT_SELECTED)
> >               return;
> >@@ -2794,6 +2800,7 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
> >               /* link has failed */
> >               port->is_enabled = false;
> >               ad_update_actor_keys(port, true);
> >+              port->sm_vars &= ~AD_PORT_SELECTED;
> >       }
> >       agg = __get_first_agg(port);
> >       ad_agg_selection_logic(agg, &dummy);
> >--
> >2.39.2
> >
>
> ---
>         -Jay Vosburgh, jv@jvosburgh.net

^ permalink raw reply

* Re: [PATCH net v3 4/5] bonding: 3ad: fix stuck negotiation on recovery
From: Louis Scalbert @ 2026-04-15 18:08 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb
In-Reply-To: <710787.1776105588@famine>

Le lun. 13 avr. 2026 à 20:39, Jay Vosburgh <jv@jvosburgh.net> a écrit :
>
> Louis Scalbert <louis.scalbert@6wind.com> wrote:
>
> >The previous commit introduced a side effect caused by clearing the
> >SELECTED flag on disabled ports. After all ports in an aggregator go
> >down, if only a subset of ports comes back up, those ports can no
> >longer renegotiate LACP unless all aggregator ports come back up.
> >
> >1. All aggregator ports go down
> >  - The SELECTED flag is cleared on all of them.
> >2. One port comes back up
> >  - Its SELECTED flag is set again.
> >  - It enters the WAITING state and gets its READY_N flag.
> >  - The remaining ports stay UNSELECTED. Because of that, they cannot
> >  enter the WAITING state and therefore never get READY_N.
>
>         This is the part that I think we may be doing something else
> incorrectly.  If the port is UNSELECTED, then that means that no
> aggregator is currently selected for that port, and therefore it
> shouldn't be assigned to an aggregator with other ports (per
> 802.1AX-2014 6.4.8, "Selected").
>
>         I'm not seeing anything in the 6.4.14 Selection Logic that makes
> me think a port that is down (port_enabled == FALSE) is disallowed from
> being SELECTED.

Yes. Clearing SELECTED when port_enabled == FALSE is incorrect.
That means my previous patch was wrong.

>
>         Looking at the Receive machine state diagram (Figure 6-18), I
> tend to think that in this case the port would transition to
> PORT_DISABLED state, as we're not asserting a BEGIN (reinitialization of
> the LACP protocol entity), so the port variables can remain unchanged.
> There's even some language that suggests this is intentional:
>
>         "If the Aggregation Port becomes inoperable and the BEGIN
>         variable is not asserted, the state machine enters the
>         PORT_DISABLED state. [...] This state allows the current
>         Selection state to remain undisturbed, so that, in the event
>         that the Aggregation Port is still connected to the same Partner
>         and Partner Aggregation Port when it becomes operable again,
>         there will be no disturbance caused to higher layers by
>         unnecessary re-configuration.
>
>         So, perhaps the actual bug is that these ports are attached to
> the aggregator but not SELECTED.

Yes. According to the Mux state machine diagram, the correct behavior
when port_enabled == FALSE is to transition to the WAITING state,
except when already in the DETACHED state.

The WAITING state should then be responsible for clearing the
Synchronization, Collecting, and Distributing flags.

I will address this in the previous patch, so this one will no longer
be necessary.

best regards,

Louis Scalbert


>
>         -J
>
>
> >  - __agg_ports_are_ready() returns 0 because it finds a port without
> >  READY_N.
> >  - As a result, __set_agg_ports_ready() keeps the READY flag cleared on
> >  all ports.
> >  - The port that came back up is therefore not marked READY and cannot
> >  transition to ATTACHED.
> >  - LACP negotiation becomes stuck, and the port cannot be used.
> >3. All aggregator ports come back up
> >  - They all regain SELECTED and READY_N.
> >  - __agg_ports_are_ready() now returns 1.
> >  - __set_agg_ports_ready() sets READY on all ports.
> >  - They can then transition to ATTACHED.
> >  - Negotiation resumes and the aggregator becomes operational again.
> >
> >Consider only ports currently in the WAITING mux state for READY_N in
> >order to avoid __agg_ports_are_ready() to return 0 because of a disabled
> >port. That matches 802.3ad, which states: "The Selection Logic asserts
> >Ready TRUE when the values of Ready_N for all ports that are waiting to
> >attach to a given Aggregator are TRUE.".
> >
> >Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
> >Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
> >---
> > drivers/net/bonding/bond_3ad.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> >index 3a94fbcbf721..3f56d892b101 100644
> >--- a/drivers/net/bonding/bond_3ad.c
> >+++ b/drivers/net/bonding/bond_3ad.c
> >@@ -700,7 +700,8 @@ static void __update_ntt(struct lacpdu *lacpdu, struct port *port)
> > }
> >
> > /**
> >- * __agg_ports_are_ready - check if all ports in an aggregator are ready
> >+ * __agg_ports_are_ready - check if all ports in an aggregator that are in
> >+ * the WAITING state are ready
> >  * @aggregator: the aggregator we're looking at
> >  *
> >  */
> >@@ -716,6 +717,8 @@ static int __agg_ports_are_ready(struct aggregator *aggregator)
> >               for (port = aggregator->lag_ports;
> >                    port;
> >                    port = port->next_port_in_aggregator) {
> >+                      if (port->sm_mux_state != AD_MUX_WAITING)
> >+                              continue;
> >                       if (!(port->sm_vars & AD_PORT_READY_N)) {
> >                               retval = 0;
> >                               break;
> >--
> >2.39.2
> >
>
> ---
>         -Jay Vosburgh, jv@jvosburgh.net

^ permalink raw reply

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
From: Simon Schippers @ 2026-04-15 18:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization
In-Reply-To: <b9d84d88-46d5-4fd3-a5b2-d914f54766f6@tu-dortmund.de>

I'd appreciate your feedback when you have time. 

Thanks!


^ permalink raw reply

* [RFC PATCH net-next v2 0/2] udp: fix FOU/GUE over multicast
From: Anton Danilov @ 2026-04-15 18:48 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, davem, dsahern, edumazet, kuba, pabeni,
	horms, shuah, linux-kselftest

UDP encapsulation (FOU, GUE) has never worked correctly with multicast
destination addresses. When a FOU-encapsulated packet arrives at a
multicast address, it enters __udp4_lib_mcast_deliver() which calls
consume_skb() on packets that need resubmission to the inner protocol
handler, silently dropping them instead.

The unicast delivery path and the GSO segmentation path both handle
this correctly, but the multicast path was never updated to support
UDP encapsulation resubmit.

This causes silent packet loss for FOU/GRETAP tunnels configured with
multicast remote addresses. The loss ratio depends on the early demux
cache hit rate - packets that hit early demux bypass the multicast path
and work correctly, masking the issue.

Reproducing the issue:

  ip netns add ns_a && ip netns add ns_b
  ip link add veth0 type veth peer name veth1
  ip link set veth0 netns ns_a && ip link set veth1 netns ns_b

  ip -n ns_a addr add 10.0.0.1/24 dev veth0 && ip -n ns_a link set veth0 up
  ip -n ns_b addr add 10.0.0.2/24 dev veth1 && ip -n ns_b link set veth1 up

  # Multicast routes
  ip -n ns_a route add 239.0.0.0/8 dev veth0
  ip -n ns_b route add 239.0.0.0/8 dev veth1

  # Disable early demux to expose the issue (otherwise it's partially masked)
  ip netns exec ns_b sysctl -w net.ipv4.ip_early_demux=0

  # Join multicast group on receiver
  ip -n ns_b addr add 239.0.0.1/32 dev veth1 autojoin

  # Sender: FOU + GRETAP with multicast remote
  ip netns exec ns_a ip fou add port 4797 ipproto 47
  ip -n ns_a link add eoudp0 type gretap \
      remote 239.0.0.1 local 10.0.0.1 \
      encap fou encap-sport 4797 encap-dport 4797 key 239.0.0.1
  ip -n ns_a link set eoudp0 up
  ip -n ns_a addr add 192.168.99.1/24 dev eoudp0

  # Receiver: FOU + GRETAP with multicast remote
  ip netns exec ns_b ip fou add port 4797 ipproto 47
  ip -n ns_b link add eoudp0 type gretap \
      remote 239.0.0.1 local 10.0.0.2 \
      encap fou encap-sport 4797 encap-dport 4797 key 239.0.0.1
  ip -n ns_b link set eoudp0 up
  ip -n ns_b addr add 192.168.99.2/24 dev eoudp0

  # Test: ping through the FOU/GRETAP tunnel
  ip netns exec ns_a ping -c 100 192.168.99.2
  # -> without this patch: 0 packets received on eoudp0
  # -> with this patch: all packets received on eoudp0

AI assistance (Claude, claude-opus-4-6) was used during root cause
analysis of the kernel source code (tracing the call chain from
udp_queue_rcv_skb through encap_rcv to ip_protocol_deliver_rcu,
comparing unicast/GSO/multicast paths) and during patch and selftest
authoring. The fix approach was identified by observing that the
unicast path (udp_unicast_rcv_skb) and the GSO path
(udp_queue_rcv_skb) both already handle encap resubmit correctly,
while the multicast path did not.

Changes since v1 (RFC):
  - Moved inline Python packet generator from the shell script into a
    separate helper (fou_mcast_encap.py) and added it to TEST_FILES
    in the Makefile

Anton Danilov (2):
  udp: fix encapsulation packet resubmit in multicast deliver
  selftests: net: add FOU multicast encapsulation resubmit test

 net/ipv4/udp.c                                |  13 ++-
 net/ipv6/udp.c                                |  13 ++-
 tools/testing/selftests/net/Makefile          |   2 +
 .../testing/selftests/net/fou_mcast_encap.py  |  81 ++++++++++++++
 .../testing/selftests/net/fou_mcast_encap.sh  | 105 ++++++++++++++++++
 5 files changed, 206 insertions(+), 8 deletions(-)
 create mode 100755 tools/testing/selftests/net/fou_mcast_encap.py
 create mode 100755 tools/testing/selftests/net/fou_mcast_encap.sh

--
2.47.3

^ permalink raw reply

* [PATCH v1 net] af_unix: Drop all SCM attributes for SOCKMAP.
From: Kuniyuki Iwashima @ 2026-04-15 18:48 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Cong Wang, Jiang Wang, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, Xingyu Jin

SOCKMAP can hide inflight fd from AF_UNIX GC.

When a socket in SOCKMAP receives skb with inflight fd,
sk_psock_verdict_data_ready() looks up the mapped socket and
enqueue skb to its psock->ingress_skb.

Since neither the old nor the new GC can inspect the psock
queue, the hidden skb leaks the inflight sockets.  Note that
this cannot be detected via kmemleak because inflight sockets
are linked to a global list.

In addition, SOCKMAP redirect breaks the Tarjan-based GC's
assumption that unix_edge.successor is always alive, which
is no longer true once skb is redirected, resulting in
use-after-free below. [0]

Moreover, SOCKMAP does not call scm_stat_del() properly,
so unix_show_fdinfo() could report an incorrect fd count.

sk_msg_recvmsg() does not support any SCM attributes in the
first place.

Let's drop all SCM attributes before passing skb to the
SOCKMAP layer.

[0]:
BUG: KASAN: slab-use-after-free in unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
Read of size 8 at addr ffff888125362670 by task kworker/56:1/496

CPU: 56 UID: 0 PID: 496 Comm: kworker/56:1 Not tainted 7.0.0-rc7-00263-gb9d8b856689d #3 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
Workqueue: events sk_psock_backlog
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:379)
 kasan_report (mm/kasan/report.c:597)
 unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
 unix_destroy_fpl (net/unix/garbage.c:317)
 unix_destruct_scm (./include/net/scm.h:80 ./include/net/scm.h:86 net/unix/af_unix.c:1976)
 sk_psock_backlog (./include/linux/skbuff.h:?)
 process_scheduled_works (kernel/workqueue.c:?)
 worker_thread (kernel/workqueue.c:?)
 kthread (kernel/kthread.c:438)
 ret_from_fork (arch/x86/kernel/process.c:164)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
 </TASK>

Allocated by task 955:
 kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
 __kasan_slab_alloc (mm/kasan/common.c:369)
 kmem_cache_alloc_noprof (mm/slub.c:4539)
 sk_prot_alloc (net/core/sock.c:2240)
 sk_alloc (net/core/sock.c:2301)
 unix_create1 (net/unix/af_unix.c:1099)
 unix_create (net/unix/af_unix.c:1169)
 __sock_create (net/socket.c:1606)
 __sys_socketpair (net/socket.c:1811)
 __x64_sys_socketpair (net/socket.c:1863 net/socket.c:1860 net/socket.c:1860)
 do_syscall_64 (arch/x86/entry/syscall_64.c:?)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 496:
 kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
 kasan_save_free_info (mm/kasan/generic.c:587)
 __kasan_slab_free (mm/kasan/common.c:287)
 kmem_cache_free (mm/slub.c:6165)
 __sk_destruct (net/core/sock.c:2282 net/core/sock.c:2384)
 sk_psock_destroy (./include/net/sock.h:?)
 process_scheduled_works (kernel/workqueue.c:?)
 worker_thread (kernel/workqueue.c:?)
 kthread (kernel/kthread.c:438)
 ret_from_fork (arch/x86/kernel/process.c:164)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:258)

Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 77462de14a43 ("af_unix: Add read_sock for stream socket types")
Reported-by: Xingyu Jin <xingyuj@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/unix/af_unix.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b23c33df8b46..91a03c8a4281 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1964,16 +1964,19 @@ static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
 
 static void unix_destruct_scm(struct sk_buff *skb)
 {
-	struct scm_cookie scm;
+	struct scm_cookie scm = {};
+
+	swap(scm.pid, UNIXCB(skb).pid);
 
-	memset(&scm, 0, sizeof(scm));
-	scm.pid = UNIXCB(skb).pid;
 	if (UNIXCB(skb).fp)
 		unix_detach_fds(&scm, skb);
 
-	/* Alas, it calls VFS */
-	/* So fscking what? fput() had been SMP-safe since the last Summer */
 	scm_destroy(&scm);
+}
+
+static void unix_wfree(struct sk_buff *skb)
+{
+	unix_destruct_scm(skb);
 	sock_wfree(skb);
 }
 
@@ -1989,7 +1992,7 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 	if (scm->fp && send_fds)
 		err = unix_attach_fds(scm, skb);
 
-	skb->destructor = unix_destruct_scm;
+	skb->destructor = unix_wfree;
 	return err;
 }
 
@@ -2066,6 +2069,13 @@ static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
+static void unix_orphan_scm(struct sock *sk, struct sk_buff *skb)
+{
+	scm_stat_del(sk, skb);
+	unix_destruct_scm(skb);
+	skb->destructor = sock_wfree;
+}
+
 /*
  *	Send AF_UNIX data.
  */
@@ -2679,10 +2689,16 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 	int err;
 
 	mutex_lock(&u->iolock);
+
 	skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
-	mutex_unlock(&u->iolock);
-	if (!skb)
+	if (!skb) {
+		mutex_unlock(&u->iolock);
 		return err;
+	}
+
+	unix_orphan_scm(sk, skb);
+
+	mutex_unlock(&u->iolock);
 
 	return recv_actor(sk, skb);
 }
@@ -2882,6 +2898,9 @@ static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 #endif
 
 	spin_unlock(&queue->lock);
+
+	unix_orphan_scm(sk, skb);
+
 	mutex_unlock(&u->iolock);
 
 	return recv_actor(sk, skb);
-- 
2.54.0.rc1.513.gad8abe7a5a-goog


^ permalink raw reply related

* [RFC PATCH net-next v2 1/2] udp: fix encapsulation packet resubmit in multicast deliver
From: Anton Danilov @ 2026-04-15 18:50 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, davem, dsahern, edumazet, kuba, pabeni,
	horms, shuah, linux-kselftest
In-Reply-To: <ad_dal164gVmImWl@dau-home-pc>

When a UDP encapsulation socket (e.g., FOU) receives a multicast
packet, __udp4_lib_mcast_deliver() and __udp6_lib_mcast_deliver()
incorrectly call consume_skb() when udp_queue_rcv_skb() returns a
positive value. A positive return value from udp_queue_rcv_skb()
indicates that the encap_rcv handler (e.g., fou_udp_recv) has
consumed the UDP header and wants the packet to be resubmitted to
the IP protocol handler for further processing (e.g., as a GRE
packet).

The unicast path in udp_unicast_rcv_skb() handles this correctly by
returning -ret, which propagates up to ip_protocol_deliver_rcu() for
resubmission. The GSO path in udp_queue_rcv_skb() also handles this
correctly by calling ip_protocol_deliver_rcu() directly. However, the
multicast path destroys the packet via consume_skb() instead of
resubmitting it, causing silent packet loss.

This affects any UDP encapsulation (FOU, GUE) combined with multicast
destination addresses. In practice, it causes ~50% packet loss on
FOU/GRETAP tunnels configured with multicast remote addresses, with
the exact ratio depending on the early demux cache hit rate (packets
that hit early demux take the unicast path and are handled correctly).

Fix this by calling ip_protocol_deliver_rcu() (IPv4) or
ip6_protocol_deliver_rcu() (IPv6) instead of consume_skb() when the
return value is positive, matching the behavior of the GSO path.

Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Assisted-by: Claude:claude-opus-4-6
---
 net/ipv4/udp.c | 13 +++++++++----
 net/ipv6/udp.c | 13 +++++++++----
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e9e2ce9522ef..8c2d4367cba2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2467,6 +2467,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	struct udp_hslot *hslot;
 	struct sk_buff *nskb;
 	bool use_hash2;
+	int ret;

 	hash2_any = 0;
 	hash2 = 0;
@@ -2500,8 +2501,10 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			__UDP_INC_STATS(net, UDP_MIB_INERRORS);
 			continue;
 		}
-		if (udp_queue_rcv_skb(sk, nskb) > 0)
-			consume_skb(nskb);
+		ret = udp_queue_rcv_skb(sk, nskb);
+		if (ret > 0)
+			ip_protocol_deliver_rcu(dev_net(nskb->dev), nskb,
+						ret);
 	}

 	/* Also lookup *:port if we are using hash2 and haven't done so yet. */
@@ -2511,8 +2514,10 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	}

 	if (first) {
-		if (udp_queue_rcv_skb(first, skb) > 0)
-			consume_skb(skb);
+		ret = udp_queue_rcv_skb(first, skb);
+		if (ret > 0)
+			ip_protocol_deliver_rcu(dev_net(skb->dev), skb,
+						ret);
 	} else {
 		kfree_skb(skb);
 		__UDP_INC_STATS(net, UDP_MIB_IGNOREDMULTI);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 15e032194ecc..f74935d9f7d7 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -949,6 +949,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	struct udp_hslot *hslot;
 	struct sk_buff *nskb;
 	bool use_hash2;
+	int ret;

 	hash2_any = 0;
 	hash2 = 0;
@@ -987,8 +988,10 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			continue;
 		}

-		if (udpv6_queue_rcv_skb(sk, nskb) > 0)
-			consume_skb(nskb);
+		ret = udpv6_queue_rcv_skb(sk, nskb);
+		if (ret > 0)
+			ip6_protocol_deliver_rcu(dev_net(nskb->dev), nskb,
+						 ret, true);
 	}

 	/* Also lookup *:port if we are using hash2 and haven't done so yet. */
@@ -998,8 +1001,10 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	}

 	if (first) {
-		if (udpv6_queue_rcv_skb(first, skb) > 0)
-			consume_skb(skb);
+		ret = udpv6_queue_rcv_skb(first, skb);
+		if (ret > 0)
+			ip6_protocol_deliver_rcu(dev_net(skb->dev), skb,
+						 ret, true);
 	} else {
 		kfree_skb(skb);
 		__UDP6_INC_STATS(net, UDP_MIB_IGNOREDMULTI);
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH net-next v2 2/2] selftests: net: add FOU multicast encapsulation resubmit test
From: Anton Danilov @ 2026-04-15 18:52 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, davem, dsahern, edumazet, kuba, pabeni,
	horms, shuah, linux-kselftest
In-Reply-To: <ad_dal164gVmImWl@dau-home-pc>

Add a selftest to verify that FOU-encapsulated packets addressed to a
multicast destination are correctly resubmitted to the inner protocol
handler (GRE) via the UDP multicast delivery path.

The test creates two network namespaces connected by a veth pair. The
receiver namespace has a FOU/GRETAP tunnel with a multicast remote
address (239.0.0.1). A Python helper script (fou_mcast_encap.py)
crafts GRE-over-UDP packets and sends them to the multicast address.

The early demux optimization (net.ipv4.ip_early_demux) is disabled on
the receiver to force packets through __udp4_lib_mcast_deliver(),
which is the code path that was previously broken.

Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Assisted-by: Claude:claude-opus-4-6
---
 tools/testing/selftests/net/Makefile          |   2 +
 .../testing/selftests/net/fou_mcast_encap.py  |  81 ++++++++++++++
 .../testing/selftests/net/fou_mcast_encap.sh  | 105 ++++++++++++++++++
 3 files changed, 188 insertions(+)
 create mode 100755 tools/testing/selftests/net/fou_mcast_encap.py
 create mode 100755 tools/testing/selftests/net/fou_mcast_encap.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index a275ed584026..2b463dda5ac6 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -38,6 +38,7 @@ TEST_PROGS := \
 	fib_rule_tests.sh \
 	fib_tests.sh \
 	fin_ack_lat.sh \
+	fou_mcast_encap.sh \
 	fq_band_pktlimit.sh \
 	gre_gso.sh \
 	gre_ipv6_lladdr.sh \
@@ -195,6 +196,7 @@ TEST_GEN_PROGS := \
 
 TEST_FILES := \
 	fcnal-test.sh \
+	fou_mcast_encap.py \
 	in_netns.sh \
 	lib.sh \
 	settings \
diff --git a/tools/testing/selftests/net/fou_mcast_encap.py b/tools/testing/selftests/net/fou_mcast_encap.py
new file mode 100755
index 000000000000..0fd836c454a8
--- /dev/null
+++ b/tools/testing/selftests/net/fou_mcast_encap.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""Send FOU/GRE encapsulated packets to a multicast destination.
+
+Build and send GRE-over-UDP (FOU) packets with a multicast outer
+destination IP. Used by fou_mcast_encap.sh to test that the UDP
+multicast delivery path correctly resubmits encapsulated packets
+to the inner protocol handler.
+"""
+
+import argparse
+import socket
+import struct
+
+
+def checksum(data):
+    """Compute Internet checksum (RFC 1071)."""
+    csum = 0
+    for i in range(0, len(data) - 1, 2):
+        csum += (data[i] << 8) + data[i + 1]
+    if len(data) % 2:
+        csum += data[-1] << 8
+    while csum >> 16:
+        csum = (csum & 0xFFFF) + (csum >> 16)
+    return ~csum & 0xFFFF
+
+
+def build_gre_encap_packet(dst_addr):
+    """Build a GRE/Ethernet/IP/ICMP payload for FOU encapsulation."""
+    gre_key = socket.inet_aton(dst_addr)
+    gre_hdr = struct.pack("!HH", 0x2000, 0x6558) + gre_key
+
+    dst_mac = b"\xff\xff\xff\xff\xff\xff"
+    src_mac = b"\x02\x00\x00\x00\x00\x01"
+    eth_hdr = dst_mac + src_mac + struct.pack("!H", 0x0800)
+
+    inner_ip_src = socket.inet_aton("192.168.99.1")
+    inner_ip_dst = socket.inet_aton("192.168.99.2")
+
+    icmp_payload = b"TESTFOU!" * 4
+    icmp_hdr = struct.pack("!BBHHH", 8, 0, 0, 0x1234, 1) + icmp_payload
+    icmp_csum = checksum(icmp_hdr)
+    icmp_hdr = struct.pack("!BBHHH", 8, 0, icmp_csum, 0x1234, 1) + icmp_payload
+
+    ip_len = 20 + len(icmp_hdr)
+    ip_hdr = struct.pack("!BBHHHBBH", 0x45, 0, ip_len, 0x1234, 0, 64, 1, 0)
+    ip_hdr += inner_ip_src + inner_ip_dst
+    ip_csum = checksum(ip_hdr)
+    ip_hdr = ip_hdr[:10] + struct.pack("!H", ip_csum) + ip_hdr[12:]
+
+    return gre_hdr + eth_hdr + ip_hdr + icmp_hdr
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Send FOU/GRE encapsulated packets to a multicast address"
+    )
+    parser.add_argument(
+        "-c", "--count", type=int, required=True,
+        help="number of packets to send",
+    )
+    parser.add_argument(
+        "-d", "--dst", default="239.0.0.1",
+        help="destination multicast address (default: 239.0.0.1)",
+    )
+    parser.add_argument(
+        "-p", "--port", type=int, default=4797,
+        help="destination UDP port (default: 4797)",
+    )
+    args = parser.parse_args()
+
+    payload = build_gre_encap_packet(args.dst)
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+    for _ in range(args.count):
+        sock.sendto(payload, (args.dst, args.port))
+    sock.close()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tools/testing/selftests/net/fou_mcast_encap.sh b/tools/testing/selftests/net/fou_mcast_encap.sh
new file mode 100755
index 000000000000..1f9dbe7092f3
--- /dev/null
+++ b/tools/testing/selftests/net/fou_mcast_encap.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test that UDP encapsulation (FOU) correctly handles packet resubmit
+# when packets are delivered via the multicast UDP delivery path.
+#
+# When a FOU-encapsulated packet arrives with a multicast destination IP,
+# __udp4_lib_mcast_deliver() must resubmit it to the inner protocol
+# handler (e.g., GRE) rather than consuming it. This test verifies that
+# by creating a FOU/GRETAP tunnel with a multicast remote address, sending
+# encapsulated packets, and checking that they are correctly decapsulated.
+#
+# The early demux optimization can mask this issue by routing packets via
+# the unicast path (udp_unicast_rcv_skb), so we disable it to force
+# packets through __udp4_lib_mcast_deliver().
+
+source lib.sh
+
+NSENDER=""
+NRECV=""
+
+cleanup() {
+	cleanup_all_ns
+}
+
+trap cleanup EXIT
+
+setup() {
+	setup_ns NSENDER NRECV
+
+	ip link add veth_s type veth peer name veth_r
+	ip link set veth_s netns "$NSENDER"
+	ip link set veth_r netns "$NRECV"
+
+	ip -n "$NSENDER" addr add 10.0.0.1/24 dev veth_s
+	ip -n "$NSENDER" link set veth_s up
+
+	ip -n "$NRECV" addr add 10.0.0.2/24 dev veth_r
+	ip -n "$NRECV" link set veth_r up
+
+	# Disable early demux to force multicast delivery path
+	ip netns exec "$NRECV" sysctl -wq net.ipv4.ip_early_demux=0
+
+	# Join multicast group on receiver
+	ip -n "$NRECV" addr add 239.0.0.1/32 dev veth_r autojoin
+
+	# Multicast routes
+	ip -n "$NRECV" route add 239.0.0.0/8 dev veth_r
+	ip -n "$NSENDER" route add 239.0.0.0/8 dev veth_s
+
+	# FOU listener
+	ip netns exec "$NRECV" ip fou add port 4797 ipproto 47
+
+	# GRETAP with multicast remote - this triggers __udp4_lib_mcast_deliver
+	ip -n "$NRECV" link add eoudp0 type gretap \
+		remote 239.0.0.1 local 10.0.0.2 \
+		encap fou encap-sport 4797 encap-dport 4797 \
+		key 239.0.0.1
+	ip -n "$NRECV" link set eoudp0 up
+	ip -n "$NRECV" addr add 192.168.99.2/24 dev eoudp0
+}
+
+send_fou_gre_packets() {
+	local count=$1
+	local script_dir
+	script_dir=$(dirname "$(readlink -f "$0")")
+
+	ip netns exec "$NSENDER" python3 "$script_dir/fou_mcast_encap.py" \
+		-c "$count" -d 239.0.0.1 -p 4797
+}
+
+get_rx_packets() {
+	ip -n "$NRECV" -s link show eoudp0 | awk '/RX:/{getline; print $2}'
+}
+
+test_fou_mcast_encap() {
+	local count=100
+	local rx_before
+	local rx_after
+	local rx_delta
+
+	rx_before=$(get_rx_packets)
+	send_fou_gre_packets $count
+	sleep 1
+	rx_after=$(get_rx_packets)
+
+	rx_delta=$((rx_after - rx_before))
+
+	if [ "$rx_delta" -ge "$count" ]; then
+		echo "PASS: received $rx_delta/$count packets via multicast FOU/GRETAP"
+		return "$ksft_pass"
+	elif [ "$rx_delta" -gt 0 ]; then
+		echo "FAIL: only $rx_delta/$count packets received (partial delivery)"
+		return "$ksft_fail"
+	else
+		echo "FAIL: 0/$count packets received (multicast encap resubmit broken)"
+		return "$ksft_fail"
+	fi
+}
+
+echo "TEST: FOU/GRETAP multicast encapsulation resubmit"
+
+setup
+test_fou_mcast_encap
+exit $?
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: Martin KaFai Lau @ 2026-04-15 18:55 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	netdev, linux-doc, linux-kernel
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, Apr 14, 2026 at 06:57:00PM +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
> 
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
> 
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
> 	-> tcp_push_pending_frames()
> 	  -> tcp_current_mss()
> 		-> tcp_established_options()
> 		  -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
> 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> 
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/

Thanks for the report and fixes suggested across different threads.

Using has_current_bpf_ctx() to avoid tcp_push_pending_frames() should
work but it may change the expectation for bpf_setsockopt(TCP_NODELAY).
e.g. A bpf_tcp_iter does bpf_setsockopt(TCP_NODELAY).

Adding another bit in the tcp_sock is not ideal either. I agree with
Alexei that it is better to reuse the existing bit if we go down this path.
We also need to audit more closely if there are cases that two different
type of bpf progs may call bpf_setsockopt(). e.g.
bpf_tcp_iter does bpf_setsockopt(TCP_CONGESTION) to switch
to a bpf_tcp_cc and the new bpf_tcp_cc->init() will also do
bpf_setsockopt(xxx) which then will be rejected.

Another fix could be, the bpf_setsockopt(TCP_NODELAY) is always broken
for BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB unless
the bpf prog is doing some maneuver to avoid the recursion. Thus,
this use case is basically broken as is and I don't see a use case
for bpf_setsockopt(TCP_NODELAY) when writing header also.
How about checking the bpf_sock->op, level, and optname in
bpf_sock_ops_setsockopt() and return -EOPNOTSUPP?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox