* Re: [PATCH net v4] ipvs: fix MTU check for GSO packets in tunnel mode
From: Julian Anastasov @ 2026-04-15 15:35 UTC
To: Yingnan Zhang
Cc: pablo, coreteam, davem, edumazet, fw, horms, kuba, linux-kernel,
lvs-devel, netdev, netfilter-devel, pabeni, phil
In-Reply-To: <tencent_7F7B107ECA750C095D05C19C3B723AFFA60A@qq.com>
Hello,
On Wed, 15 Apr 2026, Yingnan Zhang wrote:
> Currently, IPVS skips MTU checks for GSO packets by excluding them with
> the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
> mode encapsulates GSO packets with IPIP headers.
>
> The issue manifests in two ways:
>
> 1. MTU violation after encapsulation:
> When a GSO packet passes through IPVS tunnel mode, the original MTU
> check is bypassed. After adding the IPIP tunnel header, the packet
> size may exceed the outgoing interface MTU, leading to unexpected
> fragmentation at the IP layer.
>
> 2. Fragmentation with problematic IP IDs:
> When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
> is fragmented after encapsulation, each segment gets a sequentially
> incremented IP ID (0, 1, 2, ...). This happens because:
>
> a) The GSO packet bypasses MTU check and gets encapsulated
> b) At __ip_finish_output, the oversized GSO packet is split into
> separate SKBs (one per segment), with IP IDs incrementing
> c) Each SKB is then fragmented again based on the actual MTU
>
> This sequential IP ID allocation differs from the expected behavior
> and can cause issues with fragment reassembly and packet tracking.
>
> Fix this by properly validating GSO packets using
> skb_gso_validate_network_len(). This function correctly validates
> whether the GSO segments will fit within the MTU after segmentation. If
> validation fails, send an ICMP Fragmentation Needed message to enable
> proper PMTU discovery.
>
> Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
> Signed-off-by: Yingnan Zhang <342144303@qq.com>
Looks good to me for the nf tree, thanks!
Acked-by: Julian Anastasov <ja@ssi.bg>
> ---
> v4:
> - Introduce a new helper function ip_vs_exceeds_mtu() to improve readability (reviewer feedback)
>
> v3: https://lore.kernel.org/netdev/tencent_73010FBD5FA1C05C3BC23A07A50B11CEC90A@qq.com/
> v2: https://lore.kernel.org/netdev/tencent_CA2C1C219C99D315086BE55E8654AF7E6009@qq.com/
> v1: https://lore.kernel.org/netdev/tencent_4A3E1C339C75D359093BE4F08648AFAA6009@qq.com/
> ---
> ---
> net/netfilter/ipvs/ip_vs_xmit.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 0fb5162992e5..64dfdf8b00c4 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -102,6 +102,18 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
> return dest_dst;
> }
>
> +/* Based on ip_exceeds_mtu(). */
> +static bool ip_vs_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
> +{
> + if (skb->len <= mtu)
> + return false;
> +
> + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> + return false;
> +
> + return true;
> +}
> +
> static inline bool
> __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> {
> @@ -112,7 +124,7 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> if (IP6CB(skb)->frag_max_size > mtu)
> return true; /* largest fragment violate MTU */
> }
> - else if (skb->len > mtu && !skb_is_gso(skb)) {
> + else if (ip_vs_exceeds_mtu(skb, mtu)) {
> return true; /* Packet size violate MTU size */
> }
> return false;
> @@ -232,7 +244,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> return true;
>
> if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> - skb->len > mtu && !skb_is_gso(skb) &&
> + ip_vs_exceeds_mtu(skb, mtu) &&
> !ip_vs_iph_icmp(ipvsh))) {
> icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
> htonl(mtu));
> --
> 2.51.0.windows.1
Regards
--
Julian Anastasov <ja@ssi.bg>
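
The decision made by the new helper can be sketched outside the kernel as follows. This is a simplified userspace model, not the kernel code: `struct pkt` and the `pkt_*` names are hypothetical stand-ins for `struct sk_buff` and its accessors, and `pkt_gso_segs_fit()` assumes every segment is exactly the per-segment header length plus `gso_size`, which only approximates what skb_gso_validate_network_len() computes.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few sk_buff fields the check needs. */
struct pkt {
	unsigned int len;	/* total length, network header onwards */
	unsigned int nhdr_len;	/* network + transport header bytes per segment */
	unsigned int gso_size;	/* payload bytes per segment; 0 means not GSO */
};

static bool pkt_is_gso(const struct pkt *p)
{
	return p->gso_size != 0;
}

/* Assumption: a segment is headers plus gso_size bytes of payload. */
static bool pkt_gso_segs_fit(const struct pkt *p, unsigned int mtu)
{
	return p->nhdr_len + p->gso_size <= mtu;
}

/* Same shape as the ip_vs_exceeds_mtu() helper in the patch. */
static bool pkt_exceeds_mtu(const struct pkt *p, unsigned int mtu)
{
	if (p->len <= mtu)
		return false;

	/* An oversized GSO packet is fine if each segment will fit. */
	if (pkt_is_gso(p) && pkt_gso_segs_fit(p, mtu))
		return false;

	return true;
}
```

In this model a GSO packet with 40 header bytes and a gso_size of 1460 fits a 1500-byte MTU, but no longer fits once the usable MTU drops to 1480, for example after reserving 20 bytes for an outer IPv4 header. That is the case in which the patch sends ICMP Fragmentation Needed instead of letting the packet through.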
* Re: [PATCH net v4 2/3] vsock/test: fix MSG_PEEK handling in recv_buf()
From: Stefano Garzarella @ 2026-04-15 15:40 UTC
To: Luigi Leonardi
Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
netdev, linux-kernel
In-Reply-To: <20260415-fix_peek-v4-2-8207e872759e@redhat.com>
On Wed, Apr 15, 2026 at 05:09:29PM +0200, Luigi Leonardi wrote:
>`recv_buf` does not handle the MSG_PEEK flag correctly: it keeps calling
>`recv` until all requested bytes are available or an error occurs.
>
>The problem is how it calculates the number of bytes read: MSG_PEEK
>doesn't consume any bytes and will re-read the same bytes from the buffer
>head, so summing the return value every time is wrong.
>
>Moreover, MSG_PEEK doesn't consume the bytes in the buffer, so if more
>bytes are requested than are available, the loop will never terminate,
>because `recv` will never return EOF. For this reason, we need to compare
>the number of bytes read with the number of bytes expected.
>
>Add a check: if the MSG_PEEK flag is present, update the byte counter and
>break out of the loop only after at least the expected number of bytes
>have been received; otherwise, retry after a short delay to avoid
>consuming too many CPU cycles.
>
>This allows us to simplify the `test_stream_credit_update_test` by
>reusing `recv_buf`, like some other tests already do.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
>---
> tools/testing/vsock/util.c | 15 +++++++++++++++
> tools/testing/vsock/vsock_test.c | 13 +------------
> 2 files changed, 16 insertions(+), 12 deletions(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
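
The MSG_PEEK handling described in the commit message can be sketched as below. This is a minimal illustration, not the code from tools/testing/vsock/util.c: `peek_exact()` is a hypothetical name, and the `max_tries` bound and the 1 ms delay are assumptions added so that the sketch always terminates.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Peek at exactly len bytes without consuming them.
 * Returns len on success, or -1 on error, EOF, or when max_tries
 * attempts did not see enough data.
 */
static ssize_t peek_exact(int fd, void *buf, size_t len, int max_tries)
{
	while (max_tries-- > 0) {
		ssize_t ret = recv(fd, buf, len, MSG_PEEK);

		if (ret < 0 && errno == EINTR)
			continue;
		if (ret <= 0)
			return -1;

		/*
		 * MSG_PEEK always re-reads from the head of the queue, so
		 * the return value is compared with len, never accumulated.
		 */
		if ((size_t)ret >= len)
			return ret;

		/* Not enough data yet: wait briefly instead of spinning. */
		usleep(1000);
	}

	return -1;
}
```

Summing the return values, as a consuming read loop would, counts the same leading bytes on every pass; comparing each return value against the expected length avoids that.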
* Re: [PATCH net] net: ax25: fix integer overflow in ax25_rx_fragment()
From: Simon Horman @ 2026-04-15 15:56 UTC
To: mashiro.chen
Cc: 'Simon Horman', netdev, davem, edumazet, kuba, pabeni,
jreuter, linux-hams, linux-kernel, stable
In-Reply-To: <20260408172521.281365-1-mashiro.chen@mailbox.org>
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: ax25: fix integer overflow in ax25_rx_fragment()
This commit addresses an integer overflow in the ax25_cb fragmentation
reassembly accumulator. It adds a check to ensure the combined payload
of fragments does not exceed USHRT_MAX. If the limit is exceeded, the
fragment queue is purged to prevent allocating a tiny buffer that leads
to a panic.
> Every skb_put() call in the copy loop
> that follows writes far beyond the allocated headroom, corrupting
> the kernel heap.
Is this description accurate?
skb_put() explicitly performs strict bounds checking and triggers a
deterministic kernel panic via skb_over_panic() if exceeded, rather than
silently corrupting the heap.
> Fix mirrors the identical bug fixed in NET/ROM (nr_in.c):
This isn't a bug, but is this the correct file reference? The check for
USHRT_MAX actually appears to exist in X.25 (x25_in.c), not NET/ROM.
> diff --git a/net/ax25/ax25_in.c b/net/ax25/ax25_in.c
> index d75b3e9ed93de..68202c19b19e3 100644
> --- a/net/ax25/ax25_in.c
> +++ b/net/ax25/ax25_in.c
> @@ -41,6 +41,11 @@ static int ax25_rx_fragment(ax25_cb *ax25, struct sk_buff *skb)
> /* Enqueue fragment */
> ax25->fragno = *skb->data & AX25_SEG_REM;
> skb_pull(skb, 1); /* skip fragno */
> + if ((unsigned int)ax25->fraglen + skb->len > USHRT_MAX) {
Is there a race condition with ax25->fraglen here?
If fragments for the same connection arrive concurrently on different CPUs,
both could evaluate this bounds check as safe and then concurrently add
their lengths.
> + skb_queue_purge(&ax25->frag_queue);
> + ax25->fragno = 0;
> + return 1;
Does this code leak the skb?
When returning 1 here, the caller might skip freeing the skb because it
assumes the packet was successfully consumed or queued.
Could we explicitly call kfree_skb(skb) before returning 1, or return 0 to
instruct the caller to free it?
> + }
> ax25->fraglen += skb->len;
> skb_queue_tail(&ax25->frag_queue, skb);
[ ... ]
> /* Copy data from the fragments */
> while ((skbo = skb_dequeue(&ax25->frag_queue)) != NULL) {
> skb_copy_from_linear_data(skbo,
> skb_put(skbn, skbo->len),
> skbo->len);
Could this read past the end of the skbo->data buffer if the fragment is
non-linear?
If the skb has paged data, skbo->len exceeds the linearly mapped data size,
which might trigger a panic or leak adjacent memory into the reassembled
packet.
Should this use skb_copy_bits() instead?
[ ... ]
> } else {
> /* First fragment received */
> if (*skb->data & AX25_SEG_FIRST) {
> skb_queue_purge(&ax25->frag_queue);
> ax25->fragno = *skb->data & AX25_SEG_REM;
> skb_pull(skb, 1); /* skip fragno */
> ax25->fraglen = skb->len;
Is there a possibility of truncating the length here?
Since ax25->fraglen is an unsigned short, if a first fragment arrives with
skb->len > USHRT_MAX, it might be silently truncated.
Subsequent continuation fragments will be validated against this artificially
small fraglen, bypassing the new USHRT_MAX check.
Should the bounds check be applied to the first fragment as well?
> skb_queue_tail(&ax25->frag_queue, skb);
> return 1;
> }
> }
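
The arithmetic behind the overflow check can be illustrated with a small userspace model. It is only a sketch: `struct frag_state`, `frag_first()` and `frag_append()` are hypothetical names, not kernel symbols, and `frag_first()` reflects the question raised in the review about the first fragment rather than anything in the patch itself.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reassembly state kept in ax25_cb. */
struct frag_state {
	unsigned short fraglen;	/* accumulated payload length */
};

/* Start a new reassembly; refuse a length that would be truncated. */
static bool frag_first(struct frag_state *st, unsigned int len)
{
	if (len > USHRT_MAX)
		return false;

	st->fraglen = (unsigned short)len;
	return true;
}

/* Account for one more fragment; refuse it if the total would wrap. */
static bool frag_append(struct frag_state *st, unsigned int len)
{
	/* Same condition as "fraglen + len > USHRT_MAX", done in unsigned int. */
	if (len > USHRT_MAX - (unsigned int)st->fraglen)
		return false;

	st->fraglen = (unsigned short)(st->fraglen + len);
	return true;
}
```

Without the check, and assuming a 16-bit unsigned short, an accumulated length of 65000 plus a 1000-byte fragment would be stored as 464, so a buffer sized from that value would be far smaller than the data later copied into it.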
* Re: [PATCH] rose: Fix rose_find_socket() returning without sock_hold()
From: kernel test robot @ 2026-04-15 16:12 UTC
To: Dudu Lu, netdev
Cc: llvm, oe-kbuild-all, davem, edumazet, kuba, pabeni, Dudu Lu
In-Reply-To: <20260413090420.79932-1-phx0fer@gmail.com>
Hi Dudu,
kernel test robot noticed the following build errors:
[auto build test ERROR on net/main]
[also build test ERROR on net-next/main linus/master horms-ipvs/master v7.0 next-20260414]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dudu-Lu/rose-Fix-rose_find_socket-returning-without-sock_hold/20260414-194608
base: net/main
patch link: https://lore.kernel.org/r/20260413090420.79932-1-phx0fer%40gmail.com
patch subject: [PATCH] rose: Fix rose_find_socket() returning without sock_hold()
config: i386-randconfig-012-20260415 (https://download.01.org/0day-ci/archive/20260416/202604160039.PLn74vyE-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260416/202604160039.PLn74vyE-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604160039.PLn74vyE-lkp@intel.com/
All errors (new ones prefixed by >>):
>> net/rose/af_rose.c:1:2: error: expected identifier or '('
1 | if (s)
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:138:8: note: expanded from macro '_SIG_SET_BINOP'
138 | a3 = a->sig[3]; a2 = a->sig[2]; \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:138:24: note: expanded from macro '_SIG_SET_BINOP'
138 | a3 = a->sig[3]; a2 = a->sig[2]; \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:139:8: note: expanded from macro '_SIG_SET_BINOP'
139 | b3 = b->sig[3]; b2 = b->sig[2]; \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:139:24: note: expanded from macro '_SIG_SET_BINOP'
139 | b3 = b->sig[3]; b2 = b->sig[2]; \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from net/rose/af_rose.c:21:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:140:3: note: expanded from macro '_SIG_SET_BINOP'
vim +1 net/rose/af_rose.c
> 1 if (s)
2 sock_hold(s);// SPDX-License-Identifier: GPL-2.0-or-later
3 /*
4 *
5 * Copyright (C) Jonathan Naylor G4KLX (g4klx@g4klx.demon.co.uk)
6 * Copyright (C) Alan Cox GW4PTS (alan@lxorguk.ukuu.org.uk)
7 * Copyright (C) Terry Dawson VK2KTJ (terry@animats.net)
8 * Copyright (C) Tomi Manninen OH2BNS (oh2bns@sral.fi)
9 */
10
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH net] ixgbevf: fix use-after-free in VEPA multicast source pruning
From: Simon Horman @ 2026-04-15 16:17 UTC
To: Michael Bommarito
Cc: intel-wired-lan, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev, stable, linux-kernel
In-Reply-To: <20260413182427.298513-1-michael.bommarito@gmail.com>
On Mon, Apr 13, 2026 at 02:24:27PM -0400, Michael Bommarito wrote:
> ixgbevf_clean_rx_irq() prunes frames whose source MAC matches the VF's
> own address (VEPA multicast workaround) by freeing the skb and
> continuing to the next descriptor:
>
> dev_kfree_skb_irq(skb);
> continue;
>
> The skb pointer is declared outside the while loop and persists across
> iterations. Because the continue skips the "skb = NULL" reset at the
> bottom of the loop, the next iteration enters the "else if (skb)" path
> and calls ixgbevf_add_rx_frag() on the freed skb, dereferencing
> skb_shinfo(skb)->nr_frags — a use-after-free in NAPI softirq context.
>
> The sibling driver iavf already handles this correctly by nulling the
> pointer before continuing. Apply the same pattern here.
>
> I do not have ixgbevf hardware; the bug was found by static analysis
> (scan_drop_continue_loops.py + semgrep drop_continue_in_loop, multi-tool
> corroboration with the highest score in the scan). The UAF was confirmed
> under KASAN by loading a test module that reproduces the exact code
> pattern (alloc skb, kfree_skb, then read skb_shinfo(skb)->nr_frags):
>
> BUG: KASAN: slab-use-after-free in ixgbevf_uaf_test_init+0x100/0x1000
> Read of size 8 at addr 000000006163ae78 by task insmod/30
> freed 208-byte region [000000006163adc0, 000000006163ae90)
>
> QEMU emulates igb (82576) but not ixgbe (82599), and the igbvf VF
> driver does not include the VEPA source pruning path, so a full
> end-to-end reproduction with emulated hardware was not possible.
>
> Fixes: bad17234ba70 ("ixgbevf: Change receive model to use double buffered page based receives")
> Cc: stable@vger.kernel.org
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: Codex:gpt-5-4
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Sashiko flags a number of issues in the same function that
do not seem related to your patch.
I'd suggest looking over them if you are interested in
follow-up work in this area.
...
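
The drop-and-continue pattern from the commit message can be reproduced in a small userspace model. This is a sketch under stated assumptions, not the ixgbevf code: `struct rx_desc`, `struct rx_buf` and `clean_rx()` are hypothetical, and heap allocations stand in for skbs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor: end-of-packet flag and "source is our own MAC". */
struct rx_desc {
	bool eop;
	bool own_src;
};

/* Hypothetical stand-in for an skb being assembled from fragments. */
struct rx_buf {
	unsigned int nr_frags;
};

static unsigned int clean_rx(const struct rx_desc *ring, size_t n,
			     unsigned int *dropped)
{
	struct rx_buf *buf = NULL;	/* persists across loop iterations */
	unsigned int delivered = 0;

	for (size_t i = 0; i < n; i++) {
		if (!buf) {
			buf = calloc(1, sizeof(*buf));
			if (!buf)
				break;
		} else {
			/* Would touch freed memory if buf were stale. */
			buf->nr_frags++;
		}

		if (!ring[i].eop)
			continue;	/* packet continues in the next descriptor */

		if (ring[i].own_src) {
			free(buf);
			buf = NULL;	/* the fix: forget the pointer before continuing */
			(*dropped)++;
			continue;
		}

		delivered++;
		free(buf);		/* stands in for handing the frame to the stack */
		buf = NULL;
	}

	free(buf);
	return delivered;
}
```

If the `buf = NULL` in the drop branch is removed, the next iteration takes the `else` branch and increments `nr_frags` in freed memory, which corresponds to the access the commit message describes.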
* Re: [RFC PATCH net-next 2/2] selftests: net: add FOU multicast encapsulation resubmit test
From: Jakub Kicinski @ 2026-04-15 16:18 UTC
To: Anton Danilov
Cc: Breno Leitao, netdev, willemdebruijn.kernel, davem, dsahern,
edumazet, pabeni, horms, shuah, linux-kselftest
In-Reply-To: <ad9hkJXAnlv2ZUm6@gmail.com>
On Wed, 15 Apr 2026 03:25:59 -0700 Breno Leitao wrote:
> On Wed, Apr 15, 2026 at 02:28:06AM +0300, Anton Danilov wrote:
> > +send_fou_gre_packets() {
> > + local count=$1
> > +
> > + ip netns exec "$NSENDER" python3 -c "
>
> Having Python code embedded directly in the shell function makes this
> difficult to review and maintain. Could you extract the Python script to
> a separate file? This would simplify the code to just:
>
> ip netns exec "$NSENDER" python3 my_python_script.py
Or just rewrite the whole thing in Python (no preference)
* Re: [PATCH] netfilter: xt_realm: fix null-ptr-deref in realm_mt()
From: Pablo Neira Ayuso @ 2026-04-15 16:21 UTC
To: Florian Westphal
Cc: Kito Xu (veritas501), phil, davem, edumazet, kuba, pabeni, horms,
jengelh, kaber, netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <ad9d52dQWrS1H_ju@strlen.de>
On Wed, Apr 15, 2026 at 11:44:07AM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Wed, Apr 15, 2026 at 11:02:15AM +0200, Florian Westphal wrote:
> > > Kito Xu (veritas501) <hxzene@gmail.com> wrote:
> > > > realm_mt() unconditionally dereferences skb_dst(skb) without a NULL
> > > > check. The xt_realm match registers with .family = NFPROTO_UNSPEC,
> > > > making it available to all netfilter protocol families. Through the
> > > > nftables compat layer (nft_compat), an unprivileged user inside a
> > > > user/net namespace can load this match into a bridge-family chain.
> > >
> > > I do not think this bug is related to nft_compat.
> > > You can also use ebtables setsockopt api to request xt_realm, no?
> > >
> > > > Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
> > >
> > > Looks correct. Alternatively we could revert the xt_realm.c change.
> > > But I don't have a strong opinion here, patch looks correct.
> >
> > Maybe partial revert makes sense, since in ab4f21e6fb1c:
> >
> > - xt_MARK: OK
> > - xt_NOTRACK: OK
> > - xt_comment: OK
>
> Agree.
>
> > - xt_mac: There is a better way to do this in bridge.
>
> Right.
>
> > - xt_owner, no sockets in bridge.
>
> Output/postrouting maybe?
>
> > - xt_physdev, which makes no sense in bridge, this is for br_netfilter
> > only.
>
> Agree.
>
> > - xt_realm (as already mentioned).
> > That is, a partial revert of this patch for:
> >
> > - xt_mac
> > - xt_owner
> > - xt_physdev
> > - xt_realm
>
> I'm ok with that too.
For the record, this patch has been replaced by:
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20260415113334.61008-1-pablo@netfilter.org/
* [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
From: syzbot ci @ 2026-04-15 16:22 UTC
To: nogikh, hawk, linux-kernel, netdev, syzbot, syzkaller-bugs
Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260415130533.849053-1-nogikh@google.com>
syzbot ci has tested the suggested fix patch on top of the following series:
[v2] veth: add Byte Queue Limits (BQL) support
https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org
Patch: https://ci.syzbot.org/jobs/4a19c4e7-8505-49e5-b80f-6107406612b0/patch
The patch testing request could not be completed:
Testing failed due to an infrastructure error.
Testing results:
* [build 0] Build Patched: error
Full report is available here:
https://ci.syzbot.org/session/67022682-86d9-4483-a528-4d95990f8038
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [PATCH net 1/1] 8021q: free cleared egress QoS mappings safely
From: Eric Dumazet @ 2026-04-15 16:25 UTC
To: Ren Wei
Cc: netdev, andrew+netdev, davem, kuba, pabeni, horms, kees,
yifanwucs, tomapufckgml, yuantan098, bird, ylong030
In-Reply-To: <b877895cd02d35254b5c05d3c40abbf130cd87eb.1776039122.git.ylong030@ucr.edu>
On Mon, Apr 13, 2026 at 2:08 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Longxuan Yu <ylong030@ucr.edu>
>
> vlan_dev_set_egress_priority() leaves cleared egress priority mapping
> nodes in the hash until device teardown. Repeated set/clear cycles with
> distinct skb priorities therefore allocate an unbounded number of
> vlan_priority_tci_mapping objects and leak memory.
>
> Delete mappings when vlan_prio is cleared instead of keeping
> tombstones. The TX fast path and reporting paths walk the lists without
> RTNL, so convert the egress mapping lists to RCU-protected pointers and
> defer freeing removed nodes until after a grace period.
>
> Cc: stable@kernel.org
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
> Suggested-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
> include/linux/if_vlan.h | 23 +++++++++++--------
> net/8021q/vlan_dev.c | 48 +++++++++++++++++++++++-----------------
> net/8021q/vlan_netlink.c | 9 +++-----
> net/8021q/vlanproc.c | 12 ++++++----
>
> @@ -604,11 +606,17 @@ void vlan_dev_free_egress_priority(const struct net_device *dev)
> int i;
>
> for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) {
> - while ((pm = vlan->egress_priority_map[i]) != NULL) {
> - vlan->egress_priority_map[i] = pm->next;
> - kfree(pm);
> + pm = rtnl_dereference(vlan->egress_priority_map[i]);
> + RCU_INIT_POINTER(vlan->egress_priority_map[i], NULL);
> + while (pm) {
> + struct vlan_priority_tci_mapping *next;
> +
> + next = rtnl_dereference(pm->next);
> + kfree_rcu_mightsleep(pm);
Please avoid kfree_rcu_mightsleep().
Embed instead one rcu_head in the object.
> + pm = next;
> }
> }
> + vlan->nr_egress_mappings = 0;
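
The suggestion to embed a callback head in the object can be illustrated with a userspace model. This is only a sketch of the idea: `struct cb_head`, `defer_call()` and `run_deferred()` are hypothetical stand-ins for `struct rcu_head`, the RCU callback queue and the end of a grace period, and `struct prio_mapping` merely resembles the mapping object in the patch. In the kernel the equivalent would presumably be the two-argument form of kfree_rcu() naming the embedded field.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

struct prio_mapping {
	unsigned int priority;
	unsigned short vlan_qos;
	struct prio_mapping *next;
	struct cb_head rcu;	/* embedded: queuing a free never allocates or sleeps */
};

/* Callbacks waiting for the simulated grace period. */
static struct cb_head *pending;

static void defer_call(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

static void mapping_free_cb(struct cb_head *head)
{
	/* Recover the enclosing object from the embedded head. */
	char *base = (char *)head - offsetof(struct prio_mapping, rcu);

	free(base);
}

/* Simulate the end of a grace period; returns the number of callbacks run. */
static unsigned int run_deferred(void)
{
	unsigned int n = 0;

	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;
		head->func(head);
		n++;
	}

	return n;
}
```

Because the bookkeeping lives inside the object being freed, queuing the deferred free needs no extra memory and cannot block, which is the property the one-argument, may-sleep variant does not offer.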
* Re: [PATCH v4] nfc: hci: fix out-of-bounds read in HCP header parsing
From: Simon Horman @ 2026-04-15 16:26 UTC
To: Ashutosh Desai; +Cc: netdev, kuba, edumazet, davem, pabeni, linux-kernel
In-Reply-To: <177614425081.3600288.2536320552978506086@gmail.com>
On Tue, Apr 14, 2026 at 05:24:10AM -0000, Ashutosh Desai wrote:
> nfc_hci_recv_from_llc() and nci_hci_data_received_cb() cast skb->data
> to struct hcp_packet and read the message header byte without checking
> that enough data is present in the linear sk_buff area. A malicious NFC
> peer can send a 1-byte HCP frame that passes through the SHDLC layer
> and reaches these functions, causing an out-of-bounds heap read.
>
> Fix this by adding pskb_may_pull() before each cast to ensure the full
> 2-byte HCP header is pulled into the linear area before it is accessed.
>
> Fixes: 8b8d2e08bf0d ("NFC: HCI support")
> Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
Unfortunately this patch seems to be whitespace-damaged
and does not apply. Please address that and repost.
--
pw-bot: changes-requested
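
The length check that the commit message describes can be modelled in userspace as shown below. It is a sketch with hypothetical names (`struct hcp_hdr`, `hcp_parse_header()`), and it covers only the "are two bytes present" part; in the kernel, pskb_may_pull() additionally makes sure those bytes sit in the linear area of the skb.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Packet header byte followed by message header byte. */
#define HCP_HEADER_LEN 2

/* Hypothetical parsed form of the two header bytes. */
struct hcp_hdr {
	uint8_t pkt_header;
	uint8_t msg_header;
};

/* Validate the length before touching either header byte. */
static bool hcp_parse_header(const uint8_t *data, size_t len,
			     struct hcp_hdr *out)
{
	if (!data || len < HCP_HEADER_LEN)
		return false;	/* a 1-byte frame is rejected here */

	out->pkt_header = data[0];
	out->msg_header = data[1];
	return true;
}
```

A 1-byte frame fails the check before `data[1]` is read, which is the out-of-bounds access the patch is meant to prevent.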
* Re: [PATCH net v5] net: stmmac: Prevent NULL deref when RX memory exhausted
From: Russell King (Oracle) @ 2026-04-15 16:28 UTC
To: Sam Edwards
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Maxime Coquelin, Alexandre Torgue, Maxime Chevallier,
Ovidiu Panait, Vladimir Oltean, Baruch Siach, Serge Semin,
Giuseppe Cavallaro, netdev, linux-stm32, linux-arm-kernel,
linux-kernel, stable
In-Reply-To: <ad-LAB08-_rpmMzK@shell.armlinux.org.uk>
On Wed, Apr 15, 2026 at 01:56:32PM +0100, Russell King (Oracle) wrote:
> On Tue, Apr 14, 2026 at 07:39:47PM -0700, Sam Edwards wrote:
> > The CPU receives frames from the MAC through conventional DMA: the CPU
> > allocates buffers for the MAC, then the MAC fills them and returns
> > ownership to the CPU. For each hardware RX queue, the CPU and MAC
> > coordinate through a shared ring array of DMA descriptors: one
> > descriptor per DMA buffer. Each descriptor includes the buffer's
> > physical address and a status flag ("OWN") indicating which side owns
> > the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
> > the flag and the MAC is only allowed to clear it, and both must move
> > through the ring in sequence: thus the ring is used for both
> > "submissions" and "completions."
> >
> > In the stmmac driver, stmmac_rx() bookmarks its position in the ring
> > with the `cur_rx` index. The main receive loop in that function checks
> > for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
> > network stack (NULLing the pointer), and increments `cur_rx` modulo the
> > ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
> > position with `dirty_rx`, allocates fresh buffers and rearms the
> > descriptors (setting OWN=1). If it fails any allocation, it simply stops
> > early (leaving OWN=0) and will retry where it left off when next called.
> >
> > This means descriptors have a three-stage lifecycle (terms my own):
> > - `empty` (OWN=1, buffer valid)
> > - `full` (OWN=0, buffer valid and populated)
> > - `dirty` (OWN=0, buffer NULL)
> >
> > But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
> > the past (see 'Fixes:'), there was a bug where the loop could cycle
> > `cur_rx` all the way back to the first descriptor it dirtied, resulting
> > in a NULL dereference when mistaken for `full`. The aforementioned
> > commit resolved that *specific* failure by capping the loop's iteration
> > limit at `dma_rx_size - 1`, but this is only a partial fix: if the
> > previous stmmac_rx_refill() didn't complete, then there are leftover
> > `dirty` descriptors that the loop might encounter without needing to
> > cycle fully around. The current code therefore panics (see 'Closes:')
> > when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
> > catch up to `dirty_rx`.
> >
> > Fix this by further tightening the clamp from `dma_rx_size - 1` to
> > `dma_rx_size - stmmac_rx_dirty() - 1`, subtracting any remnant dirty
> > entries and limiting the loop so that `cur_rx` cannot catch back up to
> > `dirty_rx`. This carries no risk of arithmetic underflow: since the
> > maximum possible return value of stmmac_rx_dirty() is `dma_rx_size - 1`,
> > the worst the clamp can do is prevent the loop from running at all.
> >
> > Fixes: b6cb4541853c7 ("net: stmmac: avoid rx queue overrun")
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
>
> Locally, while debugging my issues, I used this to prevent cur_rx
> catching up with dirty_rx:
>
> status = stmmac_rx_status(priv, &priv->xstats, p);
> /* check if managed by the DMA otherwise go ahead */
> if (unlikely(status & dma_own))
> break;
>
> next_entry = STMMAC_NEXT_ENTRY(rx_q->cur_rx,
> priv->dma_conf.dma_rx_size);
> if (unlikely(next_entry == rx_q->dirty_rx))
> break;
>
> rx_q->cur_rx = next_entry;
>
> If we care about the cost of reloading rx_q->dirty_rx on every
> iteration, then I'd suggest that the cost we already incur reading and
> writing rx_q->cur_rx is something that should be addressed, and
> eliminating that would counter the cost of reading rx_q->dirty_rx. I
> suspect, however, that the cost is minimal, as cur_rx and dirty_rx are
> likely in the same cache line.
>
> It looks like any fix to stmmac_rx() will also need a corresponding
> fix for stmmac_rx_zc().
I have some further information, but a new curveball has just been
chucked... and I've no idea what this will mean at this stage. Just
take it that I won't be responding for a while.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
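
The clamp proposed in the quoted patch can be worked through with a small model of the ring indices. This is a sketch, not the driver code: `struct rx_ring`, `rx_dirty()` and `rx_limit()` are hypothetical names, with `rx_dirty()` written on the assumption that it counts the descriptors between `dirty_rx` and `cur_rx` modulo the ring size.

```c
/* Hypothetical model of the per-queue RX ring bookkeeping. */
struct rx_ring {
	unsigned int size;	/* number of descriptors (dma_rx_size) */
	unsigned int cur_rx;	/* next descriptor the receive loop examines */
	unsigned int dirty_rx;	/* next descriptor the refill path rearms */
};

/* Descriptors already consumed but not yet refilled. */
static unsigned int rx_dirty(const struct rx_ring *r)
{
	if (r->dirty_rx <= r->cur_rx)
		return r->cur_rx - r->dirty_rx;

	return r->size - r->dirty_rx + r->cur_rx;
}

/*
 * Upper bound on receive-loop iterations so that cur_rx cannot catch
 * back up to dirty_rx. rx_dirty() is at most size - 1, so the
 * subtraction cannot underflow.
 */
static unsigned int rx_limit(const struct rx_ring *r, unsigned int budget)
{
	unsigned int room = r->size - rx_dirty(r) - 1;

	return budget < room ? budget : room;
}
```

With 8 descriptors, `cur_rx` = 5 and `dirty_rx` = 2, three entries are dirty and the loop may process at most 4 more. When 7 entries are dirty the limit is 0 and the loop does not run at all, which matches the worst case described in the commit message.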
* Re: [PATCH iwl-net] ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
From: Simon Horman @ 2026-04-15 16:30 UTC
To: Petr Oros
Cc: netdev, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Aleksandr Loktionov, Nikolay Aleksandrov, Daniel Zahka,
Paul Greenwalt, Dave Ertman, Michal Swiatkowski, jacob.e.keller,
intel-wired-lan, linux-kernel
In-Reply-To: <20260413191420.3524013-1-poros@redhat.com>
On Mon, Apr 13, 2026 at 09:14:20PM +0200, Petr Oros wrote:
> On certain E810 configurations where firmware supports Tx scheduler
> topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
> may need to apply a new 5-layer or 9-layer topology from the DDP
> package. If the AQ command to set the topology fails (e.g. due to
> invalid DDP data or firmware limitations), the global configuration
> lock must still be cleared via a CORER reset.
>
> Commit 86aae43f21cf ("ice: don't leave device non-functional if Tx
> scheduler config fails") correctly fixed this by refactoring
> ice_cfg_tx_topo() to always trigger CORER after acquiring the global
> lock and re-initialize hardware via ice_init_hw() afterwards.
>
> However, commit 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end
> of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
> breaking the reinit path introduced by 86aae43f21cf. This creates an
> infinite recursive call chain:
>
> ice_init_hw()
> ice_init_dev_hw()
> ice_cfg_tx_topo() # topology change needed
> ice_deinit_hw()
> ice_init_hw() # reinit after CORER
> ice_init_dev_hw() # recurse
> ice_cfg_tx_topo()
> ... # stack overflow
>
> Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
> it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
> caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
> during its reinit, it only needs the core HW reinitialization. This
> breaks the recursion cleanly without adding flags or guards.
>
> The deinit ordering changes from commit 8a37f9e2ff40 ("ice: move
> ice_deinit_dev() to the end of deinit paths") which fixed slow rmmod
> are preserved, only the init-side placement of ice_init_dev_hw() is
> reverted.
>
> Fixes: 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end of deinit paths")
> Signed-off-by: Petr Oros <poros@redhat.com>
Hi Petr,
I don't intend to delay this patch.
But could you follow up by looking over the AI-generated
review of this patch on sashiko.dev?
Thanks!
* Re: [PATCH net] ixgbevf: fix use-after-free in VEPA multicast source pruning
From: Michael Bommarito @ 2026-04-15 16:30 UTC (permalink / raw)
To: Simon Horman
Cc: intel-wired-lan, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev, stable, linux-kernel
In-Reply-To: <20260415161720.GN772670@horms.kernel.org>
On Wed, Apr 15, 2026 at 12:17 PM Simon Horman <horms@kernel.org> wrote:
> Sashiko flags a number of issues in the same function that
> do not seem related to your patch.
>
> I'd suggest looking over them if you are interested in
> follow-up work in this area.
Sure, I'd be happy to keep going here if you're open to more hardening
patches.
Two Qs for you:
1. Do you want smaller patches for each or bigger method-level patches?
2. Anything on my list below that you would *not* want me touching?
I'll combine this with anything I can find from your Sashiko items:
1. line 104
    rule: semgrep bug-on-in-net-code (CWE-617)
    match: BUG_ON(!test_bit(__IXGBEVF_SERVICE_SCHED,
           &adapter->state))
    where: ixgbevf_service_event_schedule()
    status: untriaged
2. lines 1219-1225
    rule: net-drop-continue-in-loop + scan_drop_continue_loops.py
    match: VEPA multicast pruning kfree_skb + continue (UAF)
    where: ixgbevf_clean_rx_irq()
    status: SHIPPED as commit ca62ac02b30d (this patch)
3. line 2769
    rule: semgrep signed-int-as-size-param-kmalloc
    match: q_vector = kzalloc(size, GFP_KERNEL) (signed size)
    status: untriaged
4. line 3452
    rule: semgrep signed-int-as-size-param-kmalloc
    match: tx_ring->tx_buffer_info = vmalloc(size) (signed size)
    status: untriaged
5. line 3530
    rule: semgrep signed-int-as-size-param-kmalloc
    match: rx_ring->rx_buffer_info = vmalloc(size) (signed size)
    status: untriaged
6. line 4114
    rule: semgrep narrow-accumulator-overflow
    match: i += tx_ring->count;
    status: untriaged
7. line 4189
    rule: semgrep narrow-accumulator-overflow
    match: count += TXD_USE_COUNT(skb_frag_size(frag));
    status: untriaged
8. line 4192
    rule: semgrep narrow-accumulator-overflow
    match: count += skb_shinfo(skb)->nr_frags;
    status: untriaged
9. line 4695
    rule: coccinelle cancel_work.cocci
    match: INIT_WORK(&adapter->service_task, ixgbevf_service_task)
           with no matching cancel_work_sync on teardown path
    status: untriaged
10. line 4752
    rule: coccinelle null_after_free.cocci
    where: ixgbevf_probe() err_dma path
    status: untriaged
11. line 4795
    rule: coccinelle null_after_free.cocci
    where: ixgbevf_remove()
    status: untriaged
* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Simon Horman @ 2026-04-15 16:46 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org>
On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
>
> Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> Changes in v2:
> - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> order to avoid any possible NULL pointer dereference in
> airoha_qdma_cleanup_tx_queue()
This seems to be a separate issue.
If so, I think it should be split out into a separate patch.
> - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
I think it was covered in the review Jakub forwarded for v1. But FTR,
Sashiko has some feedback on this patch in the form of an existing bug
(that should almost certainly be handled separately from this patch).
> ---
> drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> 1 file changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 9e995094c32a..3c1a2bc68c42 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> dma_addr_t dma_addr;
>
> spin_lock_init(&q->lock);
> - q->ndesc = size;
> q->qdma = qdma;
> q->free_thr = 1 + MAX_SKB_FRAGS;
> INIT_LIST_HEAD(&q->tx_list);
>
> - q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> + q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> GFP_KERNEL);
> if (!q->entry)
> return -ENOMEM;
>
> - q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> + q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> &dma_addr, GFP_KERNEL);
> if (!q->desc)
> return -ENOMEM;
>
> - for (i = 0; i < q->ndesc; i++) {
> + for (i = 0; i < size; i++) {
> u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
>
> list_add_tail(&q->entry[i].list, &q->tx_list);
> WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> }
> + q->ndesc = size;
>
> /* xmit ring drop default setting */
> airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
>
> static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> {
> - struct airoha_eth *eth = q->qdma->eth;
> - int i;
> + struct airoha_qdma *qdma = q->qdma;
> + struct airoha_eth *eth = qdma->eth;
> + int i, qid = q - &qdma->q_tx[0];
> + struct airoha_queue_entry *e;
> + u16 index = 0;
>
> spin_lock_bh(&q->lock);
> for (i = 0; i < q->ndesc; i++) {
> - struct airoha_queue_entry *e = &q->entry[i];
super nit: In v2 e is always used within a block (here and in the hunk below).
So I would lean towards declaring e in the blocks where it is
used.
No need to repost just for this!
> + struct airoha_qdma_desc *desc = &q->desc[i];
>
> + e = &q->entry[i];
> if (!e->dma_addr)
> continue;
>
> @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> e->dma_addr = 0;
> e->skb = NULL;
> list_add_tail(&e->list, &q->tx_list);
> +
> + /* Reset DMA descriptor */
> + WRITE_ONCE(desc->ctrl, 0);
> + WRITE_ONCE(desc->addr, 0);
> + WRITE_ONCE(desc->data, 0);
> + WRITE_ONCE(desc->msg0, 0);
> + WRITE_ONCE(desc->msg1, 0);
> + WRITE_ONCE(desc->msg2, 0);
> +
> q->queued--;
> }
> +
> + if (!list_empty(&q->tx_list)) {
> + e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> + list);
> + index = e - q->entry;
> + }
> + /* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> + * empty.
> + */
> + airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> + FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> + airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> + FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> +
> spin_unlock_bh(&q->lock);
> }
>
>
> ---
> base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
* Re: [PATCH nf] netfilter: nf_tables: use RCU-safe list primitives for basechain hook list
From: Pablo Neira Ayuso @ 2026-04-15 16:54 UTC (permalink / raw)
To: Weiming Shi
Cc: Florian Westphal, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
netdev, linux-kernel, Xiang Mei
In-Reply-To: <20260410101321.915190-2-bestswngs@gmail.com>
On Fri, Apr 10, 2026 at 06:13:22PM +0800, Weiming Shi wrote:
> NFT_MSG_GETCHAIN runs as an NFNL_CB_RCU callback, so chain dumps
> traverse basechain->hook_list under rcu_read_lock() without holding
> commit_mutex. Meanwhile, nft_delchain_hook() mutates that same live
> hook_list with plain list_move() and list_splice(), and the commit/abort
> paths splice hooks back with plain list_splice(). None of these are
> RCU-safe list operations.
>
> A concurrent GETCHAIN dump can observe partially updated list pointers,
> follow them into stack-local or transaction-private list heads, and
> crash when container_of() produces a bogus struct nft_hook pointer.
For the record, v1 of the proposed series to fix this is here:
https://patchwork.ozlabs.org/project/netfilter-devel/list/?series=499757
* RE: [EXTERNAL] Re: [PATCH net] hv_sock: Report EOF instead of -EIO for FIN
From: Dexuan Cui @ 2026-04-15 16:55 UTC (permalink / raw)
To: Stefano Garzarella
Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, niuxuewei.nxw@antgroup.com,
linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
stable@vger.kernel.org, Ben Hillis, Mitchell Levy
In-Reply-To: <ad9pPrji1uYSgNir@sgarzare-redhat>
> From: Stefano Garzarella <sgarzare@redhat.com>
> Sent: Wednesday, April 15, 2026 3:38 AM
> >@@ -703,8 +703,22 @@ static s64 hvs_stream_has_data(struct vsock_sock
> *vsk)
> > switch (hvs_channel_readable_payload(hvs->chan)) {
> > case 1:
> > need_refill = !hvs->recv_desc;
> >- if (!need_refill)
> >- return -EIO;
> >+ if (!need_refill) {
>
> Can we drop `need_refill` entirly and just check `hvs->recv_desc` here?
OK. Will post v2 later today.
> Mainly because now the comment we are adding is confusing me about what
> `need_refill` means.
>
> The rest LGTM.
>
> Thanks,
> Stefano
Thanks for the review!
* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-15 16:58 UTC (permalink / raw)
To: Simon Horman
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260415164643.GQ772670@horms.kernel.org>
On Apr 15, Simon Horman wrote:
> On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> >
> > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > Changes in v2:
> > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> > order to avoid any possible NULL pointer dereference in
> > airoha_qdma_cleanup_tx_queue()
>
> This seems to be a separate issue.
> If so, I think it should be split out into a separate patch.
>
> > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
>
> I think it was covered in the review Jakub forwarded for v1. But FTR,
> Sashiko has some feedback on this patch in the form of an existing bug
> (that should almost certainly be handled separately from this patch).
Hi Simon,
I took a look at Sashiko's report [0], but this issue is not introduced by
this patch. Even if the suggested approach would be better, I guess the hw is
capable of managing out-of-order TX descriptors, so I think this patch is fine
as it is. Agree?
[0] https://sashiko.dev/#/patchset/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022%40kernel.org
>
> > ---
> > drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> > 1 file changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9e995094c32a..3c1a2bc68c42 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> > dma_addr_t dma_addr;
> >
> > spin_lock_init(&q->lock);
> > - q->ndesc = size;
> > q->qdma = qdma;
> > q->free_thr = 1 + MAX_SKB_FRAGS;
> > INIT_LIST_HEAD(&q->tx_list);
> >
> > - q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> > + q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> > GFP_KERNEL);
> > if (!q->entry)
> > return -ENOMEM;
> >
> > - q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> > + q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> > &dma_addr, GFP_KERNEL);
> > if (!q->desc)
> > return -ENOMEM;
> >
> > - for (i = 0; i < q->ndesc; i++) {
> > + for (i = 0; i < size; i++) {
> > u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
> >
> > list_add_tail(&q->entry[i].list, &q->tx_list);
> > WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> > }
> > + q->ndesc = size;
> >
> > /* xmit ring drop default setting */
> > airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> > @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> >
> > static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> > {
> > - struct airoha_eth *eth = q->qdma->eth;
> > - int i;
> > + struct airoha_qdma *qdma = q->qdma;
> > + struct airoha_eth *eth = qdma->eth;
> > + int i, qid = q - &qdma->q_tx[0];
> > + struct airoha_queue_entry *e;
> > + u16 index = 0;
> >
> > spin_lock_bh(&q->lock);
> > for (i = 0; i < q->ndesc; i++) {
> > - struct airoha_queue_entry *e = &q->entry[i];
>
> super nit: In v2 e is always used within a block (here and in the hunk below).
> So I would lean towards declaring e in the blocks where it is
> used.
>
> No need to repost just for this!
I can fix it if I need to repost.
Regards,
Lorenzo
>
> > + struct airoha_qdma_desc *desc = &q->desc[i];
> >
> > + e = &q->entry[i];
> > if (!e->dma_addr)
> > continue;
> >
> > @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> > e->dma_addr = 0;
> > e->skb = NULL;
> > list_add_tail(&e->list, &q->tx_list);
> > +
> > + /* Reset DMA descriptor */
> > + WRITE_ONCE(desc->ctrl, 0);
> > + WRITE_ONCE(desc->addr, 0);
> > + WRITE_ONCE(desc->data, 0);
> > + WRITE_ONCE(desc->msg0, 0);
> > + WRITE_ONCE(desc->msg1, 0);
> > + WRITE_ONCE(desc->msg2, 0);
> > +
> > q->queued--;
> > }
> > +
> > + if (!list_empty(&q->tx_list)) {
> > + e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > + list);
> > + index = e - q->entry;
> > + }
> > + /* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > + * empty.
> > + */
> > + airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > + FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > + airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > + FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> > +
> > spin_unlock_bh(&q->lock);
> > }
> >
> >
> > ---
> > base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> > change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> >
> > Best regards,
> > --
> > Lorenzo Bianconi <lorenzo@kernel.org>
> >
* [PATCH nf,v2 1/3] rculist: add list_splice_rcu() for private lists
From: Pablo Neira Ayuso @ 2026-04-15 17:08 UTC (permalink / raw)
To: netfilter-devel
Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms, joelagnelf,
josh, boqun, urezki, rostedt, mathieu.desnoyers, jiangshanlai,
qiang.zhang, rcu
This patch adds a helper function, list_splice_rcu(), to safely splice
a private (non-RCU-protected) list into an RCU-protected list.
The function ensures that only the pointer visible to RCU readers
(prev->next) is updated using rcu_assign_pointer(), while the rest of
the list manipulations are performed with regular assignments, as the
source list is private and not visible to concurrent RCU readers.
This is useful for moving elements from a private list into a global
RCU-protected list, ensuring safe publication for RCU readers.
Subsystems with some sort of batching mechanism from userspace can
benefit from this new function.
The function __list_splice_rcu() has been added for clarity and to
follow the same pattern as in the existing list_splice*() interfaces,
where there is a check to ensure that the list to splice is not
empty. Note that __list_splice_rcu() has no documentation for this
reason.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v2: including comments by Paul McKenney.
Except, I have deliberately held back on the suggestion to squash
__list_splice_rcu() into list_splice_rcu(); I instead removed
the documentation for __list_splice_rcu(). I am looking
at the other existing list_splice*() functions in list.h and rculist.h
to get this aligned with __list_splice(), which also has no users
in the tree and no documentation. I find it easier to read with
__list_splice(), but if this explanation is not sound so...
@Paul: I can post v3 squashing __list_splice_rcu(), just let me
know.
Thanks!
include/linux/rculist.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 2abba7552605..e3bc44225692 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -261,6 +261,35 @@ static inline void list_replace_rcu(struct list_head *old,
old->prev = LIST_POISON2;
}
+static inline void __list_splice_rcu(struct list_head *list,
+ struct list_head *prev,
+ struct list_head *next)
+{
+ struct list_head *first = list->next;
+ struct list_head *last = list->prev;
+
+ last->next = next;
+ first->prev = prev;
+ next->prev = last;
+ rcu_assign_pointer(list_next_rcu(prev), first);
+}
+
+/**
+ * list_splice_rcu - splice a non-RCU list into an RCU-protected list,
+ * designed for stacks.
+ * @list: the non RCU-protected list to splice
+ * @head: the place in the existing RCU-protected list to splice
+ *
+ * The list pointed to by @head can be RCU-read traversed concurrently with
+ * this function.
+ */
+static inline void list_splice_rcu(struct list_head *list,
+ struct list_head *head)
+{
+ if (!list_empty(list))
+ __list_splice_rcu(list, head, head->next);
+}
+
/**
* __list_splice_init_rcu - join an RCU-protected list into an existing list.
* @list: the RCU-protected list to splice
--
2.47.3
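For readers who want to poke at the publication order outside the kernel,
here is a minimal userspace sketch of the same splice. It is not the kernel
code: struct node, list_init(), list_add_tail_private() and splice_publish()
are hypothetical names, and a C11 store-release stands in for
rcu_assign_pointer() as an approximation. The three plain pointer updates
happen first, and only the final store to head->next makes the new nodes
reachable by a concurrent reader.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct list_head. */
struct node {
	_Atomic(struct node *) next;	/* followed by lockless readers */
	struct node *prev;		/* only touched by the writer */
};

static void list_init(struct node *head)
{
	atomic_init(&head->next, head);
	head->prev = head;
}

static bool list_is_empty(struct node *head)
{
	return atomic_load_explicit(&head->next, memory_order_relaxed) == head;
}

/* Append to a list that no reader can see yet, so relaxed stores suffice. */
static void list_add_tail_private(struct node *n, struct node *head)
{
	struct node *last = head->prev;

	atomic_store_explicit(&n->next, head, memory_order_relaxed);
	n->prev = last;
	atomic_store_explicit(&last->next, n, memory_order_relaxed);
	head->prev = n;
}

/* Splice the private list right after head, same order as the patch. */
static void splice_publish(struct node *list, struct node *head)
{
	struct node *first, *last, *next;

	if (list_is_empty(list))
		return;

	first = atomic_load_explicit(&list->next, memory_order_relaxed);
	last = list->prev;
	next = atomic_load_explicit(&head->next, memory_order_relaxed);

	/* Wire up the private nodes; readers cannot reach them yet. */
	atomic_store_explicit(&last->next, next, memory_order_relaxed);
	first->prev = head;
	next->prev = last;

	/* Single publishing store: readers now see first..last. */
	atomic_store_explicit(&head->next, first, memory_order_release);
}
```

As in the patch as posted, the source list head is not reinitialised after
the splice, so a caller that wants to reuse it has to call list_init() on it
again.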
* Re: [PATCH 1/1] xskmap: reject TX-only AF_XDP sockets
From: Yuan Tan @ 2026-04-15 17:22 UTC (permalink / raw)
To: Jason Xing, Linpu Yu
Cc: yuantan098, magnus.karlsson, maciej.fijalkowski, netdev, bpf, sdf,
davem, edumazet, kuba, pabeni, horms, ast, daniel, hawk,
john.fastabend, bjorn, linux-kernel, yifanwucs
In-Reply-To: <CAL+tcoAy5WQRsL6z=YinPaiBNvd_=WB7qsR4amf1x4=qVw7AAg@mail.gmail.com>
On 4/15/2026 1:43 AM, Jason Xing wrote:
> On Mon, Mar 30, 2026 at 3:33 AM Linpu Yu <linpu5433@gmail.com> wrote:
>> Reject TX-only AF_XDP sockets from XSKMAP updates. Redirected
>> packets always enter the Rx path, where the kernel expects the
>> selected socket to have an Rx ring. A TX-only socket can
>> currently be inserted into an XSKMAP, and redirecting a packet
>> to it crashes the kernel in xsk_generic_rcv().
>>
>> Keep TX-only AF_XDP sockets valid for pure Tx use, but prevent
>> them from being published through XSKMAP.
>>
>> Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Reported-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Xin Liu <bird@lzu.edu.cn>
>> Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
>> Signed-off-by: Linpu Yu <linpu5433@gmail.com>
> Hi Linpu,
>
> Any plan to post a v2 with our questions resolved?
>
> Thanks,
> Jason
Hi Jason,
Linpu has an exam to take this week. He told me he can try preparing the
v2 patch this weekend.
Best,
Yuan
* Re: [PATCH bpf-next 1/2] bpf: tcp: Reject TCP_NODELAY from BPF hdr opt callbacks
From: Martin KaFai Lau @ 2026-04-15 17:31 UTC (permalink / raw)
To: KaFai Wan
Cc: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
ast, daniel, andrii, eddyz87, memxor, song, yonghong.song, jolsa,
shuah, sdf, netdev, linux-kernel, bpf, linux-kselftest, Quan Sun,
Yinhao Hu, Kaiyan Mei
In-Reply-To: <20260414112310.1285783-2-kafai.wan@linux.dev>
On Tue, Apr 14, 2026 at 07:23:09PM +0800, KaFai Wan wrote:
> A BPF_SOCK_OPS program can enable
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG and then call
> bpf_setsockopt(TCP_NODELAY) from BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>
> That reaches __tcp_sock_set_nodelay(), which may call
> tcp_push_pending_frames(). The transmit path then computes TCP
> options again, re-enters bpf_skops_hdr_opt_len(), and invokes the
> same BPF callback recursively. This can loop until the kernel
> stack overflows.
>
> TCP_NODELAY is not safe from the header option callback context.
> Reject it with -EOPNOTSUPP when TCP header option callbacks are
> enabled on the socket, so the callback cannot recurse back into
> tcp_push_pending_frames() through do_tcp_setsockopt().
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 7e41df5dbba2 ("bpf: Add a few optnames to bpf_setsockopt")
> Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
> ---
> net/ipv4/tcp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 202a4e57a218..7ac4c98be19d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4004,7 +4004,10 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
>
> switch (optname) {
> case TCP_NODELAY:
> - __tcp_sock_set_nodelay(sk, val);
> + if (val && BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))
This will break the setsockopt() syscall and also break existing bpf progs
that call bpf_setsockopt(TCP_NODELAY) in a CB other than
BPF_SOCK_OPS_HDR_OPT_LEN_CB/BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
Let's brainstorm other options suggested on the list that have a smaller
blast radius.
pw-bot: cr
> + err = -EOPNOTSUPP;
> + else
> + __tcp_sock_set_nodelay(sk, val);
> break;
>
> case TCP_THIN_LINEAR_TIMEOUTS:
> --
> 2.43.0
>
* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Sam Edwards @ 2026-04-15 17:38 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Jakub Kicinski, Andrew Lunn, Alexandre Torgue, Andrew Lunn,
David S. Miller, Eric Dumazet,
moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
linux-stm32, Linux Network Development Mailing List, Paolo Abeni
In-Reply-To: <ad-ID2WaPgPJqdsa@shell.armlinux.org.uk>
On Wed, Apr 15, 2026 at 5:44 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Apr 14, 2026 at 07:12:34PM -0700, Sam Edwards wrote:
> > On Tue, Apr 14, 2026 at 6:19 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > > Okay, just a quick note to say that nvidia's 5.10.216-tegra kernel
> > > survives iperf3 -c -R to the imx6.
> >
> > Hi Russell,
> >
> > Aw, you beat me to it! I was about to report that 5.10.104-tegra is
> > unaffected. And my iperf3 server is a multi-GbE amd64 machine.
> >
> > > Dumping the registers and comparing, and then forcing the RQS and TQS
> > > values to 0x23 (+1 = 36, *256 = 9216 bytes) and 0x8f (+1 = 144,
> > > *256 = 36864 bytes) respectively seems to solve the problem. Under
> > > net-next, these both end up being 0xff (+1 = 256, *256 = 65536 bytes.)
> > > Suspiciously, 36 * 4 = 144, and I also see that this kernel programs
> > > all four of the MTL receive operation mode registers, but only the
> > > first MTL transmit operation mode register. However, DMA channels 1-3
> > > aren't initialised.
> >
> > Wow, great! I wonder if the problem is that the MTL FIFOs are smaller
> > than that, so when the DMA suffers a momentary hiccup, the FIFOs are
> > allowed to overflow, putting the hardware in a bad state.
> >
> > Though I suspect this is only half of the problem: do you still see
> > RBUs? Everything you've shared so far suggests the DMA failures are
> > _not_ because the rx ring is drying up.
>
> Yes. Note that RBUs will happen not because of DMA failures, but if
> the kernel fails to keep up with the packet rate. RBU means "we read
> the next descriptor, and it wasn't owned by hardware".
Are you speaking from observation, documentation, or understanding?
I'd define RBU the same way, but you reported:
```
[ 55.766199] dwc-eth-dwmac 2490000.ethernet eth0: q0: receive buffer
unavailable: cur_rx=309 dirty_rx=309 last_cur_rx=245
last_cur_rx_post=309 last_dirty_rx=245 count=64 budget=64
cur_rx == dirty_rx _should_ mean that we fully refilled the ring. [...]
[...]
Every ring entry contains the same RDES3 value, so it really is
completely full at the point RBU fires (bit 31 clear means software
owns the descriptor, and it's basically saying first/last segment,
RDES1 valid, buffer 1 length of 1518).
```
It would seem* that the kernel isn't really failing to keep up with
the packet rate. If RBU is firing with a ring that's not even close to
empty, that tells me there's another way for it to fire. So I suspect
the hardware designers implemented it to mean:
"We couldn't read the next descriptor, _or_ it wasn't owned by hardware."
(* However, if bit 31 is clear everywhere, wouldn't that mean the ring
is actually completely depleted, not full? If count==budget, wouldn't
that mean the whole ring hasn't been visited, so we only refilled 64
entries and not necessarily the entire ring? Maybe the kernel isn't
keeping up after all.)
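To make the index bookkeeping in that parenthetical concrete, here is a small
standalone sketch. It is not the stmmac code: RING_SIZE and the ring_*()
helper names are assumptions for illustration, and the wraparound just uses
the usual power-of-two mask idiom. It shows why cur == dirty on its own is
ambiguous (nothing outstanding, or cur has lapped the whole ring) and how a
"stop before next == dirty" guard like the one quoted from the other thread
removes that ambiguity.

```c
#define RING_SIZE 512U	/* assumed ring size; must be a power of two */

/* Advance an index by one slot, wrapping at the end of the ring. */
static unsigned int ring_next(unsigned int idx)
{
	return (idx + 1U) & (RING_SIZE - 1U);
}

/* Slots consumed by software (cur) but not yet refilled (dirty). */
static unsigned int ring_dirty(unsigned int cur, unsigned int dirty)
{
	return (cur - dirty) & (RING_SIZE - 1U);
}

/*
 * Consume up to budget slots, but never let cur catch up with dirty.
 * Without this guard, consuming RING_SIZE slots would bring cur back
 * to dirty and ring_dirty() would report 0 for a fully drained ring.
 */
static unsigned int ring_consume(unsigned int *cur, unsigned int dirty,
				 unsigned int budget)
{
	unsigned int count = 0;

	while (count < budget) {
		unsigned int next = ring_next(*cur);

		if (next == dirty)
			break;
		*cur = next;
		count++;
	}
	return count;
}
```

With the numbers from the log above (last_cur_rx=245, budget=64),
ring_consume() moves cur from 245 to 309 and ring_dirty(309, 245) is 64; once
the refill advances dirty to 309 as well, the difference is back to 0. That 0
only says the 64 slots visited in this poll were refilled, which is the point
of the question about count == budget.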
> That has:
>
> const nveu32_t rx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
> { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
> FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
> { FIFO_SZ(36U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U),
> FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(16U) },
> };
> const nveu32_t tx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
> { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
> FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
> { FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U),
> FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U) },
> };
>
> where each of those values is the RQS/TQS value to use in KiB:
>
> #define FIFO_SZ(x) ((((x) * 1024U) / 256U) - 1U)
>
> This doesn't correspond with the values I'm seeing programmed into
> the hardware under the 5.10.216-tegra kernel. I'm seeing TQS = 143
> (36KiB), and RQS = 35 (9KiB). Yes, these values exist in the tables
> above from a quick look, but they're not in the right place!
True, but:
a) I doubt 5.10.216-tegra includes exactly the same version of the
driver found in this random GitHub mirror. (My intent was only to
point out that they don't use 5.10's stmmac; I should have been more
clear that I wasn't trying to link the same version, sorry!)
b) This is vendor code; I don't know how good their testing/review
process is. It might not run the way it looks. The intent seems to be
for RQS > TQS (which makes intuitive sense), but as you're seeing the
registers programmed the other way 'round, they might have gotten them
subtly mixed up.
> Now, as for FIFO sizes, if we sum up all the entries, then we
> get:
>
> SUM(rx_fifo_size[0][]) = 60KiB
> SUM(rx_fifo_size[1][]) = 64KiB
> SUM(tx_fifo_size[0][]) = 60KiB
> SUM(tx_fifo_size[1][]) = 64KiB
I follow the math with 64KiB, but surely the 60KiB should be
9+9+9+9+1+1+1+1=40KiB? It seems to me that the "legacy EQOS" simply
ships with smaller FIFOs. Since dwmac is licensed as a soft IP core,
perhaps the FIFO size is an elaboration parameter? That would mean
this isn't an issue with dwmac 5.0 broadly, but with Nvidia's specific
instantiation of it.
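To double-check the arithmetic, here is a minimal standalone sketch (plain
userspace C, not driver code). FIFO_SZ() and the per-queue KiB values are
copied from the vendor snippet quoted above; NUM_QUEUES, sum_kib() and
fifo_bytes() are hypothetical helpers added for illustration, with NUM_QUEUES
assumed to be 8 because each table row has eight entries.

```c
/* Copied from the vendor snippet: KiB -> RQS/TQS register field value. */
#define FIFO_SZ(x) ((((x) * 1024U) / 256U) - 1U)

#define NUM_QUEUES 8	/* assumed value of OSI_EQOS_MAX_NUM_QUEUES */

/* Per-queue sizes in KiB, i.e. the arguments passed to FIFO_SZ(). */
static const unsigned int rx_kib[2][NUM_QUEUES] = {
	{ 9, 9, 9, 9, 1, 1, 1, 1 },
	{ 36, 2, 2, 2, 2, 2, 2, 16 },
};
static const unsigned int tx_kib[2][NUM_QUEUES] = {
	{ 9, 9, 9, 9, 1, 1, 1, 1 },
	{ 8, 8, 8, 8, 8, 8, 8, 8 },
};

/* Total FIFO space, in KiB, claimed by one row of a table. */
static unsigned int sum_kib(const unsigned int *row)
{
	unsigned int i, total = 0;

	for (i = 0; i < NUM_QUEUES; i++)
		total += row[i];
	return total;
}

/* Inverse of FIFO_SZ(): bytes described by an RQS/TQS field value. */
static unsigned int fifo_bytes(unsigned int field)
{
	return (field + 1U) * 256U;
}
```

sum_kib() gives 40 for row 0 and 64 for row 1 of both tables. FIFO_SZ(9U) is
35 (0x23) and FIFO_SZ(36U) is 143 (0x8f), which are the two register values
Russell reported, and fifo_bytes() turns 0x23, 0x8f and 0xff back into 9216,
36864 and 65536 bytes.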
* Re: [PATCH nf,v2 1/3] rculist: add list_splice_rcu() for private lists
From: Paul E. McKenney @ 2026-04-15 17:39 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: netfilter-devel, davem, netdev, kuba, pabeni, edumazet, fw, horms,
joelagnelf, josh, boqun, urezki, rostedt, mathieu.desnoyers,
jiangshanlai, qiang.zhang, rcu
In-Reply-To: <20260415170844.41355-1-pablo@netfilter.org>
On Wed, Apr 15, 2026 at 07:08:44PM +0200, Pablo Neira Ayuso wrote:
> This patch adds a helper function, list_splice_rcu(), to safely splice
> a private (non-RCU-protected) list into an RCU-protected list.
>
> The function ensures that only the pointer visible to RCU readers
> (prev->next) is updated using rcu_assign_pointer(), while the rest of
> the list manipulations are performed with regular assignments, as the
> source list is private and not visible to concurrent RCU readers.
>
> This is useful for moving elements from a private list into a global
> RCU-protected list, ensuring safe publication for RCU readers.
> Subsystems with some sort of batching mechanism from userspace can
> benefit from this new function.
>
> The function __list_splice_rcu() has been added for clarity and to
> follow the same pattern as in the existing list_splice*() interfaces,
> where there is a check to ensure that that the list to splice is not
> empty. Note that __list_splice_rcu() has no documentation for this
> reason.
>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> v2: including comments by Paul McKenney.
>
> Except, I have deliberately keep back the suggestion to squash
> __list_splice_rcu() into list_splice_rcu(), I instead removed
> the documentation for __list_splice_rcu(). I am looking
> at other existing list_splice*() function in list.h and rculist.h
> to get this aligned with __list_splice(), which also has no users
> in the tree and no documentation. I find it easier to read with
> __list_splice(), but if this explaination is not sound so...
>
> @Paul: I can post v3 squashing __list_splice_rcu(), just let me
> know.
Removing the comment addresses most of my concerns. I do have a slight
but not overwhelming preference for the squashed version, but either way:
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Or if you want this to go in via RCU, please let us know. My guess is
that it would be easier for you to take it in with the code using it.
Thanx, Paul
> Thanks!
>
> include/linux/rculist.h | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 2abba7552605..e3bc44225692 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -261,6 +261,35 @@ static inline void list_replace_rcu(struct list_head *old,
> old->prev = LIST_POISON2;
> }
>
> +static inline void __list_splice_rcu(struct list_head *list,
> + struct list_head *prev,
> + struct list_head *next)
> +{
> + struct list_head *first = list->next;
> + struct list_head *last = list->prev;
> +
> + last->next = next;
> + first->prev = prev;
> + next->prev = last;
> + rcu_assign_pointer(list_next_rcu(prev), first);
> +}
> +
> +/**
> + * list_splice_rcu - splice a non-RCU list into an RCU-protected list,
> + * designed for stacks.
> + * @list: the non RCU-protected list to splice
> + * @head: the place in the existing RCU-protected list to splice
> + *
> + * The list pointed to by @head can be RCU-read traversed concurrently with
> + * this function.
> + */
> +static inline void list_splice_rcu(struct list_head *list,
> + struct list_head *head)
> +{
> + if (!list_empty(list))
> + __list_splice_rcu(list, head, head->next);
> +}
> +
> /**
> * __list_splice_init_rcu - join an RCU-protected list into an existing list.
> * @list: the RCU-protected list to splice
> --
> 2.47.3
>
>
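For readers following the thread, the publication order described in the
commit message can be sketched in userspace C. This is an illustration
only, not the kernel code: struct list_head is redefined locally, the
helper names (init_list, add_tail, splice_publish) are hypothetical, and
the GCC/Clang builtin __atomic_store_n() with release ordering is used as
a stand-in for rcu_assign_pointer().

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical helpers, named differently from the kernel API on purpose. */
static void init_list(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/*
 * Splice the private list @list right after @head.
 * Only head->next is visible to concurrent readers, so it is the only
 * pointer written with release semantics, and it is written last.
 */
static void splice_publish(struct list_head *list, struct list_head *head)
{
	struct list_head *first, *last, *next;

	if (list_is_empty(list))
		return;

	first = list->next;
	last = list->prev;
	next = head->next;

	/* Plain stores: the private list is not reachable by readers yet. */
	last->next = next;
	first->prev = head;
	next->prev = last;

	/* Publication point (assumed stand-in for rcu_assign_pointer()). */
	__atomic_store_n(&head->next, first, __ATOMIC_RELEASE);
}
```

A reader walking forward from head sees either the old successor or the
fully linked new elements, because last->next is already set by the time
head->next is published.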
* Re: [PATCH net v5] net: stmmac: Prevent NULL deref when RX memory exhausted
From: Sam Edwards @ 2026-04-15 17:53 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Maxime Coquelin, Alexandre Torgue, Maxime Chevallier,
Ovidiu Panait, Vladimir Oltean, Baruch Siach, Serge Semin,
Giuseppe Cavallaro, netdev, linux-stm32, linux-arm-kernel,
linux-kernel, stable
In-Reply-To: <ad-8q4OrOm-VtGrO@shell.armlinux.org.uk>
On Wed, Apr 15, 2026 at 9:28 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Wed, Apr 15, 2026 at 01:56:32PM +0100, Russell King (Oracle) wrote:
> > Locally, while debugging my issues, I used this to prevent cur_rx
> > catching up with dirty_rx:
> >
> > status = stmmac_rx_status(priv, &priv->xstats, p);
> > /* check if managed by the DMA otherwise go ahead */
> > if (unlikely(status & dma_own))
> > break;
> >
> > next_entry = STMMAC_NEXT_ENTRY(rx_q->cur_rx,
> > priv->dma_conf.dma_rx_size);
> > if (unlikely(next_entry == rx_q->dirty_rx))
> > break;
> >
> > rx_q->cur_rx = next_entry;
> >
> > If we care about the cost of reloading rx_q->dirty_rx on every
> > iteration, then I'd suggest that the cost we already incur reading and
> > writing rx_q->cur_rx is something that should be addressed, and
> > eliminating that would counter the cost of reading rx_q->dirty_rx. I
> > suspect, however, that the cost is minimal, as cur_rx and dirty_rx are
> > likely in the same cache line.
No, no, I like your approach better. :) It also removes the need for
the `limit` clamp at the top of the function, so later code can assume
limit==budget.
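
To make the quoted guard concrete, here is a minimal userspace sketch of
a ring where the consumer index is never allowed to catch up with the
refill index. It is a sketch under assumptions, not the driver code:
struct rx_ring, RING_SIZE, NEXT_ENTRY and rx_poll are hypothetical names,
the ring size is assumed to be a power of two, and the DMA ownership
check from the quoted snippet is left out.

```c
#include <assert.h>

#define RING_SIZE 8u	/* hypothetical; must be a power of two */
#define NEXT_ENTRY(x, size) (((x) + 1) & ((size) - 1))

/* Hypothetical model of the two ring indices discussed above. */
struct rx_ring {
	unsigned int cur_rx;	/* next descriptor to process */
	unsigned int dirty_rx;	/* next descriptor to refill */
};

/*
 * Process up to @budget descriptors, stopping before cur_rx would
 * become equal to dirty_rx. Returns the number of entries consumed.
 */
static unsigned int rx_poll(struct rx_ring *r, unsigned int budget)
{
	unsigned int count = 0;

	while (count < budget) {
		unsigned int next_entry = NEXT_ENTRY(r->cur_rx, RING_SIZE);

		/* Advancing would collide with the refill index: stop here. */
		if (next_entry == r->dirty_rx)
			break;

		r->cur_rx = next_entry;
		count++;
	}

	return count;
}
```

With this guard the loop itself bounds the work at RING_SIZE - 1 entries
per pass when nothing has been refilled, so no separate clamp on the
budget is needed in this model.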
> > It looks like any fix to stmmac_rx() will also need a corresponding
> > fix for stmmac_rx_zc().
I agree that stmmac_rx_zc() is likely also broken (in a similar way,
but not similar enough to permit a "corresponding" fix), but I don't
agree that there's a dependency relationship here. This patch is
addressing #221010, which affects the generic/non-ZC codepath; I'm
afraid the ZC codepath warrants its own investigation.
> I have some further information, but a new curveball has just been
> chucked... and I've no idea what this will mean at this stage. Just
> take it that I won't be responding for a while.
I think I follow your meaning. Good luck getting it straightened out!
* Re: [PATCH net v3 2/5] bonding: 3ad: fix carrier when no valid slaves
From: Louis Scalbert @ 2026-04-15 17:53 UTC (permalink / raw)
To: Jay Vosburgh
Cc: netdev, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
shemminger, maheshb
In-Reply-To: <707939.1776099707@famine>
Hello Jay,
Thank you very much for this detailed review.
On Mon, Apr 13, 2026 at 19:01, Jay Vosburgh <jv@jvosburgh.net> wrote:
>
> Louis Scalbert <louis.scalbert@6wind.com> wrote:
>
> >Apply the "lacp_fallback" configuration from the previous commit.
> >
> >"lacp_fallback" mode "strict" asserts the bonding master carrier
> >only when at least 'min_links' slaves are in the collecting/distributing
> >state (or collecting only if the coupled_control default behavior is
> >disabled).
> >
> >Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
> >Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
> >---
> > drivers/net/bonding/bond_3ad.c | 26 ++++++++++++++++++++++++--
> > drivers/net/bonding/bond_options.c | 1 +
> > 2 files changed, 25 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> >index af7f74cfdc08..b79a76296966 100644
> >--- a/drivers/net/bonding/bond_3ad.c
> >+++ b/drivers/net/bonding/bond_3ad.c
> >@@ -745,6 +745,22 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
> > }
> > }
> >
> >+static int __agg_valid_ports(struct aggregator *agg)
> >+{
> >+ struct port *port;
> >+ int valid = 0;
> >+
> >+ for (port = agg->lag_ports; port;
> >+ port = port->next_port_in_aggregator) {
> >+ if (port->actor_oper_port_state & LACP_STATE_COLLECTING &&
> >+ (!port->slave->bond->params.coupled_control ||
> >+ port->actor_oper_port_state & LACP_STATE_DISTRIBUTING))
> >+ valid++;
>
> Do we need to test coupled_control? I.e., can the test be
With coupled_control enabled (default), the actor allows traffic from
the partner only when it reaches both the COLLECTING and DISTRIBUTING
states, i.e., in the AD_MUX_COLLECTING_DISTRIBUTING Mux state.
With coupled_control disabled, the actor allows traffic from the
partner as soon as it reaches the COLLECTING state, regardless of the
DISTRIBUTING flag. In this case, COLLECTING is set in the
AD_MUX_COLLECTING state, while DISTRIBUTING is only set later in the
AD_MUX_DISTRIBUTING state.
From the perspective of upper-layer processes, a carrier up state
indicates that the link is fully operational and capable of both
collecting and distributing traffic. These processes are not aware of
the distinction between COLLECTING and COLLECTING & DISTRIBUTING
states.
>
> if ((port->actor_oper_port_state & LACP_STATE_COLLECTING) &&
> (port->actor_oper_port_state & LACP_STATE_DISTRIBUTING))
>
If we want to allow collection to start without waiting for
distribution to be enabled, then the carrier must be asserted as soon
as the COLLECTING state is reached.
In that case, this test would not be valid.
In practice, I can only test for LACP_STATE_COLLECTING, because
LACP_STATE_DISTRIBUTING is always set together with
LACP_STATE_COLLECTING.
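
To illustrate the difference between the two tests, here is a small
standalone sketch of the predicate from the patch. It is not the bonding
code: port_is_valid and count_valid_ports are hypothetical helpers, the
port state is passed as a plain byte, and the flag values 0x10 and 0x20
are assumed to match the actor state bit layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values for the two actor state flags. */
#define LACP_STATE_COLLECTING   0x10
#define LACP_STATE_DISTRIBUTING 0x20

/*
 * A port counts as valid when it is collecting and, if coupled_control
 * is enabled, also distributing.
 */
static bool port_is_valid(unsigned char state, bool coupled_control)
{
	return (state & LACP_STATE_COLLECTING) &&
	       (!coupled_control || (state & LACP_STATE_DISTRIBUTING));
}

/* Count the valid ports in an array of actor states. */
static int count_valid_ports(const unsigned char *states, int n,
			     bool coupled_control)
{
	int valid = 0;
	int i;

	for (i = 0; i < n; i++)
		if (port_is_valid(states[i], coupled_control))
			valid++;

	return valid;
}
```

A port that is COLLECTING but not yet DISTRIBUTING is counted only when
coupled_control is disabled, which is the case the simpler two-flag test
would miss.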
> To my reading, ad_mux_machine will set _COLLECTING and
> _DISTRIBUTING appropriately regardless of the coupled_control selection.
I don't agree. See:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/bonding/bond_3ad.c?id=v7.0#n1090
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/bonding/bond_3ad.c?id=v7.0#n1202
>
> >+ }
> >+
> >+ return valid;
> >+}
> >+
> > static int __agg_active_ports(struct aggregator *agg)
> > {
> > struct port *port;
> >@@ -2120,6 +2136,7 @@ static void ad_enable_collecting_distributing(struct port *port,
> > port->actor_port_number,
> > port->aggregator->aggregator_identifier);
> > __enable_port(port);
> >+ bond_3ad_set_carrier(port->slave->bond);
> > /* Slave array needs update */
> > *update_slave_arr = true;
> > /* Should notify peers if possible */
> >@@ -2141,6 +2158,7 @@ static void ad_disable_collecting_distributing(struct port *port,
> > port->actor_port_number,
> > port->aggregator->aggregator_identifier);
> > __disable_port(port);
> >+ bond_3ad_set_carrier(port->slave->bond);
> > /* Slave array needs an update */
> > *update_slave_arr = true;
> > }
> >@@ -2819,8 +2837,12 @@ int bond_3ad_set_carrier(struct bonding *bond)
> > }
> > active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
> > if (active) {
> >- /* are enough slaves available to consider link up? */
> >- if (__agg_active_ports(active) < bond->params.min_links) {
> >+ /* are enough slaves in collecting (and distributing) state to consider
> >+ * link up?
> >+ */
> >+ if ((bond->params.lacp_fallback ? __agg_valid_ports(active)
> >+ : __agg_active_ports(active)) <
> >+ bond->params.min_links) {
>
> I think the original comment is better; if the new option is
> off, it doesn't require collecting / distributing state.
>
> -J
>
> > if (netif_carrier_ok(bond->dev)) {
> > netif_carrier_off(bond->dev);
> > goto out;
> >diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> >index b672b8a881bb..d64a5d2f80b6 100644
> >--- a/drivers/net/bonding/bond_options.c
> >+++ b/drivers/net/bonding/bond_options.c
> >@@ -1706,6 +1706,7 @@ static int bond_option_lacp_fallback_set(struct bonding *bond,
> > netdev_dbg(bond->dev, "Setting LACP fallback to %s (%llu)\n",
> > newval->string, newval->value);
> > bond->params.lacp_fallback = newval->value;
> >+ bond_3ad_set_carrier(bond);
> >
> > return 0;
> > }
> >--
> >2.39.2
> >
>
> ---
> -Jay Vosburgh, jv@jvosburgh.net
* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-15 17:58 UTC (permalink / raw)
To: Simon Horman
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260415164643.GQ772670@horms.kernel.org>
> On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> >
> > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > Changes in v2:
> > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> > order to avoid any possible NULL pointer dereference in
> > airoha_qdma_cleanup_tx_queue()
>
> This seems to be a separate issue.
> If so, I think it should be split out into a separate patch.
Sorry, I missed this comment.
Ack. I will repost, splitting this into two separate patches.
Regards,
Lorenzo
>
> > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
>
> I think it was covered in the review Jakub forwarded for v1. But FTR,
> Sashiko has some feedback on this patch in the form of an existing bug
> (that should almost certainly be handled separately from this patch).
>
> > ---
> > drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> > 1 file changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9e995094c32a..3c1a2bc68c42 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> > dma_addr_t dma_addr;
> >
> > spin_lock_init(&q->lock);
> > - q->ndesc = size;
> > q->qdma = qdma;
> > q->free_thr = 1 + MAX_SKB_FRAGS;
> > INIT_LIST_HEAD(&q->tx_list);
> >
> > - q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> > + q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> > GFP_KERNEL);
> > if (!q->entry)
> > return -ENOMEM;
> >
> > - q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> > + q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> > &dma_addr, GFP_KERNEL);
> > if (!q->desc)
> > return -ENOMEM;
> >
> > - for (i = 0; i < q->ndesc; i++) {
> > + for (i = 0; i < size; i++) {
> > u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
> >
> > list_add_tail(&q->entry[i].list, &q->tx_list);
> > WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> > }
> > + q->ndesc = size;
> >
> > /* xmit ring drop default setting */
> > airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> > @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> >
> > static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> > {
> > - struct airoha_eth *eth = q->qdma->eth;
> > - int i;
> > + struct airoha_qdma *qdma = q->qdma;
> > + struct airoha_eth *eth = qdma->eth;
> > + int i, qid = q - &qdma->q_tx[0];
> > + struct airoha_queue_entry *e;
> > + u16 index = 0;
> >
> > spin_lock_bh(&q->lock);
> > for (i = 0; i < q->ndesc; i++) {
> > - struct airoha_queue_entry *e = &q->entry[i];
>
> super nit: In v2 e is always used within a block (here and in the hunk below).
> So I would lean towards declaring e in the blocks where it is
> used.
>
> No need to repost just for this!
>
> > + struct airoha_qdma_desc *desc = &q->desc[i];
> >
> > + e = &q->entry[i];
> > if (!e->dma_addr)
> > continue;
> >
> > @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> > e->dma_addr = 0;
> > e->skb = NULL;
> > list_add_tail(&e->list, &q->tx_list);
> > +
> > + /* Reset DMA descriptor */
> > + WRITE_ONCE(desc->ctrl, 0);
> > + WRITE_ONCE(desc->addr, 0);
> > + WRITE_ONCE(desc->data, 0);
> > + WRITE_ONCE(desc->msg0, 0);
> > + WRITE_ONCE(desc->msg1, 0);
> > + WRITE_ONCE(desc->msg2, 0);
> > +
> > q->queued--;
> > }
> > +
> > + if (!list_empty(&q->tx_list)) {
> > + e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > + list);
> > + index = e - q->entry;
> > + }
> > + /* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > + * empty.
> > + */
> > + airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > + FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > + airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > + FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> > +
> > spin_unlock_bh(&q->lock);
> > }
> >
> >
> > ---
> > base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> > change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> >
> > Best regards,
> > --
> > Lorenzo Bianconi <lorenzo@kernel.org>
> >