Netdev List
* Re: [PATCH] macsec: defer RX SA cleanup from RCU callback to workqueue
From: Sabrina Dubroca @ 2026-05-06 17:55 UTC (permalink / raw)
  To: alexjlzheng
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms,
	shenyangyang4, netdev, linux-kernel, alexjlzheng
In-Reply-To: <20260506100107.388184-1-alexjlzheng@tencent.com>

2026-05-06, 18:01:07 +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> crypto_free_aead() can call vunmap() internally (e.g. via
> dma_free_attrs() in hardware crypto drivers like hisi_sec2), which
> must not be called from softirq context.

Ok.

> free_rxsa() is an RCU callback and therefore runs in softirq context,
> causing a kernel crash when the underlying AEAD implementation
> performs DMA unmapping during tfm destruction:
> 
>   vunmap+0x4c/0x70
>   __iommu_dma_free+0xd0/0x138
>   dma_free_attrs+0xf4/0x100
>   sec_aead_exit+0x64/0xb8 [hisi_sec2]
>   crypto_destroy_tfm+0x98/0x110
>   free_rxsa+0x28/0x50 [macsec]
>   rcu_do_batch+0x184/0x460
>   rcu_core+0xf4/0x1f8
>   handle_softirqs+0x118/0x330
> 
> Fix this by splitting free_rxsa() into two parts: the RCU callback
> now only schedules a work item, and the actual resource release
> (crypto_free_aead, free_percpu, kfree) is done in a workqueue
> handler running in process context.
> 
> Add a destroy_work field to struct macsec_rx_sa and initialize it
> in init_rx_sa().

TXSAs go through exactly the same process (destruct via RCU and call
crypto_free_aead). I guess they would need exactly the same fix.


> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>

Missing a Fixes tag (most likely c09440f7dcb3 ("macsec: introduce IEEE
802.1AE driver")).

>  drivers/net/macsec.c | 13 +++++++++++--
>  include/net/macsec.h |  2 ++
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index f6cad0746a02..dabd3d2598ae 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
> @@ -174,15 +174,23 @@ static void macsec_rxsc_put(struct macsec_rx_sc *sc)
>  		call_rcu(&sc->rcu_head, free_rx_sc_rcu);
>  }
>  
> -static void free_rxsa(struct rcu_head *head)
> +static void free_rxsa_work(struct work_struct *work)
>  {
> -	struct macsec_rx_sa *sa = container_of(head, struct macsec_rx_sa, rcu);
> +	struct macsec_rx_sa *sa = container_of(work, struct macsec_rx_sa,
> +					       destroy_work);
>  
>  	crypto_free_aead(sa->key.tfm);
>  	free_percpu(sa->stats);
>  	kfree(sa);
>  }
>  
> +static void free_rxsa(struct rcu_head *head)
> +{
> +	struct macsec_rx_sa *sa = container_of(head, struct macsec_rx_sa, rcu);
> +
> +	schedule_work(&sa->destroy_work);
> +}

This is quite ugly. I'd prefer to replace the call_rcu() in
macsec_rxsa_put() with schedule_work(), and then add a
synchronize_rcu() (to replace the current call_rcu()'s effects) at the
start of free_rxsa_work().

In addition, you need to modify macsec_exit() so that it waits for the
free_rxsa_work() calls. Otherwise, if they happen after the module has
finished unloading, the kernel will crash. Currently there's an
rcu_barrier() that waits for free_rxsa() running as an RCU callback, but
it won't wait for the new work.

-- 
Sabrina


* Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal
From: Cosmin Ratiu @ 2026-05-06 17:53 UTC (permalink / raw)
  To: dsahern@kernel.org, Ido Schimmel
  Cc: horms@kernel.org, edumazet@google.com, netdev@vger.kernel.org,
	davem@davemloft.net, pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <8fea4084-c9ec-472a-b8ab-ecc87e537216@kernel.org>

On Wed, 2026-05-06 at 10:26 -0600, David Ahern wrote:
> On 5/6/26 7:01 AM, Ido Schimmel wrote:
> > ... it would have been easier to review if split into
> > multiple patches (not saying you should do it). Something like:
> > 
> > 1. Change the various nexthop remove functions to return an
> > indication if flushing is required, but keep doing the flushing in
> > __remove_nexthop_fib(). Referring to these functions:
> > 
> > remove_nexthop()
> > __remove_nexthop()
> > __remove_nexthop_fib()
> > remove_nexthop_from_groups()
> > remove_nh_grp_entry()
> > 
> > 2. Act upon the flushing indication in the various callers of
> > remove_nexthop() and remove the flushing from
> > __remove_nexthop_fib().
> > 
> > 3. Add __must_check annotations.
> > 
> 
> +1. Always send the smallest patches possible to evolve the code. Make
> it easy for reviewers - and yourself should you introduce an intended
> side effect.

I didn't split it as the whole thing is tightly coupled across multiple
functions, but I will send V3 tomorrow split along the lines Ido
suggested.
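
For readers following along, the pattern Ido describes (removal helpers
report whether a flush is needed, the caller flushes once, and the return
value is marked __must_check) can be modeled in userspace C. This is only
a sketch under stated assumptions, not the nexthop code: flush_fib(),
remove_one(), remove_all() and flush_count are hypothetical names, and
must_check stands in for the kernel's __must_check annotation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's __must_check annotation. */
#define must_check __attribute__((warn_unused_result))

/* Hypothetical counter standing in for the expensive FIB flush. */
static int flush_count;

static void flush_fib(void)
{
	flush_count++;
}

/*
 * Hypothetical per-nexthop removal: it reports whether a flush is
 * required instead of flushing on its own.
 */
static must_check bool remove_one(bool has_routes)
{
	return has_routes;
}

/* Remove n nexthops and flush at most once for the whole batch. */
static void remove_all(const bool *has_routes, size_t n)
{
	bool need_flush = false;
	size_t i;

	for (i = 0; i < n; i++)
		need_flush |= remove_one(has_routes[i]);

	if (need_flush)
		flush_fib();
}
```

With the annotation in place, a caller that drops the return value of
remove_one() gets a compiler warning, which is the point of step 3.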

Thank you for taking a look!

Cosmin.


* [PATCH v2 5/5] MAINTAINERS: BITOPS: include bitrev.[ch]
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov
In-Reply-To: <20260506175207.110893-1-ynorov@nvidia.com>

Arch bitrev API is covered in MAINTAINERS under the BITOPS entry,
while generic bitrev is unmaintained. Move it under BITOPS too.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db..30214ac2f06d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4549,7 +4549,9 @@ F:	arch/*/lib/bitops.c
 F:	include/asm-generic/bitops
 F:	include/asm-generic/bitops.h
 F:	include/linux/bitops.h
+F:	include/linux/bitrev.h
 F:	include/linux/count_zeros.h
+F:	lib/bitrev.c
 F:	lib/hweight.c
 F:	lib/test_bitops.c
 F:	lib/tests/bitops_kunit.c
-- 
2.51.0



* [PATCH v2 4/5] arch/riscv: Add bitrev.h file to support rev8 and brev8
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov, David Laight
In-Reply-To: <20260506175207.110893-1-ynorov@nvidia.com>

From: Jinjie Ruan <ruanjinjie@huawei.com>

The RISC-V Bit-manipulation Extension for Cryptography (Zbkb) provides
the 'brev8' instruction, which reverses the bits within each byte.
Combined with the 'rev8' instruction (from Zbb or Zbkb), which reverses
the byte order of a register, we can efficiently implement 16-bit,
32-bit, and (on RV64) 64-bit bit reversal.
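
As an illustration only, the rev8 + brev8 composition can be modeled in
portable C. This is a userspace sketch, not the kernel code: model_rev8(),
model_brev8(), model_bitrev32(), model_bitrev16() and naive_bitrev32() are
hypothetical names, and the model assumes a 32-bit register, so no final
shift is needed for the 32-bit case.

```c
#include <stdint.h>

/* Software model of brev8: reverse the bits inside each byte. */
static uint32_t model_brev8(uint32_t x)
{
	x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
	x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
	x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);
	return x;
}

/* Software model of rev8 on a 32-bit register: reverse the byte order. */
static uint32_t model_rev8(uint32_t x)
{
	return (x << 24) | ((x & 0xff00u) << 8) |
	       ((x >> 8) & 0xff00u) | (x >> 24);
}

/* Byte reversal followed by per-byte bit reversal reverses all 32 bits. */
static uint32_t model_bitrev32(uint32_t x)
{
	return model_brev8(model_rev8(x));
}

/* The 16-bit result is the upper half of the 32-bit reversal. */
static uint16_t model_bitrev16(uint16_t x)
{
	return (uint16_t)(model_bitrev32(x) >> 16);
}

/* Reference: move bit i to bit 31 - i, one bit at a time. */
static uint32_t naive_bitrev32(uint32_t x)
{
	uint32_t r = 0;
	int i;

	for (i = 0; i < 32; i++)
		if (x & (1u << i))
			r |= 1u << (31 - i);
	return r;
}
```

Comparing model_bitrev32() against naive_bitrev32() for a few inputs is a
quick way to convince oneself that the two-instruction sequence is a full
bit reversal.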

This is significantly faster than the default software table-lookup
implementation in lib/bitrev.c, as it replaces memory accesses and
multiple arithmetic operations with just two or three hardware
instructions.

Select HAVE_ARCH_BITREVERSE as well as GENERIC_BITREVERSE,
and provide <asm/bitrev.h> to utilize these instructions when
the Zbkb extension is available at runtime via the alternatives
mechanism.

[Yury: select the options conditionally on BITREVERSE]

Link: https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html
Suggested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 arch/riscv/Kconfig              |  2 ++
 arch/riscv/include/asm/bitrev.h | 51 +++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)
 create mode 100644 arch/riscv/include/asm/bitrev.h

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index d235396c4514..a708583f785d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -104,6 +104,7 @@ config RISCV
 	select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS
 	select GENERIC_ARCH_TOPOLOGY
 	select GENERIC_ATOMIC64 if !64BIT
+	select GENERIC_BITREVERSE if HAVE_ARCH_BITREVERSE
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
 	select GENERIC_CPU_DEVICES
 	select GENERIC_CPU_VULNERABILITIES
@@ -128,6 +129,7 @@ config RISCV
 	select HAS_IOPORT if MMU
 	select HAVE_ALIGNED_STRUCT_PAGE
 	select HAVE_ARCH_AUDITSYSCALL
+	select HAVE_ARCH_BITREVERSE if RISCV_ISA_ZBKB && BITREVERSE
 	select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_HUGE_VMAP if MMU && 64BIT
 	select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/riscv/include/asm/bitrev.h b/arch/riscv/include/asm/bitrev.h
new file mode 100644
index 000000000000..4b9b8d34cc3b
--- /dev/null
+++ b/arch/riscv/include/asm/bitrev.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_BITREV_H
+#define __ASM_BITREV_H
+
+#include <linux/types.h>
+#include <asm/cpufeature-macros.h>
+#include <asm/hwcap.h>
+#include <asm-generic/bitops/__bitrev.h>
+
+static __always_inline __attribute_const__ u32 __arch_bitrev32(u32 x)
+{
+	unsigned long result;
+
+	if (!riscv_has_extension_likely(RISCV_ISA_EXT_ZBKB))
+		return generic___bitrev32(x);
+
+	asm volatile(
+		".option push\n"
+		".option arch,+zbkb\n"
+		"rev8 %0, %1\n"
+		"brev8 %0, %0\n"
+		".option pop"
+		: "=r" (result) : "r" ((long)x)
+	);
+
+	return result >> (__riscv_xlen - 32);
+}
+
+static __always_inline __attribute_const__ u16 __arch_bitrev16(u16 x)
+{
+	return __arch_bitrev32(x) >> 16;
+}
+
+static __always_inline __attribute_const__ u8 __arch_bitrev8(u8 x)
+{
+	unsigned long result;
+
+	if (!riscv_has_extension_likely(RISCV_ISA_EXT_ZBKB))
+		return generic___bitrev8(x);
+
+	asm volatile(
+		".option push\n"
+		".option arch,+zbkb\n"
+		"brev8 %0, %1\n"
+		".option pop"
+		: "=r" (result) : "r" ((long)x)
+	);
+
+	return result;
+}
+#endif
-- 
2.51.0



* [PATCH v2 3/5] bitops: Define generic___bitrev8/16/32 for reuse
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov
In-Reply-To: <20260506175207.110893-1-ynorov@nvidia.com>

From: Jinjie Ruan <ruanjinjie@huawei.com>

Define generic___bitrev8/16/32 using the implementation in
<linux/bitrev.h>, so they can be reused by arch <asm/bitrev.h>
headers, such as the RISC-V one.

Reviewed-by: Yury Norov <ynorov@nvidia.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 include/asm-generic/bitops/__bitrev.h | 23 +++++++++++++++++++++++
 include/linux/bitrev.h                | 20 ++++----------------
 2 files changed, 27 insertions(+), 16 deletions(-)
 create mode 100644 include/asm-generic/bitops/__bitrev.h

diff --git a/include/asm-generic/bitops/__bitrev.h b/include/asm-generic/bitops/__bitrev.h
new file mode 100644
index 000000000000..4addbde14050
--- /dev/null
+++ b/include/asm-generic/bitops/__bitrev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_BITOPS___BITREV_H_
+#define _ASM_GENERIC_BITOPS___BITREV_H_
+
+#include <asm/types.h>
+
+extern u8 const byte_rev_table[256];
+static __always_inline __attribute_const__ u8 generic___bitrev8(u8 byte)
+{
+	return byte_rev_table[byte];
+}
+
+static __always_inline __attribute_const__ u16 generic___bitrev16(u16 x)
+{
+	return (generic___bitrev8(x & 0xff) << 8) | generic___bitrev8(x >> 8);
+}
+
+static __always_inline __attribute_const__ u32 generic___bitrev32(u32 x)
+{
+	return (generic___bitrev16(x & 0xffff) << 16) | generic___bitrev16(x >> 16);
+}
+
+#endif /* _ASM_GENERIC_BITOPS___BITREV_H_ */
diff --git a/include/linux/bitrev.h b/include/linux/bitrev.h
index d35b8ec1c485..11620a70e776 100644
--- a/include/linux/bitrev.h
+++ b/include/linux/bitrev.h
@@ -12,22 +12,10 @@
 #define __bitrev8 __arch_bitrev8
 
 #else
-extern u8 const byte_rev_table[256];
-static inline u8 __bitrev8(u8 byte)
-{
-	return byte_rev_table[byte];
-}
-
-static inline u16 __bitrev16(u16 x)
-{
-	return (__bitrev8(x & 0xff) << 8) | __bitrev8(x >> 8);
-}
-
-static inline u32 __bitrev32(u32 x)
-{
-	return (__bitrev16(x & 0xffff) << 16) | __bitrev16(x >> 16);
-}
-
+#include <asm-generic/bitops/__bitrev.h>
+#define __bitrev32 generic___bitrev32
+#define __bitrev16 generic___bitrev16
+#define __bitrev8 generic___bitrev8
 #endif /* CONFIG_HAVE_ARCH_BITREVERSE */
 
 #define __bitrev8x4(x)	(__bitrev32(swab32(x)))
-- 
2.51.0



* [PATCH v2 2/5] lib/bitrev: Introduce GENERIC_BITREVERSE
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov
In-Reply-To: <20260506175207.110893-1-ynorov@nvidia.com>

The generic bit reversal implementation is controlled by
!HAVE_ARCH_BITREVERSE. This makes it difficult for architectures to
provide a hardware-accelerated implementation while being able to
fall back to the generic version if needed.

This patch adds GENERIC_BITREVERSE, so the bitreverse API is controlled by
the BITREVERSE, GENERIC_BITREVERSE and HAVE_ARCH_BITREVERSE options. The
relationship between them is as follows:

 - BITREVERSE is selected by user code; it's required to generate the API;
 - architectures may select HAVE_ARCH_BITREVERSE and provide an arch
   implementation in arch/$(ARCH)/include/asm/bitrev.h;
 - if HAVE_ARCH_BITREVERSE isn't set, BITREVERSE selects GENERIC_BITREVERSE;
 - if GENERIC_BITREVERSE is set and HAVE_ARCH_BITREVERSE is not, the kernel
   provides the generic implementation only, and wires bitrevXX() to it;
 - if HAVE_ARCH_BITREVERSE is set and GENERIC_BITREVERSE is not, the arch
   code provides __arch_bitrevXX(), and it is wired to bitrevXX();
 - if both GENERIC_BITREVERSE and HAVE_ARCH_BITREVERSE are selected, the kernel
   generates generic___bitrevXX(), but wires bitrevXX() to __arch_bitrevXX().

The last option allows architectures to use generic___bitrev() as a
fallback option.
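
The option relationships listed above can be written down as a small
truth-table model. This is a simplified sketch that treats every option as
a plain boolean (the tristate 'm' value is ignored); struct bitrev_cfg,
resolve() and wired_impl() are hypothetical names used only for
illustration, and arch_wants_fallback models an architecture explicitly
selecting GENERIC_BITREVERSE.

```c
#include <stdbool.h>

enum bitrev_impl {
	BITREV_NONE,	/* BITREVERSE is off: no API is generated */
	BITREV_GENERIC,	/* bitrevXX() wired to generic___bitrevXX() */
	BITREV_ARCH,	/* bitrevXX() wired to __arch_bitrevXX() */
};

struct bitrev_cfg {
	bool bitreverse;	/* BITREVERSE */
	bool have_arch;		/* HAVE_ARCH_BITREVERSE */
	bool generic;		/* GENERIC_BITREVERSE, i.e. lib/bitrev.o is built */
};

/* Derive the dependent options from the inputs an arch/user controls. */
static struct bitrev_cfg resolve(bool bitreverse, bool arch_capable,
				 bool arch_wants_fallback)
{
	struct bitrev_cfg c = { .bitreverse = bitreverse };

	/* HAVE_ARCH_BITREVERSE is only selected when BITREVERSE is on. */
	c.have_arch = bitreverse && arch_capable;

	/*
	 * BITREVERSE selects GENERIC_BITREVERSE if !HAVE_ARCH_BITREVERSE;
	 * an arch may additionally select it as a fallback.
	 */
	c.generic = bitreverse && (!c.have_arch || arch_wants_fallback);

	return c;
}

/* Which implementation bitrevXX() ends up wired to. */
static enum bitrev_impl wired_impl(struct bitrev_cfg c)
{
	if (!c.bitreverse)
		return BITREV_NONE;
	return c.have_arch ? BITREV_ARCH : BITREV_GENERIC;
}
```

In this model the last bullet corresponds to resolve(true, true, true):
the generic code is built, but the API is still wired to the arch version.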

Drivers and core code should never select GENERIC_BITREVERSE or
HAVE_ARCH_BITREVERSE explicitly.

Architectures that require generic bitreverse API as a fallback should
explicitly enable GENERIC_BITREVERSE together with HAVE_ARCH_BITREVERSE.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 lib/Kconfig  | 12 ++++++++++++
 lib/Makefile |  2 +-
 lib/bitrev.c |  3 ---
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib/Kconfig b/lib/Kconfig
index d8e7e89ae320..a33988adfaa3 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -54,6 +54,7 @@ config PACKING_KUNIT_TEST
 
 config BITREVERSE
 	tristate
+	select GENERIC_BITREVERSE if !HAVE_ARCH_BITREVERSE
 
 config HAVE_ARCH_BITREVERSE
 	bool
@@ -63,6 +64,17 @@ config HAVE_ARCH_BITREVERSE
 	  This option enables the use of hardware bit-reversal instructions on
 	  architectures which support such operations.
 
+config GENERIC_BITREVERSE
+	tristate
+	depends on BITREVERSE
+	help
+	  Generic bit reversal implementation. Drivers should never enable
+	  it explicitly. Instead, enable BITREVERSE.
+
+	  Architectures may want to select it as a fall-back option for
+	  HAVE_ARCH_BITREVERSE, when the hardware-accelerated bit reverse
+	  instruction set is optional, like RISC-V ZBKB extension.
+
 config ARCH_HAS_STRNCPY_FROM_USER
 	bool
 
diff --git a/lib/Makefile b/lib/Makefile
index f33a24bf1c19..23e07d19d01c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -145,7 +145,7 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_LIST_HARDENED) += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
-obj-$(CONFIG_BITREVERSE) += bitrev.o
+obj-$(CONFIG_GENERIC_BITREVERSE) += bitrev.o
 obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
 obj-$(CONFIG_PACKING)	+= packing.o
 obj-$(CONFIG_PACKING_KUNIT_TEST) += packing_test.o
diff --git a/lib/bitrev.c b/lib/bitrev.c
index 81b56e0a7f32..05088231f31f 100644
--- a/lib/bitrev.c
+++ b/lib/bitrev.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-only
-#ifndef CONFIG_HAVE_ARCH_BITREVERSE
 #include <linux/types.h>
 #include <linux/module.h>
 #include <linux/bitrev.h>
@@ -43,5 +42,3 @@ const u8 byte_rev_table[256] = {
 	0x1f, 0x9f, 0x5f, 0xdf, 0x3f, 0xbf, 0x7f, 0xff,
 };
 EXPORT_SYMBOL_GPL(byte_rev_table);
-
-#endif /* CONFIG_HAVE_ARCH_BITREVERSE */
-- 
2.51.0



* [PATCH v2 1/5] arch: select HAVE_ARCH_BITREVERSE conditionally on BITREVERSE
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov
In-Reply-To: <20260506175207.110893-1-ynorov@nvidia.com>

Architectures may have bit reversal instructions, but if the API is not
needed, the corresponding option should not be selected because it may
lead to generating unneeded code.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 arch/arm/Kconfig       | 2 +-
 arch/arm64/Kconfig     | 2 +-
 arch/loongarch/Kconfig | 2 +-
 arch/mips/Kconfig      | 2 +-
 lib/Kconfig            | 1 +
 5 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 71fc5dd4123f..0e963e54fe06 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -83,7 +83,7 @@ config ARM
 	select HARDIRQS_SW_RESEND
 	select HAS_IOPORT
 	select HAVE_ARCH_AUDITSYSCALL if AEABI && !OABI_COMPAT
-	select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
+	select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 && BITREVERSE
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU && (!PREEMPT_RT || !SMP)
 	select HAVE_ARCH_KFENCE if MMU && !XIP_KERNEL
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..f5bb62c2ba9c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -150,7 +150,7 @@ config ARM64
 	select HAVE_ACPI_APEI if (ACPI && EFI)
 	select HAVE_ALIGNED_STRUCT_PAGE
 	select HAVE_ARCH_AUDITSYSCALL
-	select HAVE_ARCH_BITREVERSE
+	select HAVE_ARCH_BITREVERSE if BITREVERSE
 	select HAVE_ARCH_COMPILER_H
 	select HAVE_ARCH_HUGE_VMALLOC
 	select HAVE_ARCH_HUGE_VMAP
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 3b042dbb2c41..6c3444e31c0e 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -115,7 +115,7 @@ config LOONGARCH
 	select HAS_IOPORT
 	select HAVE_ALIGNED_STRUCT_PAGE if 64BIT
 	select HAVE_ARCH_AUDITSYSCALL
-	select HAVE_ARCH_BITREVERSE if 64BIT
+	select HAVE_ARCH_BITREVERSE if 64BIT && BITREVERSE
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN if 64BIT
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 4364f3dba688..7e1494e0dbfa 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2026,7 +2026,7 @@ config CPU_MIPSR6
 	default y if CPU_MIPS32_R6 || CPU_MIPS64_R6
 	select CPU_HAS_RIXI
 	select CPU_HAS_DIEI if !CPU_DIEI_BROKEN
-	select HAVE_ARCH_BITREVERSE
+	select HAVE_ARCH_BITREVERSE if BITREVERSE
 	select MIPS_ASID_BITS_VARIABLE
 	select MIPS_SPRAM
 
diff --git a/lib/Kconfig b/lib/Kconfig
index 00a9509636c1..d8e7e89ae320 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -58,6 +58,7 @@ config BITREVERSE
 config HAVE_ARCH_BITREVERSE
 	bool
 	default n
+	depends on BITREVERSE
 	help
 	  This option enables the use of hardware bit-reversal instructions on
 	  architectures which support such operations.
-- 
2.51.0



* (no subject)
From: Yury Norov @ 2026-05-06 17:52 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Yury Norov, Rasmus Villemoes, Arnd Bergmann, Eric Biggers,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Jinjie Ruan, linux-kernel, linux-riscv, linux-arch, netdev, bpf
  Cc: Yury Norov

Subject: [PATCH v2 0/5] lib: rework bitreverse

Clean up the bitreverse-related Kconfig dependency mechanism, and add a new
GENERIC_BITREVERSE option to allow architectures to pick the generic
implementation as a fallback for the arch one.

Make RISC-V pick the generic implementation in the absence of ZBKB.

v1: https://lore.kernel.org/all/20260430211351.658193-1-ynorov@nvidia.com/
v2:
 - don't protect headers with the corresponding configs (Arnd, Erik);
 - make HAVE_ARCH_BITREVERSE conditional on BITREVERSE;
 - make GENERIC_BITREVERSE tri-state (sashiko);
 - re-implement GENERIC_BITREVERSE and its relation to BITREVERSE and
   HAVE_ARCH_BITREVERSE, thus taking over the authorship;
 - RISCV: select GENERIC_BITREVERSE and HAVE_ARCH_BITREVERSE depending
   on BITREVERSE;

Jinjie Ruan (2):
  bitops: Define generic___bitrev8/16/32 for reuse
  arch/riscv: Add bitrev.h file to support rev8 and brev8

Yury Norov (3):
  arch: select HAVE_ARCH_BITREVERSE conditionally on BITREVERSE
  lib/bitrev: Introduce GENERIC_BITREVERSE
  MAINTAINERS: BITOPS: include bitrev.[ch]

 MAINTAINERS                           |  2 ++
 arch/arm/Kconfig                      |  2 +-
 arch/arm64/Kconfig                    |  2 +-
 arch/loongarch/Kconfig                |  2 +-
 arch/mips/Kconfig                     |  2 +-
 arch/riscv/Kconfig                    |  2 ++
 arch/riscv/include/asm/bitrev.h       | 51 +++++++++++++++++++++++++++
 include/asm-generic/bitops/__bitrev.h | 23 ++++++++++++
 include/linux/bitrev.h                | 20 +++--------
 lib/Kconfig                           | 13 +++++++
 lib/Makefile                          |  2 +-
 lib/bitrev.c                          |  3 --
 12 files changed, 100 insertions(+), 24 deletions(-)
 create mode 100644 arch/riscv/include/asm/bitrev.h
 create mode 100644 include/asm-generic/bitops/__bitrev.h

-- 
2.51.0



* Re: [PATCH net v3] af_unix: Reject SIOCATMARK on non-stream sockets
From: Kuniyuki Iwashima @ 2026-05-06 17:51 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, davem, edumazet, kuba, pabeni, horms, rao.shoaib,
	yuantan098, yifanwucs, tomapufckgml, bird, wangjiexun2025
In-Reply-To: <20260506140825.2987635-1-n05ec@lzu.edu.cn>

On Wed, May 6, 2026 at 7:08 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Jiexun Wang <wangjiexun2025@gmail.com>
>
> SIOCATMARK reports whether the receive queue is at the urgent mark for
> MSG_OOB.
>
> In AF_UNIX, MSG_OOB is supported only for SOCK_STREAM sockets.
> SOCK_DGRAM and SOCK_SEQPACKET reject MSG_OOB in sendmsg() and recvmsg(),
> so they should not support SIOCATMARK either.
>
> Return -EOPNOTSUPP for non-stream sockets before checking the receive
> queue.
>
> Fixes: 314001f0bf92 ("af_unix: Add OOB support")
> Cc: stable@kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 10/12] net: stmmac: tc956x: add TC956x/QPS615 support
From: Alex Elder @ 2026-05-06 17:44 UTC (permalink / raw)
  To: Xilin Wu, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, rmk+kernel, andersson, konradybcio, robh,
	krzk+dt, conor+dt, linusw, brgl, arnd, gregkh
  Cc: Daniel Thompson, mohd.anwar, a0987203069, alexandre.torgue, ast,
	boon.khai.ng, chenchuangyu, chenhuacai, daniel, hawk, hkallweit1,
	inochiama, john.fastabend, julianbraha, livelycarpet87,
	matthew.gerlach, mcoquelin.stm32, me, prabhakar.mahadev-lad.rj,
	richardcochran, rohan.g.thomas, sdf, siyanteng, weishangjuan,
	wens, netdev, bpf, linux-arm-msm, devicetree, linux-gpio,
	linux-stm32, linux-arm-kernel, linux-kernel
In-Reply-To: <DD71CDEABC7C16D5+02d052ff-13bb-4712-a847-91416f76c578@radxa.com>

On 5/5/26 9:30 PM, Xilin Wu wrote:
> On 5/1/2026 11:54 PM, Alex Elder wrote:
>> From: Daniel Thompson <daniel@riscstar.com>
>>
>> Toshiba TC956x is an Ethernet AVB/TSN bridge and is essentially a
>> small and highly-specialized SoC. TC956x includes an "eMAC" subsystem
>> that can be accessed, along with several other peripherals, via two
>> PCIe endpoint functions. There is a main driver for the endpoint that
>> decomposes things and creates auxiliary bus devices to model the SoC.
>>
>> The eMAC consists of a Designware XGMAC, XPCS and PMA. Each eMAC is
>> supported by an MSIGEN that bridges TC956x level interrupts to PCIe
>> MSIs.
>>
>> Add a driver for the eMAC/MSIGEN combination.
>>
>> Co-developed-by: Alex Elder <elder@riscstar.com>
>> Signed-off-by: Alex Elder <elder@riscstar.com>
>> Signed-off-by: Daniel Thompson <daniel@riscstar.com>
>> ---
>>   drivers/net/ethernet/stmicro/stmmac/Kconfig   |  13 +
>>   drivers/net/ethernet/stmicro/stmmac/Makefile  |   2 +
>>   .../ethernet/stmicro/stmmac/dwmac-tc956x.c    | 791 ++++++++++++++++++
>>   include/soc/toshiba/tc956x-dwmac.h            |  84 ++
>>   4 files changed, 890 insertions(+)
>>   create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-tc956x.c
>>   create mode 100644 include/soc/toshiba/tc956x-dwmac.h
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> index e3dd5adda5aca..66bcfaccbe21f 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> +++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> @@ -404,6 +404,19 @@ config DWMAC_MOTORCOMM
>>         This enables glue driver for Motorcomm DWMAC-based PCI Ethernet
>>         controllers. Currently only YT6801 is supported.
>> +config DWMAC_TC956X
>> +    tristate "Toshiba TC956X DWMAC support"
>> +    depends on PCI
>> +    depends on COMMON_CLK
>> +    depends on TOSHIBA_TC956X_PCI
>> +    default m if TOSHIBA_TC956X_PCI
> 
> Hi Alex,
> 
> I think GENERIC_IRQ_CHIP should be selected here.

Yes, there are a number of things missing in the Kconfig definitions,
and I'm working through them this week.  And yes, since we use
irq_generic_chip_ops we must ensure CONFIG_GENERIC_IRQ_CHIP is
enabled here.

> Thank you for the driver.

Thank you for your feedback (this and others I see).

					-Alex




* Re: [PATCH] macsec: defer RX SA cleanup from RCU callback to workqueue
From: Kuniyuki Iwashima @ 2026-05-06 17:41 UTC (permalink / raw)
  To: alexjlzheng
  Cc: alexjlzheng, andrew+netdev, davem, edumazet, horms, kuba,
	linux-kernel, netdev, pabeni, sd, shenyangyang4
In-Reply-To: <20260506100107.388184-1-alexjlzheng@tencent.com>

From: alexjlzheng@gmail.com
Date: Wed,  6 May 2026 18:01:07 +0800
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> crypto_free_aead() can call vunmap() internally (e.g. via
> dma_free_attrs() in hardware crypto drivers like hisi_sec2), which
> must not be called from softirq context.
> 
> free_rxsa() is an RCU callback and therefore runs in softirq context,
> causing a kernel crash when the underlying AEAD implementation
> performs DMA unmapping during tfm destruction:
> 
>   vunmap+0x4c/0x70
>   __iommu_dma_free+0xd0/0x138
>   dma_free_attrs+0xf4/0x100
>   sec_aead_exit+0x64/0xb8 [hisi_sec2]
>   crypto_destroy_tfm+0x98/0x110
>   free_rxsa+0x28/0x50 [macsec]
>   rcu_do_batch+0x184/0x460
>   rcu_core+0xf4/0x1f8
>   handle_softirqs+0x118/0x330
> 
> Fix this by splitting free_rxsa() into two parts: the RCU callback
> now only schedules a work item, and the actual resource release
> (crypto_free_aead, free_percpu, kfree) is done in a workqueue
> handler running in process context.
> 
> Add a destroy_work field to struct macsec_rx_sa and initialize it
> in init_rx_sa().
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
> ---
>  drivers/net/macsec.c | 13 +++++++++++--
>  include/net/macsec.h |  2 ++
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index f6cad0746a02..dabd3d2598ae 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
> @@ -174,15 +174,23 @@ static void macsec_rxsc_put(struct macsec_rx_sc *sc)
>  		call_rcu(&sc->rcu_head, free_rx_sc_rcu);
>  }
>  
> -static void free_rxsa(struct rcu_head *head)
> +static void free_rxsa_work(struct work_struct *work)
>  {
> -	struct macsec_rx_sa *sa = container_of(head, struct macsec_rx_sa, rcu);
> +	struct macsec_rx_sa *sa = container_of(work, struct macsec_rx_sa,
> +					       destroy_work);
>  
>  	crypto_free_aead(sa->key.tfm);
>  	free_percpu(sa->stats);
>  	kfree(sa);
>  }
>  
> +static void free_rxsa(struct rcu_head *head)
> +{
> +	struct macsec_rx_sa *sa = container_of(head, struct macsec_rx_sa, rcu);
> +
> +	schedule_work(&sa->destroy_work);

rcu_work is what you want.


* Re: [RFC net-next 0/4] devlink: Add boot-time defaults
From: Mark Bloch @ 2026-05-06 17:35 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Morton,
	Borislav Petkov (AMD), Randy Dunlap, Dave Hansen,
	Christian Brauner, Petr Mladek, Peter Zijlstra (Intel),
	Thomas Gleixner, Pawan Gupta, Dapeng Mi, Kees Cook, Marco Elver,
	Eric Biggers, Li RongQing, Paul E. McKenney, linux-doc,
	linux-kernel, netdev, linux-rdma
In-Reply-To: <aftaW-irGmkfA7FS@FV6GYCPJ69>



On 06/05/2026 18:22, Jiri Pirko wrote:
> Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>> This series adds a devlink= kernel command line parameter for applying
>> selected devlink settings during device initialization.
>>
>> Following a discussion with Jakub[1], I am sending this RFC to get the
>> conversation moving. I started from Jakub's example/request and extended
>> it to cover requirements from production systems and configurations that
>> customers use.
>>
>> One important caveat is that the parsing logic in this RFC was written
>> with AI assistance. I am also not sure whether the resulting syntax and
>> parser are too complex for a kernel command line interface. This is part
>> of why I am sending it as an RFC: to understand what direction and level
>> of complexity would be acceptable to people.
>>
>> The implementation is intended to support the following properties:
>>
>> - A system may have multiple devlink devices that usually need the same
>>  configuration. For a configuration such as eswitch mode switchdev, a
>>  user should be able to specify multiple devices to which that
>>  configuration applies.
>>
>> - There may be ordering dependencies between options. For example, in
>>  mlx5, flow_steering_mode should be set before moving to switchdev.
>>  With this in mind, defaults are applied per device in the left-to-right
>>  order in which they appear on the command line.
>>
>> The intent is to let deployments set devlink defaults before normal
>> userspace orchestration runs, while still using devlink concepts and
> 
> "defaults before normal userspace orchestration". I read it as config
> before config, which eventually could be skipped.
> 
> 
>> driver callbacks rather than adding driver-specific module parameters.
>> A default is scoped to one or more devlink handles, for example:
>>
>>  devlink=[pci/0000:08:00.0]:esw:mode:switchdev
>>  devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
>>  devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
> 
> I don't like this. What you are doing is basically introducing a user
> configuration tool on the kernel cmdline.
> 
> You would achieve the same with a proper userspace tool/daemon.
> I did try to come up with one and push it here:
> https://github.com/systemd/systemd/pull/37393
> That didn't get merged for an unknown reason, but the idea is sound. You
> provide configuration files for devlink objects and systemd-devlinkd
> will apply them when they appear. Wouldn't this help your case?

I agree that systemd-devlinkd is the right shape for normal
devlink configuration, and it could probably replace the udev/devlink
plumbing we use today.

The case I am trying to cover is earlier than that.

On BlueField/ECPF/DPU systems, the host PF driver cannot always finish
probing independently of the ECPF side. When the ECPF is the eswitch
manager, the host PF is kept in the initializing state until the ECPF eswitch
side is set up and mlx5 enables the external host PF HCA. That happens as
part of moving the ECPF to switchdev.

Today userspace observes the ECPF instance and then switches the
mode through devlink, usually via udev or similar plumbing. That still
leaves a window where the ECPF has probed, userspace has not applied the
mode yet, and the host PF is waiting. With many ECPFs this becomes visible
in host PF probe/boot time. A daemon reacting to the devlink object
appearing can make the userspace side cleaner, but it still runs after the
device has appeared and after userspace scheduling/uevent handling.

Long term, for these DPU deployments, we would like mlx5 to initialize
directly in switchdev. I am hesitant to make that unconditional because it
changes existing behavior and there is no early opt-out before probe. The
cmdline parameter was meant as an explicit opt-in middle step: ask the
driver to apply the same devlink operation during init, before this path
depends on userspace.

We previously tried to address this with an mlx5 module parameter. By
design, that was too coarse: it applied to all mlx5 devices handled by the
module. That makes it usable only for narrow DPU-only configurations. The
devlink-handle based cmdline syntax was intended to keep the opt-in scoped
to the specific devices that need this early switchdev transition.
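
For reference, a minimal user-space sketch of how one entry of the proposed
syntax could be split into its parts. This is not the RFC's parser:
parse_devlink_default(), struct devlink_default and the size limits are
hypothetical names chosen for illustration, and the caller is assumed to have
already separated the command line into individual entries. The handle list
is consumed first because PCI handles contain ':' themselves.

```c
#include <assert.h>
#include <string.h>

#define MAX_HANDLES 8	/* hypothetical limit */
#define MAX_FIELD 64	/* hypothetical limit, including the NUL */

struct devlink_default {
	char handles[MAX_HANDLES][MAX_FIELD];
	int nr_handles;
	char kind[MAX_FIELD];	/* "esw" or "param" in the examples */
	char name[MAX_FIELD];
	char value[MAX_FIELD];
};

/* Copy [start, end) into dst; reject empty or oversized fields. */
static int copy_field(char *dst, const char *start, const char *end)
{
	size_t len = (size_t)(end - start);

	if (len == 0 || len >= MAX_FIELD)
		return -1;
	memcpy(dst, start, len);
	dst[len] = '\0';
	return 0;
}

/* Parse "[handle,handle,...]:kind:name:value"; return 0 or -1. */
static int parse_devlink_default(const char *entry, struct devlink_default *out)
{
	const char *p, *rb, *sep, *c1, *c2;

	assert(entry && out);
	memset(out, 0, sizeof(*out));

	if (entry[0] != '[')
		return -1;
	rb = strchr(entry, ']');
	if (!rb || rb[1] != ':')
		return -1;

	/* Comma-separated handles inside the brackets, left to right. */
	for (p = entry + 1; p < rb; p = sep + 1) {
		sep = memchr(p, ',', (size_t)(rb - p));
		if (!sep)
			sep = rb;
		if (out->nr_handles == MAX_HANDLES ||
		    copy_field(out->handles[out->nr_handles], p, sep) < 0)
			return -1;
		out->nr_handles++;
	}
	if (out->nr_handles == 0)
		return -1;

	/* The remainder is exactly three ':'-separated fields. */
	p = rb + 2;
	c1 = strchr(p, ':');
	c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	if (!c2)
		return -1;
	if (copy_field(out->kind, p, c1) < 0 ||
	    copy_field(out->name, c1 + 1, c2) < 0 ||
	    copy_field(out->value, c2 + 1, c2 + 1 + strlen(c2 + 1)) < 0)
		return -1;
	return 0;
}
```

With the third example from the cover letter, the first entry would yield two
handles plus kind "param", name "flow_steering_mode" and value "hmfs". Entries
would then be applied per device in the order they were parsed.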

Mark

> 
> [..]



* [PATCH nf-next v2 6/6] selftests: netfilter: nft_flowtable.sh: Add SIT flowtable selftest
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Similar to IPIP, IP6IP6 and IPv4 over IPv6, introduce a specific selftest
for SIT flowtable sw acceleration in nft_flowtable.sh.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 tools/testing/selftests/net/netfilter/config       |  1 +
 .../selftests/net/netfilter/nft_flowtable.sh       | 45 ++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/tools/testing/selftests/net/netfilter/config b/tools/testing/selftests/net/netfilter/config
index 979cff56e1f5..c46604574653 100644
--- a/tools/testing/selftests/net/netfilter/config
+++ b/tools/testing/selftests/net/netfilter/config
@@ -30,6 +30,7 @@ CONFIG_IP_SCTP=m
 CONFIG_IPV6=y
 CONFIG_IPV6_MULTIPLE_TABLES=y
 CONFIG_IPV6_TUNNEL=m
+CONFIG_IPV6_SIT=m
 CONFIG_IP_VS=m
 CONFIG_IP_VS_PROTO_TCP=y
 CONFIG_IP_VS_RR=m
diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
index 219339dbaf6e..6527e27b9121 100755
--- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
@@ -597,6 +597,10 @@ ip -net "$nsr1" addr add 192.168.210.1/24 dev tun6
 ip -net "$nsr1" addr add fee1:3::1/64 dev tun6 nodad
 ip netns exec "$nsr1" sysctl net.ipv4.conf.tun6.forwarding=1 > /dev/null
 
+ip -net "$nsr1" link add name sit1 type sit local 192.168.10.1 remote 192.168.10.2 ttl 255
+ip -net "$nsr1" link set sit1 up
+ip -net "$nsr1" addr add fe01:3::1/64 dev sit1 nodad
+
 ip -net "$nsr2" link add name tun0 type ipip local 192.168.10.2 remote 192.168.10.1
 ip -net "$nsr2" link set tun0 up
 ip -net "$nsr2" addr add 192.168.100.2/24 dev tun0
@@ -608,6 +612,10 @@ ip -net "$nsr2" addr add 192.168.210.2/24 dev tun6
 ip -net "$nsr2" addr add fee1:3::2/64 dev tun6 nodad
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun6.forwarding=1 > /dev/null
 
+ip -net "$nsr2" link add name sit1 type sit local 192.168.10.2 remote 192.168.10.1 ttl 255
+ip -net "$nsr2" link set sit1 up
+ip -net "$nsr2" addr add fe01:3::2/64 dev sit1 nodad
+
 ip -net "$nsr1" route change default via 192.168.100.2
 ip -net "$nsr2" route change default via 192.168.100.1
 
@@ -622,6 +630,7 @@ ip -6 -net "$ns2" route add default via dead:2::1
 
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun0 accept'
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun6 accept'
+ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif sit1 accept'
 ip netns exec "$nsr1" nft -a insert rule inet filter forward \
 	'meta oif "veth0" tcp sport 12345 ct mark set 1 flow add @f1 counter name routed_repl accept'
 
@@ -648,6 +657,19 @@ if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 "IP6IP4 tunnel"; then
 	ret=1
 fi
 
+ip -6 -net "$nsr1" route delete default
+ip -6 -net "$nsr1" route add default via fe01:3::2
+ip -6 -net "$nsr2" route delete default
+ip -6 -net "$nsr2" route add default via fe01:3::1
+
+if test_tcp_forwarding "$ns1" "$ns2" 1 6 "[dead:2::99]" 12345; then
+	echo "PASS: flow offload for ns1/ns2 SIT tunnel"
+else
+	echo "FAIL: flow offload for ns1/ns2 with SIT tunnel" 1>&2
+	ip netns exec "$nsr1" nft list ruleset
+	ret=1
+fi
+
 # Create vlan tagged devices for IPIP traffic.
 ip -net "$nsr1" link add link veth1 name veth1.10 type vlan id 10
 ip -net "$nsr1" link set veth1.10 up
@@ -672,6 +694,11 @@ ip -6 -net "$nsr1" route delete default
 ip -6 -net "$nsr1" route add default via fee1:5::2
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun6.10 accept'
 
+ip -net "$nsr1" link add name sit1.10 type sit local 192.168.20.1 remote 192.168.20.2 ttl 255
+ip -net "$nsr1" link set sit1.10 up
+ip -net "$nsr1" addr add fe01:5::1/64 dev sit1.10 nodad
+ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif sit1.10 accept'
+
 ip -net "$nsr2" link add link veth0 name veth0.10 type vlan id 10
 ip -net "$nsr2" link set veth0.10 up
 ip -net "$nsr2" addr add 192.168.20.2/24 dev veth0.10
@@ -689,6 +716,11 @@ ip -net "$nsr2" link set tun6.10 up
 ip -net "$nsr2" addr add 192.168.220.2/24 dev tun6.10
 ip -net "$nsr2" addr add fee1:5::2/64 dev tun6.10 nodad
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun6/10.forwarding=1 > /dev/null
+
+ip -net "$nsr2" link add name sit1.10 type sit local 192.168.20.2 remote 192.168.20.1 ttl 255
+ip -net "$nsr2" link set sit1.10 up
+ip -net "$nsr2" addr add fe01:5::2/64 dev sit1.10 nodad
+
 ip -6 -net "$nsr2" route delete default
 ip -6 -net "$nsr2" route add default via fee1:5::1
 
@@ -715,6 +747,19 @@ if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 "IP6IP4 tunnel over vlan"; then
 	ret=1
 fi
 
+ip -6 -net "$nsr1" route delete default
+ip -6 -net "$nsr1" route add default via fe01:5::2
+ip -6 -net "$nsr2" route delete default
+ip -6 -net "$nsr2" route add default via fe01:5::1
+
+if test_tcp_forwarding "$ns1" "$ns2" 1 6 "[dead:2::99]" 12345; then
+	echo "PASS: flow offload for ns1/ns2 SIT tunnel over vlan"
+else
+	echo "FAIL: flow offload for ns1/ns2 with SIT tunnel over vlan" 1>&2
+	ip netns exec "$nsr1" nft list ruleset
+	ret=1
+fi
+
 # Restore the previous configuration
 ip -net "$nsr1" route change default via 192.168.10.2
 ip -net "$nsr2" route change default via 192.168.10.1

-- 
2.54.0



* [PATCH nf-next v2 5/6] net: netfilter: Add SIT tunnel flowtable acceleration
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Introduce sw flowtable acceleration for the TX/RX paths of
SIT tunnels, relying on the netfilter flowtable infrastructure.
The feature can be tested with a forwarding scenario between two
NICs (eth0 and eth1), where a SIT tunnel is used to reach a remote
site via eth1 as the underlay device:

    ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.2.2)

[IP configuration]

6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:1::2/64 scope global nodad
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.1/24 scope global eth1
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/sit 192.168.2.1 peer 192.168.2.2
    inet6 2001:db8:200::1/64 scope global nodad
       valid_lft forever preferred_lft forever

$ ip route show
192.168.2.0/24 dev eth1 proto kernel scope link src 192.168.2.1

$ ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:200::/64 dev tun0 proto kernel metric 256 pref medium
default via 2001:db8:200::2 dev tun0 metric 1024 pref medium

$ nft list ruleset
table inet filter {
    flowtable ft {
        hook ingress priority filter
        devices = { eth0, eth1 }
    }

    chain forward {
        type filter hook forward priority filter; policy accept;
        meta l4proto { tcp, udp } flow add @ft
    }
}

When reproducing this scenario using veth interfaces, the following
results were observed:

- TCP stream received from SIT tunnel:
  - net-next (baseline):                ~118 Gbps
  - net-next + SIT flowtable support:   ~148 Gbps

- TCP stream transmitted to SIT tunnel:
  - net-next (baseline):                ~131 Gbps
  - net-next + SIT flowtable support:   ~147 Gbps
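
The tun0 MTU of 1480 in the configuration above follows from the 20-byte
outer IPv4 header that SIT adds on a 1500-byte underlay, and the IPv6 forward
path in this patch applies the same kind of deduction before its MTU check.
A minimal user-space sketch of that arithmetic follows; the function name and
the header-length macros are illustrative and are not taken from the kernel
sources.

```c
#include <assert.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6, AF_UNSPEC */

#define SKETCH_IPV4_HDR_LEN	20	/* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN	40	/* fixed IPv6 header */
#define SKETCH_ENCAP_LIMIT_LEN	8	/* encap limit destination option */

/*
 * Return the MTU left for the inner packet once the outer tunnel header
 * selected by encap_family has been accounted for.
 */
static int flow_mtu_after_encap(int mtu, int encap_family, int encap_limit)
{
	switch (encap_family) {
	case AF_INET:		/* IPv4 outer header, e.g. SIT or IPIP */
		mtu -= SKETCH_IPV4_HDR_LEN;
		break;
	case AF_INET6:		/* IPv6 outer header, e.g. ip6tnl */
		mtu -= SKETCH_IPV6_HDR_LEN;
		if (encap_limit > 0)
			mtu -= SKETCH_ENCAP_LIMIT_LEN;
		break;
	default:		/* no tunnel on the output path */
		break;
	}
	return mtu;
}

/* Usage: the cases distinguished by the switch statement. */
static void mtu_examples(void)
{
	assert(flow_mtu_after_encap(1500, AF_INET, 0) == 1480);
	assert(flow_mtu_after_encap(1500, AF_INET6, 4) == 1452);
	assert(flow_mtu_after_encap(1500, AF_UNSPEC, 0) == 1500);
}
```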

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/ipv6/sit.c                   |  26 ++++
 net/netfilter/nf_flow_table_ip.c | 304 ++++++++++++++++++++++-----------------
 2 files changed, 196 insertions(+), 134 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 201347b4e127..d1d5ff385d6f 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1362,6 +1362,31 @@ ipip6_tunnel_ctl(struct net_device *dev, struct ip_tunnel_parm_kern *p,
 	}
 }
 
+static int ipip6_tunnel_fill_forward_path(struct net_device_path_ctx *ctx,
+					  struct net_device_path *path)
+{
+	struct ip_tunnel *tunnel = netdev_priv(ctx->dev);
+	const struct iphdr *tiph = &tunnel->parms.iph;
+	struct rtable *rt;
+
+	rt = ip_route_output(dev_net(ctx->dev), tiph->daddr, 0, 0, 0,
+			     RT_SCOPE_UNIVERSE);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+
+	path->type = DEV_PATH_TUN;
+	path->tun.src_v4.s_addr = tiph->saddr;
+	path->tun.dst_v4.s_addr = tiph->daddr;
+	path->tun.l3_proto = IPPROTO_IPV6;
+	path->tun.encap_proto = AF_INET;
+	path->dev = ctx->dev;
+
+	ctx->dev = rt->dst.dev;
+	ip_rt_put(rt);
+
+	return 0;
+}
+
 static int
 ipip6_tunnel_siocdevprivate(struct net_device *dev, struct ifreq *ifr,
 			    void __user *data, int cmd)
@@ -1398,6 +1423,7 @@ static const struct net_device_ops ipip6_netdev_ops = {
 	.ndo_siocdevprivate = ipip6_tunnel_siocdevprivate,
 	.ndo_get_iflink = ip_tunnel_get_iflink,
 	.ndo_tunnel_ctl = ipip6_tunnel_ctl,
+	.ndo_fill_forward_path = ipip6_tunnel_fill_forward_path,
 };
 
 static void ipip6_dev_free(struct net_device *dev)
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 6394f4474f43..0ad2b35d5f35 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -336,8 +336,8 @@ static bool nf_flow_ip4_tunnel_proto(struct nf_flowtable_ctx *ctx,
 	if (iph->ttl <= 1)
 		return false;
 
-	if (iph->protocol == IPPROTO_IPIP) {
-		ctx->tun.proto = IPPROTO_IPIP;
+	if (iph->protocol == IPPROTO_IPIP || iph->protocol == IPPROTO_IPV6) {
+		ctx->tun.proto = iph->protocol;
 		ctx->tun.hdr_size = size;
 		ctx->offset += size;
 	}
@@ -485,21 +485,6 @@ static unsigned int nf_flow_queue_xmit(struct net *net, struct sk_buff *skb,
 	return NF_STOLEN;
 }
 
-static struct flow_offload_tuple_rhash *
-nf_flow_offload_lookup(struct nf_flowtable_ctx *ctx,
-		       struct nf_flowtable *flow_table, struct sk_buff *skb)
-{
-	struct flow_offload_tuple tuple = {};
-
-	if (!nf_flow_skb_encap_protocol(ctx, skb, htons(ETH_P_IP)))
-		return NULL;
-
-	if (nf_flow_tuple_ip(ctx, skb, &tuple) < 0)
-		return NULL;
-
-	return flow_offload_lookup(flow_table, &tuple);
-}
-
 static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 				   struct nf_flowtable *flow_table,
 				   struct flow_offload_tuple_rhash *tuplehash,
@@ -602,19 +587,33 @@ static int nf_flow_tunnel_ipip_push(struct net *net, struct sk_buff *skb,
 				    struct flow_offload_tuple *tuple,
 				    __be32 *ip_daddr)
 {
-	struct iphdr *iph = (struct iphdr *)skb_network_header(skb);
 	struct rtable *rt = dst_rtable(tuple->dst_cache);
-	u8 tos = iph->tos, ttl = iph->ttl;
-	__be16 frag_off = iph->frag_off;
-	u32 headroom = sizeof(*iph);
+	__be16 frag_off = 0;
+	struct iphdr *iph;
+	u8 tos = 0, ttl;
+	u32 headroom;
 	int err;
 
+	if (tuple->tun.l3_proto == IPPROTO_IPV6) {
+		struct ipv6hdr *ip6h;
+
+		ip6h = (struct ipv6hdr *)skb_network_header(skb);
+		tos = ipv6_get_dsfield(ip6h);
+		ttl = ip6h->hop_limit;
+	} else {
+		iph = (struct iphdr *)skb_network_header(skb);
+		frag_off = iph->frag_off;
+		tos = iph->tos;
+		ttl = iph->ttl;
+	}
+
 	err = iptunnel_handle_offloads(skb, SKB_GSO_IPXIP4);
 	if (err)
 		return err;
 
-	skb_set_inner_ipproto(skb, IPPROTO_IPIP);
-	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
+	skb_set_inner_ipproto(skb, tuple->tun.l3_proto);
+	headroom = sizeof(*iph) + LL_RESERVED_SPACE(rt->dst.dev) +
+		   rt->dst.header_len;
 	err = skb_cow_head(skb, headroom);
 	if (err)
 		return err;
@@ -625,6 +624,7 @@ static int nf_flow_tunnel_ipip_push(struct net *net, struct sk_buff *skb,
 	/* Push down and install the IP header. */
 	skb_push(skb, sizeof(*iph));
 	skb_reset_network_header(skb);
+	skb->protocol = htons(ETH_P_IP);
 
 	iph = ip_hdr(skb);
 	iph->version	= 4;
@@ -781,112 +781,6 @@ static int nf_flow_encap_push(struct sk_buff *skb,
 	return 0;
 }
 
-unsigned int
-nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
-			const struct nf_hook_state *state)
-{
-	int encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
-	struct flow_offload_tuple_rhash *tuplehash;
-	struct nf_flowtable *flow_table = priv;
-	struct flow_offload_tuple *other_tuple;
-	enum flow_offload_tuple_dir dir;
-	struct nf_flowtable_ctx ctx = {
-		.in	= state->in,
-	};
-	struct nf_flow_xmit xmit = {};
-	struct in6_addr *ip6_daddr;
-	struct flow_offload *flow;
-	struct neighbour *neigh;
-	struct rtable *rt;
-	__be32 ip_daddr;
-	int ret;
-
-	tuplehash = nf_flow_offload_lookup(&ctx, flow_table, skb);
-	if (!tuplehash)
-		return NF_ACCEPT;
-
-	ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb,
-				      encap_limit);
-	if (ret < 0)
-		return NF_DROP;
-	else if (ret == 0)
-		return NF_ACCEPT;
-
-	if (unlikely(tuplehash->tuple.xmit_type == FLOW_OFFLOAD_XMIT_XFRM)) {
-		rt = dst_rtable(tuplehash->tuple.dst_cache);
-		memset(skb->cb, 0, sizeof(struct inet_skb_parm));
-		IPCB(skb)->iif = skb->dev->ifindex;
-		IPCB(skb)->flags = IPSKB_FORWARDED;
-		return nf_flow_xmit_xfrm(skb, state, &rt->dst);
-	}
-
-	dir = tuplehash->tuple.dir;
-	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
-	other_tuple = &flow->tuplehash[!dir].tuple;
-	ip_daddr = other_tuple->src_v4.s_addr;
-
-	if (other_tuple->tun.encap_proto == AF_INET6) {
-		if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
-					   &ip6_daddr,
-					   IPV6_DEFAULT_TNL_ENCAP_LIMIT) < 0)
-			return NF_DROP;
-	} else if (nf_flow_tunnel_v4_push(state->net, skb, other_tuple,
-					  &ip_daddr) < 0) {
-		return NF_DROP;
-	}
-
-	if (nf_flow_encap_push(skb, other_tuple) < 0)
-		return NF_DROP;
-
-	switch (tuplehash->tuple.xmit_type) {
-	case FLOW_OFFLOAD_XMIT_NEIGH: {
-		struct dst_entry *dst;
-
-		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.ifidx);
-		if (!xmit.outdev) {
-			flow_offload_teardown(flow);
-			return NF_DROP;
-		}
-		if (other_tuple->tun.encap_proto == AF_INET6 ||
-		    ctx.tun.proto == IPPROTO_IPV6) {
-			struct rt6_info *rt6;
-
-			rt6 = dst_rt6_info(tuplehash->tuple.dst_cache);
-			neigh = ip_neigh_gw6(rt6->dst.dev,
-					     rt6_nexthop(rt6, ip6_daddr));
-			dst = &rt6->dst;
-		} else {
-			rt = dst_rtable(tuplehash->tuple.dst_cache);
-			neigh = ip_neigh_gw4(rt->dst.dev,
-					     rt_nexthop(rt, ip_daddr));
-			dst = &rt->dst;
-		}
-		if (IS_ERR(neigh)) {
-			flow_offload_teardown(flow);
-			return NF_DROP;
-		}
-		xmit.dest = neigh->ha;
-		skb_dst_set_noref(skb, dst);
-		break;
-	}
-	case FLOW_OFFLOAD_XMIT_DIRECT:
-		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.out.ifidx);
-		if (!xmit.outdev) {
-			flow_offload_teardown(flow);
-			return NF_DROP;
-		}
-		xmit.dest = tuplehash->tuple.out.h_dest;
-		xmit.source = tuplehash->tuple.out.h_source;
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		return NF_DROP;
-	}
-
-	return nf_flow_queue_xmit(state->net, skb, &xmit);
-}
-EXPORT_SYMBOL_GPL(nf_flow_offload_ip_hook);
-
 static void nf_flow_nat_ipv6_tcp(struct sk_buff *skb, unsigned int thoff,
 				 struct in6_addr *addr,
 				 struct in6_addr *new_addr,
@@ -1071,10 +965,17 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (flow->tuplehash[!dir].tuple.tun_num) {
+	switch (flow->tuplehash[!dir].tuple.tun.encap_proto) {
+	case AF_INET:
+		mtu -= sizeof(struct iphdr);
+		break;
+	case AF_INET6:
 		mtu -= sizeof(*ip6h);
 		if (encap_limit > 0)
 			mtu -= 8; /* encap limit option */
+		break;
+	default:
+		break;
 	}
 
 	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
@@ -1109,6 +1010,25 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 	return 1;
 }
 
+static struct flow_offload_tuple_rhash *
+nf_flow_offload_lookup(struct nf_flowtable_ctx *ctx,
+		       struct nf_flowtable *flow_table, struct sk_buff *skb)
+{
+	struct flow_offload_tuple tuple = {};
+
+	if (!nf_flow_skb_encap_protocol(ctx, skb, htons(ETH_P_IP)))
+		return NULL;
+
+	if (ctx->tun.proto == IPPROTO_IPV6) {
+		if (nf_flow_tuple_ipv6(ctx, skb, &tuple) < 0)
+			return NULL;
+	} else if (nf_flow_tuple_ip(ctx, skb, &tuple) < 0) {
+		return NULL;
+	}
+
+	return flow_offload_lookup(flow_table, &tuple);
+}
+
 static struct flow_offload_tuple_rhash *
 nf_flow_offload_ipv6_lookup(struct nf_flowtable_ctx *ctx,
 			    struct nf_flowtable *flow_table,
@@ -1129,6 +1049,117 @@ nf_flow_offload_ipv6_lookup(struct nf_flowtable_ctx *ctx,
 	return flow_offload_lookup(flow_table, &tuple);
 }
 
+unsigned int
+nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
+			const struct nf_hook_state *state)
+{
+	int encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct nf_flowtable *flow_table = priv;
+	struct flow_offload_tuple *other_tuple;
+	enum flow_offload_tuple_dir dir;
+	struct nf_flowtable_ctx ctx = {
+		.in	= state->in,
+	};
+	struct nf_flow_xmit xmit = {};
+	struct in6_addr *ip6_daddr;
+	struct flow_offload *flow;
+	struct neighbour *neigh;
+	struct rtable *rt;
+	__be32 ip_daddr;
+	int ret;
+
+	tuplehash = nf_flow_offload_lookup(&ctx, flow_table, skb);
+	if (!tuplehash)
+		return NF_ACCEPT;
+
+	if (ctx.tun.proto == IPPROTO_IPV6)
+		ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash,
+						   skb, encap_limit);
+	else
+		ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb,
+					      encap_limit);
+	if (ret < 0)
+		return NF_DROP;
+	else if (ret == 0)
+		return NF_ACCEPT;
+
+	if (unlikely(tuplehash->tuple.xmit_type == FLOW_OFFLOAD_XMIT_XFRM)) {
+		rt = dst_rtable(tuplehash->tuple.dst_cache);
+		memset(skb->cb, 0, sizeof(struct inet_skb_parm));
+		IPCB(skb)->iif = skb->dev->ifindex;
+		IPCB(skb)->flags = IPSKB_FORWARDED;
+		return nf_flow_xmit_xfrm(skb, state, &rt->dst);
+	}
+
+	dir = tuplehash->tuple.dir;
+	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
+	other_tuple = &flow->tuplehash[!dir].tuple;
+	ip_daddr = other_tuple->src_v4.s_addr;
+	ip6_daddr = &other_tuple->src_v6;
+
+	if (other_tuple->tun.encap_proto == AF_INET6) {
+		if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
+					   &ip6_daddr,
+					   IPV6_DEFAULT_TNL_ENCAP_LIMIT) < 0)
+			return NF_DROP;
+	} else if (nf_flow_tunnel_v4_push(state->net, skb, other_tuple,
+					  &ip_daddr) < 0) {
+		return NF_DROP;
+	}
+
+	if (nf_flow_encap_push(skb, other_tuple) < 0)
+		return NF_DROP;
+
+	switch (tuplehash->tuple.xmit_type) {
+	case FLOW_OFFLOAD_XMIT_NEIGH: {
+		struct dst_entry *dst;
+
+		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.ifidx);
+		if (!xmit.outdev) {
+			flow_offload_teardown(flow);
+			return NF_DROP;
+		}
+		if (other_tuple->tun.encap_proto == AF_INET6 ||
+		    ctx.tun.proto == IPPROTO_IPV6) {
+			struct rt6_info *rt6;
+
+			rt6 = dst_rt6_info(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw6(rt6->dst.dev,
+					     rt6_nexthop(rt6, ip6_daddr));
+			dst = &rt6->dst;
+		} else {
+			rt = dst_rtable(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw4(rt->dst.dev,
+					     rt_nexthop(rt, ip_daddr));
+			dst = &rt->dst;
+		}
+		if (IS_ERR(neigh)) {
+			flow_offload_teardown(flow);
+			return NF_DROP;
+		}
+		xmit.dest = neigh->ha;
+		skb_dst_set_noref(skb, dst);
+		break;
+	}
+	case FLOW_OFFLOAD_XMIT_DIRECT:
+		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.out.ifidx);
+		if (!xmit.outdev) {
+			flow_offload_teardown(flow);
+			return NF_DROP;
+		}
+		xmit.dest = tuplehash->tuple.out.h_dest;
+		xmit.source = tuplehash->tuple.out.h_source;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return NF_DROP;
+	}
+
+	return nf_flow_queue_xmit(state->net, skb, &xmit);
+}
+EXPORT_SYMBOL_GPL(nf_flow_offload_ip_hook);
+
 unsigned int
 nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 			  const struct nf_hook_state *state)
@@ -1146,6 +1177,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	struct flow_offload *flow;
 	struct neighbour *neigh;
 	struct rt6_info *rt;
+	__be32 ip_daddr;
 	int ret;
 
 	tuplehash = nf_flow_offload_ipv6_lookup(&ctx, flow_table, skb);
@@ -1174,11 +1206,17 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	dir = tuplehash->tuple.dir;
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 	other_tuple = &flow->tuplehash[!dir].tuple;
+	ip_daddr = other_tuple->src_v4.s_addr;
 	ip6_daddr = &other_tuple->src_v6;
 
-	if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
-				   &ip6_daddr, encap_limit) < 0)
+	if (other_tuple->tun.encap_proto == AF_INET) {
+		if (nf_flow_tunnel_v4_push(state->net, skb, other_tuple,
+					   &ip_daddr) < 0)
+			return NF_DROP;
+	} else if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
+					  &ip6_daddr, encap_limit) < 0) {
 		return NF_DROP;
+	}
 
 	if (nf_flow_encap_push(skb, other_tuple) < 0)
 		return NF_DROP;
@@ -1194,10 +1232,8 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 		}
 		if (other_tuple->tun.encap_proto == AF_INET ||
 		    ctx.tun.proto == IPPROTO_IPIP) {
-			__be32 ip_daddr = other_tuple->src_v4.s_addr;
 			struct rtable *rt4;
 
-			skb->protocol = htons(ETH_P_IP);
 			rt4 = dst_rtable(tuplehash->tuple.dst_cache);
 			neigh = ip_neigh_gw4(rt4->dst.dev,
 					     rt_nexthop(rt4, ip_daddr));

-- 
2.54.0



* [PATCH nf-next v2 4/6] selftests: netfilter: nft_flowtable.sh: Add IPv4 over IPv6 flowtable selftest
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Similar to IPIP and IP6IP6, introduce a specific selftest for IPv4 over
IPv6 flowtable sw acceleration in nft_flowtable.sh.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 .../selftests/net/netfilter/nft_flowtable.sh       | 33 +++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
index 7a34ef468975..219339dbaf6e 100755
--- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
@@ -579,9 +579,8 @@ if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 ""; then
 	ip netns exec "$nsr1" nft list ruleset
 fi
 
-# IPIP tunnel test:
-# Add IPIP tunnel interfaces and check flowtable acceleration.
-test_ipip() {
+# IP tunnel tests:
+test_ip_tnls() {
 if ! ip -net "$nsr1" link add name tun0 type ipip \
      local 192.168.10.1 remote 192.168.10.2 >/dev/null;then
 	echo "SKIP: could not add ipip tunnel"
@@ -594,7 +593,9 @@ ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
 ip -net "$nsr1" link add name tun6 type ip6tnl local fee1:2::1 remote fee1:2::2
 ip -net "$nsr1" link set tun6 up
+ip -net "$nsr1" addr add 192.168.210.1/24 dev tun6
 ip -net "$nsr1" addr add fee1:3::1/64 dev tun6 nodad
+ip netns exec "$nsr1" sysctl net.ipv4.conf.tun6.forwarding=1 > /dev/null
 
 ip -net "$nsr2" link add name tun0 type ipip local 192.168.10.2 remote 192.168.10.1
 ip -net "$nsr2" link set tun0 up
@@ -603,7 +604,9 @@ ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
 ip -net "$nsr2" link add name tun6 type ip6tnl local fee1:2::2 remote fee1:2::1 || ret=1
 ip -net "$nsr2" link set tun6 up
+ip -net "$nsr2" addr add 192.168.210.2/24 dev tun6
 ip -net "$nsr2" addr add fee1:3::2/64 dev tun6 nodad
+ip netns exec "$nsr2" sysctl net.ipv4.conf.tun6.forwarding=1 > /dev/null
 
 ip -net "$nsr1" route change default via 192.168.100.2
 ip -net "$nsr2" route change default via 192.168.100.1
@@ -636,6 +639,15 @@ else
 	ret=1
 fi
 
+ip -net "$nsr1" route change default via 192.168.210.2
+ip -net "$nsr2" route change default via 192.168.210.1
+
+if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 "IP6IP4 tunnel"; then
+	echo "FAIL: flow offload for ns1/ns2 with IP6IP4 tunnel" 1>&2
+	ip netns exec "$nsr1" nft list ruleset
+	ret=1
+fi
+
 # Create vlan tagged devices for IPIP traffic.
 ip -net "$nsr1" link add link veth1 name veth1.10 type vlan id 10
 ip -net "$nsr1" link set veth1.10 up
@@ -653,7 +665,9 @@ ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun0.10 a
 
 ip -net "$nsr1" link add name tun6.10 type ip6tnl local fee1:4::1 remote fee1:4::2
 ip -net "$nsr1" link set tun6.10 up
+ip -net "$nsr1" addr add 192.168.220.1/24 dev tun6.10
 ip -net "$nsr1" addr add fee1:5::1/64 dev tun6.10 nodad
+ip netns exec "$nsr1" sysctl net.ipv4.conf.tun6/10.forwarding=1 > /dev/null
 ip -6 -net "$nsr1" route delete default
 ip -6 -net "$nsr1" route add default via fee1:5::2
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun6.10 accept'
@@ -672,7 +686,9 @@ ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0/10.forwarding=1 > /dev/null
 
 ip -net "$nsr2" link add name tun6.10 type ip6tnl local fee1:4::2 remote fee1:4::1 || ret=1
 ip -net "$nsr2" link set tun6.10 up
+ip -net "$nsr2" addr add 192.168.220.2/24 dev tun6.10
 ip -net "$nsr2" addr add fee1:5::2/64 dev tun6.10 nodad
+ip netns exec "$nsr2" sysctl net.ipv4.conf.tun6/10.forwarding=1 > /dev/null
 ip -6 -net "$nsr2" route delete default
 ip -6 -net "$nsr2" route add default via fee1:5::1
 
@@ -690,6 +706,15 @@ else
 	ret=1
 fi
 
+ip -net "$nsr1" route change default via 192.168.220.2
+ip -net "$nsr2" route change default via 192.168.220.1
+
+if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 "IP6IP4 tunnel over vlan"; then
+	echo "FAIL: flow offload for ns1/ns2 with IP6IP4 tunnel over vlan" 1>&2
+	ip netns exec "$nsr1" nft list ruleset
+	ret=1
+fi
+
 # Restore the previous configuration
 ip -net "$nsr1" route change default via 192.168.10.2
 ip -net "$nsr2" route change default via 192.168.10.1
@@ -782,7 +807,7 @@ ip -net "$nsr1" addr add dead:1::1/64 dev veth0 nodad
 ip -net "$nsr1" link set up dev veth0
 }
 
-test_ipip
+test_ip_tnls
 
 test_bridge
 

-- 
2.54.0



* [PATCH nf-next v2 3/6] net: netfilter: Add IPv4 over IPv6 tunnel flowtable acceleration
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Introduce sw flowtable acceleration for the TX/RX paths of
IPv4 over IPv6 tunnels, relying on the netfilter flowtable
infrastructure.
The feature can be tested with a forwarding scenario between two
NICs (eth0 and eth1), where an IPv4 over IPv6 tunnel is used to
reach a remote site via eth1 as the underlay device:

    ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (2001:db8:2::2)

[IP configuration]

6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 scope global eth0
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:2::1/64 scope global nodad
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr ce9c:2940:7dcc::
    inet 192.168.100.1/24 scope global tun0
       valid_lft forever preferred_lft forever

$ ip route show
default via 192.168.100.2 dev tun0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1

$ ip -6 route show
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium

$ nft list ruleset
table inet filter {
    flowtable ft {
        hook ingress priority filter
        devices = { eth0, eth1 }
    }

    chain forward {
        type filter hook forward priority filter; policy accept;
        meta l4proto { tcp, udp } flow add @ft
    }
}

When reproducing this scenario using veth interfaces, the following
results were observed:

- TCP stream received from IPv4 over IPv6 tunnel:
  - net-next (baseline):                ~126 Gbps
  - net-next + IP6IP flowtable support: ~138 Gbps

- TCP stream transmitted to IPv4 over IPv6 tunnel:
  - net-next (baseline):                ~127 Gbps
  - net-next + IP6IP flowtable support: ~140 Gbps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/netfilter/nf_flow_table_core.c |  14 +++-
 net/netfilter/nf_flow_table_ip.c   | 146 ++++++++++++++++++++++++++++---------
 net/netfilter/nf_flow_table_path.c |   6 +-
 3 files changed, 123 insertions(+), 43 deletions(-)

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 2c4140e6f53c..53fea3da0747 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -76,9 +76,11 @@ struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
 }
 EXPORT_SYMBOL_GPL(flow_offload_alloc);
 
-static u32 flow_offload_dst_cookie(struct flow_offload_tuple *flow_tuple)
+static u32 flow_offload_dst_cookie(struct flow_offload_tuple *flow_tuple,
+				   u8 tun_encap_proto)
 {
-	if (flow_tuple->l3proto == NFPROTO_IPV6)
+	if (flow_tuple->l3proto == NFPROTO_IPV6 ||
+	    tun_encap_proto == NFPROTO_IPV6)
 		return rt6_get_cookie(dst_rt6_info(flow_tuple->dst_cache));
 
 	return 0;
@@ -134,10 +136,14 @@ static int flow_offload_fill_route(struct flow_offload *flow,
 		dst_release(dst);
 		break;
 	case FLOW_OFFLOAD_XMIT_XFRM:
-	case FLOW_OFFLOAD_XMIT_NEIGH:
+	case FLOW_OFFLOAD_XMIT_NEIGH: {
+		u8 encap_proto = route->tuple[!dir].in.tun.encap_proto;
+
 		flow_tuple->ifidx = route->tuple[dir].out.ifindex;
 		flow_tuple->dst_cache = dst;
-		flow_tuple->dst_cookie = flow_offload_dst_cookie(flow_tuple);
+		flow_tuple->dst_cookie = flow_offload_dst_cookie(flow_tuple,
+								 encap_proto);
+		}
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 9efd76b57847..6394f4474f43 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -191,27 +191,27 @@ static void nf_flow_tuple_encap(struct nf_flowtable_ctx *ctx,
 		break;
 	}
 
-	switch (inner_proto) {
-	case htons(ETH_P_IP):
-		iph = (struct iphdr *)(skb_network_header(skb) + offset);
-		if (ctx->tun.proto == IPPROTO_IPIP) {
+	if (ctx->tun.proto == IPPROTO_IPIP || ctx->tun.proto == IPPROTO_IPV6) {
+		switch (inner_proto) {
+		case htons(ETH_P_IP):
+			iph = (struct iphdr *)(skb_network_header(skb) +
+					       offset);
 			tuple->tun.dst_v4.s_addr = iph->daddr;
 			tuple->tun.src_v4.s_addr = iph->saddr;
-			tuple->tun.l3_proto = IPPROTO_IPIP;
+			tuple->tun.l3_proto = ctx->tun.proto;
 			tuple->tun.encap_proto = AF_INET;
-		}
-		break;
-	case htons(ETH_P_IPV6):
-		ip6h = (struct ipv6hdr *)(skb_network_header(skb) + offset);
-		if (ctx->tun.proto == IPPROTO_IPV6) {
+			break;
+		case htons(ETH_P_IPV6):
+			ip6h = (struct ipv6hdr *)(skb_network_header(skb) +
+						  offset);
 			tuple->tun.dst_v6 = ip6h->daddr;
 			tuple->tun.src_v6 = ip6h->saddr;
-			tuple->tun.l3_proto = IPPROTO_IPV6;
+			tuple->tun.l3_proto = ctx->tun.proto;
 			tuple->tun.encap_proto = AF_INET6;
+			break;
+		default:
+			break;
 		}
-		break;
-	default:
-		break;
 	}
 }
 
@@ -367,9 +367,9 @@ static bool nf_flow_ip6_tunnel_proto(struct nf_flowtable_ctx *ctx,
 	if (hdrlen < 0)
 		return false;
 
-	if (nexthdr == IPPROTO_IPV6) {
+	if (nexthdr == IPPROTO_IPIP || nexthdr == IPPROTO_IPV6) {
 		ctx->tun.hdr_size = hdrlen;
-		ctx->tun.proto = IPPROTO_IPV6;
+		ctx->tun.proto = nexthdr;
 	}
 	ctx->offset += ctx->tun.hdr_size;
 
@@ -388,6 +388,10 @@ static void nf_flow_ip_tunnel_pop(struct nf_flowtable_ctx *ctx,
 
 	skb_pull(skb, ctx->tun.hdr_size);
 	skb_reset_network_header(skb);
+	if (ctx->tun.proto == IPPROTO_IPIP)
+		skb->protocol = htons(ETH_P_IP);
+	else
+		skb->protocol = htons(ETH_P_IPV6);
 }
 
 static bool nf_flow_skb_encap_protocol(struct nf_flowtable_ctx *ctx,
@@ -499,7 +503,7 @@ nf_flow_offload_lookup(struct nf_flowtable_ctx *ctx,
 static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 				   struct nf_flowtable *flow_table,
 				   struct flow_offload_tuple_rhash *tuplehash,
-				   struct sk_buff *skb)
+				   struct sk_buff *skb, int encap_limit)
 {
 	enum flow_offload_tuple_dir dir;
 	struct flow_offload *flow;
@@ -510,8 +514,18 @@ static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (flow->tuplehash[!dir].tuple.tun_num)
+	switch (flow->tuplehash[!dir].tuple.tun.encap_proto) {
+	case AF_INET:
 		mtu -= sizeof(*iph);
+		break;
+	case AF_INET6:
+		mtu -= sizeof(struct ipv6hdr);
+		if (encap_limit > 0)
+			mtu -= 8; /* encap limit option */
+		break;
+	default:
+		break;
+	}
 
 	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
 		return 0;
@@ -650,18 +664,29 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 				      struct in6_addr **ip6_daddr,
 				      int encap_limit)
 {
-	struct ipv6hdr *ip6h = (struct ipv6hdr *)skb_network_header(skb);
-	u8 hop_limit = ip6h->hop_limit, proto = IPPROTO_IPV6;
 	struct rtable *rt = dst_rtable(tuple->dst_cache);
-	__u8 dsfield = ipv6_get_dsfield(ip6h);
+	u8 hop_limit, proto = tuple->tun.l3_proto;
 	struct flowi6 fl6 = {
 		.daddr = tuple->tun.src_v6,
 		.saddr = tuple->tun.dst_v6,
 		.flowi6_proto = proto,
 	};
+	struct ipv6hdr *ip6h;
+	__u8 dsfield;
 	int err, mtu;
 	u32 headroom;
 
+	if (tuple->tun.l3_proto == IPPROTO_IPIP) {
+		struct iphdr *iph = (struct iphdr *)skb_network_header(skb);
+
+		dsfield = ipv4_get_dsfield(iph);
+		hop_limit = iph->ttl;
+	} else {
+		ip6h = (struct ipv6hdr *)skb_network_header(skb);
+		dsfield = ipv6_get_dsfield(ip6h);
+		hop_limit = ip6h->hop_limit;
+	}
+
 	err = iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6);
 	if (err)
 		return err;
@@ -697,12 +722,13 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 
 		hopt = skb_push(skb, ipv6_optlen(opt.ops.dst1opt));
 		memcpy(hopt, opt.ops.dst1opt, ipv6_optlen(opt.ops.dst1opt));
-		hopt->nexthdr = IPPROTO_IPV6;
+		hopt->nexthdr = proto;
 		proto = NEXTHDR_DEST;
 	}
 
 	skb_push(skb, sizeof(*ip6h));
 	skb_reset_network_header(skb);
+	skb->protocol = htons(ETH_P_IPV6);
 
 	ip6h = ipv6_hdr(skb);
 	ip6_flow_hdr(ip6h, dsfield,
@@ -759,6 +785,7 @@ unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state)
 {
+	int encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
 	struct flow_offload_tuple_rhash *tuplehash;
 	struct nf_flowtable *flow_table = priv;
 	struct flow_offload_tuple *other_tuple;
@@ -767,6 +794,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 		.in	= state->in,
 	};
 	struct nf_flow_xmit xmit = {};
+	struct in6_addr *ip6_daddr;
 	struct flow_offload *flow;
 	struct neighbour *neigh;
 	struct rtable *rt;
@@ -777,7 +805,8 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	if (!tuplehash)
 		return NF_ACCEPT;
 
-	ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb);
+	ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb,
+				      encap_limit);
 	if (ret < 0)
 		return NF_DROP;
 	else if (ret == 0)
@@ -796,28 +825,50 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	other_tuple = &flow->tuplehash[!dir].tuple;
 	ip_daddr = other_tuple->src_v4.s_addr;
 
-	if (nf_flow_tunnel_v4_push(state->net, skb, other_tuple, &ip_daddr) < 0)
+	if (other_tuple->tun.encap_proto == AF_INET6) {
+		if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
+					   &ip6_daddr,
+					   IPV6_DEFAULT_TNL_ENCAP_LIMIT) < 0)
+			return NF_DROP;
+	} else if (nf_flow_tunnel_v4_push(state->net, skb, other_tuple,
+					  &ip_daddr) < 0) {
 		return NF_DROP;
+	}
 
 	if (nf_flow_encap_push(skb, other_tuple) < 0)
 		return NF_DROP;
 
 	switch (tuplehash->tuple.xmit_type) {
-	case FLOW_OFFLOAD_XMIT_NEIGH:
-		rt = dst_rtable(tuplehash->tuple.dst_cache);
+	case FLOW_OFFLOAD_XMIT_NEIGH: {
+		struct dst_entry *dst;
+
 		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.ifidx);
 		if (!xmit.outdev) {
 			flow_offload_teardown(flow);
 			return NF_DROP;
 		}
-		neigh = ip_neigh_gw4(rt->dst.dev, rt_nexthop(rt, ip_daddr));
+		if (other_tuple->tun.encap_proto == AF_INET6 ||
+		    ctx.tun.proto == IPPROTO_IPV6) {
+			struct rt6_info *rt6;
+
+			rt6 = dst_rt6_info(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw6(rt6->dst.dev,
+					     rt6_nexthop(rt6, ip6_daddr));
+			dst = &rt6->dst;
+		} else {
+			rt = dst_rtable(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw4(rt->dst.dev,
+					     rt_nexthop(rt, ip_daddr));
+			dst = &rt->dst;
+		}
 		if (IS_ERR(neigh)) {
 			flow_offload_teardown(flow);
 			return NF_DROP;
 		}
 		xmit.dest = neigh->ha;
-		skb_dst_set_noref(skb, &rt->dst);
+		skb_dst_set_noref(skb, dst);
 		break;
+	}
 	case FLOW_OFFLOAD_XMIT_DIRECT:
 		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.out.ifidx);
 		if (!xmit.outdev) {
@@ -1068,8 +1119,12 @@ nf_flow_offload_ipv6_lookup(struct nf_flowtable_ctx *ctx,
 	if (!nf_flow_skb_encap_protocol(ctx, skb, htons(ETH_P_IPV6)))
 		return NULL;
 
-	if (nf_flow_tuple_ipv6(ctx, skb, &tuple) < 0)
+	if (ctx->tun.proto == IPPROTO_IPIP) {
+		if (nf_flow_tuple_ip(ctx, skb, &tuple) < 0)
+			return NULL;
+	} else if (nf_flow_tuple_ipv6(ctx, skb, &tuple) < 0) {
 		return NULL;
+	}
 
 	return flow_offload_lookup(flow_table, &tuple);
 }
@@ -1097,8 +1152,12 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	if (tuplehash == NULL)
 		return NF_ACCEPT;
 
-	ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb,
-					   encap_limit);
+	if (ctx.tun.proto == IPPROTO_IPIP)
+		ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb,
+					      encap_limit);
+	else
+		ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash,
+						   skb, encap_limit);
 	if (ret < 0)
 		return NF_DROP;
 	else if (ret == 0)
@@ -1125,21 +1184,38 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 		return NF_DROP;
 
 	switch (tuplehash->tuple.xmit_type) {
-	case FLOW_OFFLOAD_XMIT_NEIGH:
-		rt = dst_rt6_info(tuplehash->tuple.dst_cache);
+	case FLOW_OFFLOAD_XMIT_NEIGH: {
+		struct dst_entry *dst;
+
 		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.ifidx);
 		if (!xmit.outdev) {
 			flow_offload_teardown(flow);
 			return NF_DROP;
 		}
-		neigh = ip_neigh_gw6(rt->dst.dev, rt6_nexthop(rt, ip6_daddr));
+		if (other_tuple->tun.encap_proto == AF_INET ||
+		    ctx.tun.proto == IPPROTO_IPIP) {
+			__be32 ip_daddr = other_tuple->src_v4.s_addr;
+			struct rtable *rt4;
+
+			skb->protocol = htons(ETH_P_IP);
+			rt4 = dst_rtable(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw4(rt4->dst.dev,
+					     rt_nexthop(rt4, ip_daddr));
+			dst = &rt4->dst;
+		} else {
+			rt = dst_rt6_info(tuplehash->tuple.dst_cache);
+			neigh = ip_neigh_gw6(rt->dst.dev,
+					     rt6_nexthop(rt, ip6_daddr));
+			dst = &rt->dst;
+		}
 		if (IS_ERR(neigh)) {
 			flow_offload_teardown(flow);
 			return NF_DROP;
 		}
 		xmit.dest = neigh->ha;
-		skb_dst_set_noref(skb, &rt->dst);
+		skb_dst_set_noref(skb, dst);
 		break;
+	}
 	case FLOW_OFFLOAD_XMIT_DIRECT:
 		xmit.outdev = dev_get_by_index_rcu(state->net, tuplehash->tuple.out.ifidx);
 		if (!xmit.outdev) {
diff --git a/net/netfilter/nf_flow_table_path.c b/net/netfilter/nf_flow_table_path.c
index 5a5774d9b6f5..74b6f5ea35f9 100644
--- a/net/netfilter/nf_flow_table_path.c
+++ b/net/netfilter/nf_flow_table_path.c
@@ -209,12 +209,11 @@ static int nft_flow_tunnel_update_route(const struct nft_pktinfo *pkt,
 	struct dst_entry *tun_dst = NULL;
 	struct flowi fl = {};
 
-	switch (nft_pf(pkt)) {
+	switch (tun->encap_proto) {
 	case NFPROTO_IPV4:
 		fl.u.ip4.daddr = tun->dst_v4.s_addr;
 		fl.u.ip4.saddr = tun->src_v4.s_addr;
 		fl.u.ip4.flowi4_iif = nft_in(pkt)->ifindex;
-		fl.u.ip4.flowi4_dscp = ip4h_dscp(ip_hdr(pkt->skb));
 		fl.u.ip4.flowi4_mark = pkt->skb->mark;
 		fl.u.ip4.flowi4_flags = FLOWI_FLAG_ANYSRC;
 		break;
@@ -222,13 +221,12 @@ static int nft_flow_tunnel_update_route(const struct nft_pktinfo *pkt,
 		fl.u.ip6.daddr = tun->dst_v6;
 		fl.u.ip6.saddr = tun->src_v6;
 		fl.u.ip6.flowi6_iif = nft_in(pkt)->ifindex;
-		fl.u.ip6.flowlabel = ip6_flowinfo(ipv6_hdr(pkt->skb));
 		fl.u.ip6.flowi6_mark = pkt->skb->mark;
 		fl.u.ip6.flowi6_flags = FLOWI_FLAG_ANYSRC;
 		break;
 	}
 
-	nf_route(nft_net(pkt), &tun_dst, &fl, false, nft_pf(pkt));
+	nf_route(nft_net(pkt), &tun_dst, &fl, false, tun->encap_proto);
 	if (!tun_dst)
 		return -ENOENT;
 

-- 
2.54.0



* [PATCH v1 net] tcp: Fix out-of-bounds access for twsk in tcp_ao_established_key().
From: Kuniyuki Iwashima @ 2026-05-06 17:28 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, David S. Miller, Jakub Kicinski,
	Paolo Abeni
  Cc: Dmitry Safonov, Simon Horman, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, Damiano Melotti

The cited commit added a lockdep_sock_is_held() annotation to
tcp_ao_established_key().

tcp_ao_established_key() can be called from tcp_v[46]_timewait_ack()
with a twsk (struct tcp_timewait_sock).

Since a twsk does not have sk->sk_lock, the lockdep annotation
results in an out-of-bounds access.

  $ pahole -C tcp_timewait_sock vmlinux | grep size
  	/* size: 288, cachelines: 5, members: 8 */
  $ pahole -C sock vmlinux | grep sk_lock
  	socket_lock_t              sk_lock;              /*   440   192 */

Let's not use lockdep_sock_is_held() for TCP_TIME_WAIT.
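
As a rough user-space illustration of the pahole numbers above, the
sketch below checks whether a field at a given offset fits inside an
object of a given size. The helper names field_fits() and
twsk_has_sk_lock() are hypothetical and not kernel APIs; the three
constants are copied from the pahole output and depend on the kernel
configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): returns true when a field of
 * field_len bytes starting at field_off lies entirely inside an object
 * of obj_size bytes.
 */
static bool field_fits(size_t obj_size, size_t field_off, size_t field_len)
{
	if (field_off > obj_size)
		return false;
	return field_len <= obj_size - field_off;
}

/* Numbers copied from the pahole output in this commit message. */
enum {
	TWSK_SIZE    = 288,	/* size of struct tcp_timewait_sock */
	SK_LOCK_OFF  = 440,	/* offset of sk_lock in struct sock */
	SK_LOCK_SIZE = 192,	/* size of socket_lock_t */
};

/* True when reading sk->sk_lock through a twsk would stay in bounds. */
static bool twsk_has_sk_lock(void)
{
	return field_fits(TWSK_SIZE, SK_LOCK_OFF, SK_LOCK_SIZE);
}
```

With these numbers the field starts 152 bytes past the end of the
timewait socket, so any dereference of sk_lock through a twsk reads
memory that does not belong to it.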

Fixes: 6b2d11e2d8fc ("net/tcp: Add missing lockdep annotations for TCP-AO hlist traversals")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/tcp_ao.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index a97cdf3e6af4..e2720233e36b 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -116,7 +116,9 @@ struct tcp_ao_key *tcp_ao_established_key(const struct sock *sk,
 {
 	struct tcp_ao_key *key;
 
-	hlist_for_each_entry_rcu(key, &ao->head, node, lockdep_sock_is_held(sk)) {
+	hlist_for_each_entry_rcu(key, &ao->head, node,
+				 sk->sk_state == TCP_TIME_WAIT ||
+				 lockdep_sock_is_held(sk)) {
 		if ((sndid >= 0 && key->sndid != sndid) ||
 		    (rcvid >= 0 && key->rcvid != rcvid))
 			continue;
-- 
2.54.0.545.g6539524ca2-goog



* [PATCH nf-next v2 2/6] net: netfilter: Add encap_proto to flow_offload_tunnel
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Add encap_proto (AF_INET or AF_INET6) to struct flow_offload_tunnel
to allow its use as part of the hash table key during flowtable entry
lookup.
This is a preliminary change to support IPv4 over IPv6 tunneling via
the flowtable infrastructure for software acceleration.
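
To illustrate why the extra field matters for lookups, here is a
minimal user-space sketch of a tunnel key comparison. The struct and
function names are hypothetical and only loosely modelled on struct
flow_offload_tunnel. The assumption is that l3_proto alone can be
IPPROTO_IPIP for both an IPv4 and an IPv6 outer header, so encap_proto
is what tells the two cases apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical, simplified stand-in for the tunnel part of the key. */
struct sketch_tunnel_key {
	union {
		uint8_t v4[4];
		uint8_t v6[16];
	} src, dst;
	uint8_t l3_proto;	/* inner protocol, e.g. 4 for IPIP */
	uint8_t encap_proto;	/* outer family: AF_INET or AF_INET6 */
};

/* Two keys match only if the outer family matches as well. */
static bool sketch_tunnel_key_equal(const struct sketch_tunnel_key *a,
				    const struct sketch_tunnel_key *b)
{
	return a->l3_proto == b->l3_proto &&
	       a->encap_proto == b->encap_proto &&
	       !memcmp(&a->src, &b->src, sizeof(a->src)) &&
	       !memcmp(&a->dst, &b->dst, sizeof(a->dst));
}
```

Callers are expected to zero the whole key before filling it in, so
that the unused tail of the address unions compares equal.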

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/linux/netdevice.h             | 1 +
 include/net/netfilter/nf_flow_table.h | 1 +
 net/ipv4/ipip.c                       | 1 +
 net/ipv6/ip6_tunnel.c                 | 1 +
 net/netfilter/nf_flow_table_ip.c      | 2 ++
 net/netfilter/nf_flow_table_path.c    | 2 ++
 6 files changed, 8 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 85bd9d46b5a0..02f593397fad 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -902,6 +902,7 @@ struct net_device_path {
 			};
 
 			u8	l3_proto;
+			u8	encap_proto;
 		} tun;
 		struct {
 			enum {
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index b09c11c048d5..96e8ecf0f530 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -118,6 +118,7 @@ struct flow_offload_tunnel {
 	};
 
 	u8	l3_proto;
+	u8	encap_proto;
 };
 
 struct flow_offload_tuple {
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index ff95b1b9908e..5425af051d5a 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -369,6 +369,7 @@ static int ipip_fill_forward_path(struct net_device_path_ctx *ctx,
 	path->tun.src_v4.s_addr = tiph->saddr;
 	path->tun.dst_v4.s_addr = tiph->daddr;
 	path->tun.l3_proto = IPPROTO_IPIP;
+	path->tun.encap_proto = AF_INET;
 	path->dev = ctx->dev;
 
 	ctx->dev = rt->dst.dev;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 3d64e672eeee..c99ed41bfc99 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1851,6 +1851,7 @@ static int ip6_tnl_fill_forward_path(struct net_device_path_ctx *ctx,
 		path->type = DEV_PATH_TUN;
 		path->tun.src_v6 = t->parms.laddr;
 		path->tun.dst_v6 = t->parms.raddr;
+		path->tun.encap_proto = AF_INET6;
 		if (ctx->ether_type == cpu_to_be16(ETH_P_IP))
 			path->tun.l3_proto = IPPROTO_IPIP;
 		else
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index fd56d663cb5b..9efd76b57847 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -198,6 +198,7 @@ static void nf_flow_tuple_encap(struct nf_flowtable_ctx *ctx,
 			tuple->tun.dst_v4.s_addr = iph->daddr;
 			tuple->tun.src_v4.s_addr = iph->saddr;
 			tuple->tun.l3_proto = IPPROTO_IPIP;
+			tuple->tun.encap_proto = AF_INET;
 		}
 		break;
 	case htons(ETH_P_IPV6):
@@ -206,6 +207,7 @@ static void nf_flow_tuple_encap(struct nf_flowtable_ctx *ctx,
 			tuple->tun.dst_v6 = ip6h->daddr;
 			tuple->tun.src_v6 = ip6h->saddr;
 			tuple->tun.l3_proto = IPPROTO_IPV6;
+			tuple->tun.encap_proto = AF_INET6;
 		}
 		break;
 	default:
diff --git a/net/netfilter/nf_flow_table_path.c b/net/netfilter/nf_flow_table_path.c
index df4e180ed3c2..5a5774d9b6f5 100644
--- a/net/netfilter/nf_flow_table_path.c
+++ b/net/netfilter/nf_flow_table_path.c
@@ -127,6 +127,7 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
 				info->tun.src_v6 = path->tun.src_v6;
 				info->tun.dst_v6 = path->tun.dst_v6;
 				info->tun.l3_proto = path->tun.l3_proto;
+				info->tun.encap_proto = path->tun.encap_proto;
 				info->num_tuns++;
 			} else {
 				if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX) {
@@ -270,6 +271,7 @@ static void nft_dev_forward_path(const struct nft_pktinfo *pkt,
 		route->tuple[!dir].in.tun.src_v6 = info.tun.dst_v6;
 		route->tuple[!dir].in.tun.dst_v6 = info.tun.src_v6;
 		route->tuple[!dir].in.tun.l3_proto = info.tun.l3_proto;
+		route->tuple[!dir].in.tun.encap_proto = info.tun.encap_proto;
 		route->tuple[!dir].in.num_tuns = info.num_tuns;
 	}
 

-- 
2.54.0



* [PATCH nf-next v2 1/6] net: netfilter: Add ether_type to net_device_path_ctx
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260506-b4-flowtable-sw-accel-ip6ip-v2-0-439fd427726e@kernel.org>

Add an ether_type field to struct net_device_path_ctx to allow IPv6
tunnel drivers to select the appropriate L3 protocol based on the
encapsulated traffic.
Update the airoha and mtk Ethernet drivers to use the new
dev_fill_forward_path() signature.
This is a preliminary patch to enable sw flowtable acceleration for
IPv4 over IPv6 tunnels.
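
The selection described above can be sketched in user space as
follows. This mirrors the check added to ip6_tnl_fill_forward_path()
in the diff below, but the function name and the SKETCH_* constants
are hypothetical local copies of well-known values, so that the
example builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

/* Local copies of well-known values (assumed, not kernel headers). */
#define SKETCH_ETH_P_IP		0x0800	/* EtherType for IPv4 */
#define SKETCH_ETH_P_IPV6	0x86DD	/* EtherType for IPv6 */
#define SKETCH_IPPROTO_IPIP	4	/* IPv4 carried inside the tunnel */
#define SKETCH_IPPROTO_IPV6	41	/* IPv6 carried inside the tunnel */

/*
 * Pick the tunnel L3 protocol from the EtherType of the encapsulated
 * traffic. ether_type is in network byte order, as in the patch.
 */
static uint8_t sketch_tun_l3_proto(uint16_t ether_type)
{
	if (ether_type == htons(SKETCH_ETH_P_IP))
		return SKETCH_IPPROTO_IPIP;

	/* Anything else keeps the previous behaviour. */
	return SKETCH_IPPROTO_IPV6;
}
```

Callers that leave ether_type at 0 therefore still get IPPROTO_IPV6,
which matches the value that was hard-coded before this patch.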

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/airoha/airoha_ppe.c        | 14 +++++++++-----
 drivers/net/ethernet/mediatek/mtk_ppe_offload.c | 13 ++++++++-----
 include/linux/netdevice.h                       |  4 +++-
 net/core/dev.c                                  |  6 ++++--
 net/ipv6/ip6_tunnel.c                           |  5 ++++-
 net/netfilter/nf_flow_table_path.c              |  8 +++++---
 6 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
index 26da519236bf..c5eccb3a43a1 100644
--- a/drivers/net/ethernet/airoha/airoha_ppe.c
+++ b/drivers/net/ethernet/airoha/airoha_ppe.c
@@ -245,7 +245,8 @@ static int airoha_ppe_flow_mangle_ipv4(const struct flow_action_entry *act,
 	return 0;
 }
 
-static int airoha_ppe_get_wdma_info(struct net_device *dev, const u8 *addr,
+static int airoha_ppe_get_wdma_info(struct net_device *dev,
+				    const u8 *addr, __be16 ether_type,
 				    struct airoha_wdma_info *info)
 {
 	struct net_device_path_stack stack;
@@ -256,7 +257,7 @@ static int airoha_ppe_get_wdma_info(struct net_device *dev, const u8 *addr,
 		return -ENODEV;
 
 	rcu_read_lock();
-	err = dev_fill_forward_path(dev, addr, &stack);
+	err = dev_fill_forward_path(dev, addr, ether_type, &stack);
 	rcu_read_unlock();
 	if (err)
 		return err;
@@ -300,7 +301,7 @@ static int airoha_ppe_foe_entry_prepare(struct airoha_eth *eth,
 					struct airoha_foe_entry *hwe,
 					struct net_device *dev, int type,
 					struct airoha_flow_data *data,
-					int l4proto)
+					__be16 ether_type, int l4proto)
 {
 	u32 qdata = FIELD_PREP(AIROHA_FOE_SHAPER_ID, 0x7f), ports_pad, val;
 	int wlan_etype = -EINVAL, dsa_port = airoha_get_dsa_port(&dev);
@@ -322,7 +323,8 @@ static int airoha_ppe_foe_entry_prepare(struct airoha_eth *eth,
 	if (dev) {
 		struct airoha_wdma_info info = {};
 
-		if (!airoha_ppe_get_wdma_info(dev, data->eth.h_dest, &info)) {
+		if (!airoha_ppe_get_wdma_info(dev, data->eth.h_dest,
+					      ether_type, &info)) {
 			val |= FIELD_PREP(AIROHA_FOE_IB2_NBQ, info.idx) |
 			       FIELD_PREP(AIROHA_FOE_IB2_PSE_PORT,
 					  FE_PSE_PORT_CDM4);
@@ -1047,6 +1049,7 @@ static int airoha_ppe_flow_offload_replace(struct airoha_eth *eth,
 	struct flow_action_entry *act;
 	struct airoha_foe_entry hwe;
 	int err, i, offload_type;
+	__be16 ether_type = 0;
 	u16 addr_type = 0;
 	u8 l4proto = 0;
 
@@ -1073,6 +1076,7 @@ static int airoha_ppe_flow_offload_replace(struct airoha_eth *eth,
 		struct flow_match_basic match;
 
 		flow_rule_match_basic(rule, &match);
+		ether_type = match.key->n_proto;
 		l4proto = match.key->ip_proto;
 	} else {
 		return -EOPNOTSUPP;
@@ -1143,7 +1147,7 @@ static int airoha_ppe_flow_offload_replace(struct airoha_eth *eth,
 		return -EINVAL;
 
 	err = airoha_ppe_foe_entry_prepare(eth, &hwe, odev, offload_type,
-					   &data, l4proto);
+					   &data, ether_type, l4proto);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mediatek/mtk_ppe_offload.c b/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
index cc8c4ef8038f..2601c17b29c8 100644
--- a/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
+++ b/drivers/net/ethernet/mediatek/mtk_ppe_offload.c
@@ -89,7 +89,8 @@ mtk_flow_offload_mangle_eth(const struct flow_action_entry *act, void *eth)
 }
 
 static int
-mtk_flow_get_wdma_info(struct net_device *dev, const u8 *addr, struct mtk_wdma_info *info)
+mtk_flow_get_wdma_info(struct net_device *dev, const u8 *addr,
+		       __be16 ether_type, struct mtk_wdma_info *info)
 {
 	struct net_device_path_stack stack;
 	struct net_device_path *path;
@@ -102,7 +103,7 @@ mtk_flow_get_wdma_info(struct net_device *dev, const u8 *addr, struct mtk_wdma_i
 		return -1;
 
 	rcu_read_lock();
-	err = dev_fill_forward_path(dev, addr, &stack);
+	err = dev_fill_forward_path(dev, addr, ether_type, &stack);
 	rcu_read_unlock();
 	if (err)
 		return err;
@@ -190,12 +191,12 @@ mtk_flow_get_dsa_port(struct net_device **dev)
 static int
 mtk_flow_set_output_device(struct mtk_eth *eth, struct mtk_foe_entry *foe,
 			   struct net_device *dev, const u8 *dest_mac,
-			   int *wed_index)
+			   __be16 ether_type, int *wed_index)
 {
 	struct mtk_wdma_info info = {};
 	int pse_port, dsa_port, queue;
 
-	if (mtk_flow_get_wdma_info(dev, dest_mac, &info) == 0) {
+	if (mtk_flow_get_wdma_info(dev, dest_mac, ether_type, &info) == 0) {
 		mtk_foe_entry_set_wdma(eth, foe, info.wdma_idx, info.queue,
 				       info.bss, info.wcid, info.amsdu);
 		if (mtk_is_netsys_v2_or_greater(eth)) {
@@ -273,6 +274,7 @@ mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f,
 	struct mtk_flow_data data = {};
 	struct mtk_foe_entry foe;
 	struct mtk_flow_entry *entry;
+	__be16 ether_type = 0;
 	int offload_type = 0;
 	int wed_index = -1;
 	u16 addr_type = 0;
@@ -319,6 +321,7 @@ mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f,
 		struct flow_match_basic match;
 
 		flow_rule_match_basic(rule, &match);
+		ether_type = match.key->n_proto;
 		l4proto = match.key->ip_proto;
 	} else {
 		return -EOPNOTSUPP;
@@ -481,7 +484,7 @@ mtk_flow_offload_replace(struct mtk_eth *eth, struct flow_cls_offload *f,
 		mtk_foe_entry_set_pppoe(eth, &foe, data.pppoe.sid);
 
 	err = mtk_flow_set_output_device(eth, &foe, odev, data.eth.h_dest,
-					 &wed_index);
+					 ether_type, &wed_index);
 	if (err)
 		return err;
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 744ffa243501..85bd9d46b5a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -938,6 +938,7 @@ struct net_device_path_stack {
 struct net_device_path_ctx {
 	const struct net_device *dev;
 	u8			daddr[ETH_ALEN];
+	__be16			ether_type;
 
 	int			num_vlans;
 	struct {
@@ -3391,7 +3392,8 @@ void dev_remove_offload(struct packet_offload *po);
 
 int dev_get_iflink(const struct net_device *dev);
 int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb);
-int dev_fill_forward_path(const struct net_device *dev, const u8 *daddr,
+int dev_fill_forward_path(const struct net_device *dev,
+			  const u8 *daddr, __be16 ether_type,
 			  struct net_device_path_stack *stack);
 struct net_device *dev_get_by_name(struct net *net, const char *name);
 struct net_device *dev_get_by_name_rcu(struct net *net, const char *name);
diff --git a/net/core/dev.c b/net/core/dev.c
index 06c195906231..5f6171c08849 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -750,12 +750,14 @@ static struct net_device_path *dev_fwd_path(struct net_device_path_stack *stack)
 	return &stack->path[k];
 }
 
-int dev_fill_forward_path(const struct net_device *dev, const u8 *daddr,
+int dev_fill_forward_path(const struct net_device *dev,
+			  const u8 *daddr, __be16 ether_type,
 			  struct net_device_path_stack *stack)
 {
 	const struct net_device *last_dev;
 	struct net_device_path_ctx ctx = {
-		.dev	= dev,
+		.dev		= dev,
+		.ether_type	= ether_type,
 	};
 	struct net_device_path *path;
 	int ret = 0;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index c468c83af0f2..3d64e672eeee 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1851,7 +1851,10 @@ static int ip6_tnl_fill_forward_path(struct net_device_path_ctx *ctx,
 		path->type = DEV_PATH_TUN;
 		path->tun.src_v6 = t->parms.laddr;
 		path->tun.dst_v6 = t->parms.raddr;
-		path->tun.l3_proto = IPPROTO_IPV6;
+		if (ctx->ether_type == cpu_to_be16(ETH_P_IP))
+			path->tun.l3_proto = IPPROTO_IPIP;
+		else
+			path->tun.l3_proto = IPPROTO_IPV6;
 		path->dev = ctx->dev;
 		ctx->dev = dst->dev;
 	}
diff --git a/net/netfilter/nf_flow_table_path.c b/net/netfilter/nf_flow_table_path.c
index 6bb9579dcc2a..df4e180ed3c2 100644
--- a/net/netfilter/nf_flow_table_path.c
+++ b/net/netfilter/nf_flow_table_path.c
@@ -45,7 +45,8 @@ static bool nft_is_valid_ether_device(const struct net_device *dev)
 static int nft_dev_fill_forward_path(const struct nf_flow_route *route,
 				     const struct dst_entry *dst_cache,
 				     const struct nf_conn *ct,
-				     enum ip_conntrack_dir dir, u8 *ha,
+				     enum ip_conntrack_dir dir,
+				     u8 *ha, __be16 ether_type,
 				     struct net_device_path_stack *stack)
 {
 	const void *daddr = &ct->tuplehash[!dir].tuple.src.u3;
@@ -70,7 +71,7 @@ static int nft_dev_fill_forward_path(const struct nf_flow_route *route,
 		return -1;
 
 out:
-	return dev_fill_forward_path(dev, ha, stack);
+	return dev_fill_forward_path(dev, ha, ether_type, stack);
 }
 
 struct nft_forward_info {
@@ -248,7 +249,8 @@ static void nft_dev_forward_path(const struct nft_pktinfo *pkt,
 	unsigned char ha[ETH_ALEN];
 	int i;
 
-	if (nft_dev_fill_forward_path(route, dst, ct, dir, ha, &stack) >= 0)
+	if (nft_dev_fill_forward_path(route, dst, ct, dir, ha, pkt->ethertype,
+				      &stack) >= 0)
 		nft_dev_path_info(&stack, &info, ha, &ft->data);
 
 	if (info.outdev)

-- 
2.54.0



* [PATCH nf-next v2 0/6] Add IPv4 over IPv6 and SIT flowtable SW acceleration
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest

Similar to IPIP and IP6IP6 tunnels, introduce sw acceleration for IPv4 over
IPv6 and SIT tunnels in the netfilter flowtable infrastructure.

---
Changes in v2:
- Fix MTU check in nf_flow_offload_forward() and in
  nf_flow_offload_ipv6_forward()
- Add SIT sw acceleration support
- Link to v1: https://lore.kernel.org/r/20260505-b4-flowtable-sw-accel-ip6ip-v1-0-9ac39ccc9ea9@kernel.org
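
For reference, the MTU adjustment mentioned in the first item can be
sketched in user space as below. The header sizes follow the check in
nf_flow_offload_forward() from the IPv4 over IPv6 patch in this series
(20 bytes for an IPv4 outer header, 40 for an IPv6 outer header, plus
8 when an encapsulation limit option is pushed). The function and
constant names are hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

enum {
	SKETCH_IPV4_HDR_LEN	= 20,	/* outer IPv4 header, no options */
	SKETCH_IPV6_HDR_LEN	= 40,	/* outer IPv6 fixed header */
	SKETCH_ENCAP_LIMIT_LEN	= 8,	/* dest options hdr with encap limit */
};

/*
 * Subtract the outer tunnel overhead from mtu, depending on the outer
 * address family and on whether an encap limit option will be added.
 */
static unsigned int sketch_tunnel_mtu(unsigned int mtu, int encap_proto,
				      int encap_limit)
{
	switch (encap_proto) {
	case AF_INET:
		mtu -= SKETCH_IPV4_HDR_LEN;
		break;
	case AF_INET6:
		mtu -= SKETCH_IPV6_HDR_LEN;
		if (encap_limit > 0)
			mtu -= SKETCH_ENCAP_LIMIT_LEN;
		break;
	default:
		/* No tunnel: nothing to subtract. */
		break;
	}

	return mtu;
}
```

For a 1500 byte underlay this gives 1480 for an IPv4 outer header,
1460 for an IPv6 outer header without the option and 1452 with it.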

---
Lorenzo Bianconi (6):
      net: netfilter: Add ether_type to net_device_path_ctx
      net: netfilter: Add encap_proto to flow_offload_tunnel
      net: netfilter: Add IPv4 over IPv6 tunnel flowtable acceleration
      selftests: netfilter: nft_flowtable.sh: Add IPv4 over IPv6 flowtable selftest
      net: netfilter: Add SIT tunnel flowtable acceleration
      selftests: netfilter: nft_flowtable.sh: Add SIT flowtable selftest

 drivers/net/ethernet/airoha/airoha_ppe.c           |  14 +-
 drivers/net/ethernet/mediatek/mtk_ppe_offload.c    |  13 +-
 include/linux/netdevice.h                          |   5 +-
 include/net/netfilter/nf_flow_table.h              |   1 +
 net/core/dev.c                                     |   6 +-
 net/ipv4/ipip.c                                    |   1 +
 net/ipv6/ip6_tunnel.c                              |   6 +-
 net/ipv6/sit.c                                     |  26 ++
 net/netfilter/nf_flow_table_core.c                 |  14 +-
 net/netfilter/nf_flow_table_ip.c                   | 386 +++++++++++++--------
 net/netfilter/nf_flow_table_path.c                 |  16 +-
 tools/testing/selftests/net/netfilter/config       |   1 +
 .../selftests/net/netfilter/nft_flowtable.sh       |  78 ++++-
 13 files changed, 402 insertions(+), 165 deletions(-)
---
base-commit: c1e5127b577c6b88fa48e532616932ae978528d5
change-id: 20260505-b4-flowtable-sw-accel-ip6ip-7101034cd147

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>



* [PATCH net-next v3] net: bridge: replace simple_strtoul with kstrtoul
From: Aadarsh Chandra @ 2026-05-06 17:24 UTC (permalink / raw)
  To: razor, idosch; +Cc: davem, kuba, netdev, bridge, linux-kernel, Aadarsh Chandra

The simple_strtoul() function is deprecated. It does not handle
errors or overflows correctly. Replace it with kstrtoul() in
brport_store() to ensure that invalid user input is caught and
returned as an error.
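
The behavioural difference can be approximated in user space with the
sketch below. strict_strtoul() is a hypothetical helper built on the
C library strtoul(); it is not the kernel implementation of
kstrtoul(), it only imitates the strictness: the whole string must be
a number, optionally followed by one newline.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space approximation of kstrtoul()-style parsing.
 * Returns 0 and stores the value on success, -EINVAL for malformed
 * input and -ERANGE on overflow.
 */
static int strict_strtoul(const char *s, int base, unsigned long *res)
{
	unsigned long val;
	char *endp;

	/* strtoul() would silently skip whitespace and accept '-'. */
	if (isspace((unsigned char)s[0]) || s[0] == '-')
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &endp, base);
	if (endp == s)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit */
	if (*endp == '\n')
		endp++;			/* tolerate one trailing newline */
	if (*endp != '\0')
		return -EINVAL;		/* trailing garbage, e.g. "12abc" */

	*res = val;
	return 0;
}
```

The old check (endp == buf) only rejected input with no leading
digits, so a string such as "12abc" was accepted as 12. With the
strict variant the same input is reported as an error.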

Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Aadarsh Chandra <aadarsh.official.xz@gmail.com>
---
v3: move change log below the line and add Acked-by.
v2: target net-next and simplify by reusing the ret variable as
    suggested by Nikolay Aleksandrov.
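
For context, a rough userspace sketch of why the old `endp == buf` check
was too permissive. This is not kernel code: parse_ulong_loose() and
parse_ulong_strict() are hypothetical helpers built on libc strtoul(),
and the strict one only approximates kstrtoul() (the whole string must
parse, one trailing newline is tolerated, overflow is reported).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Old-style check: fails only when no digits were consumed at all,
 * so "12abc" parses as 12 and overflow goes unnoticed.
 */
static int parse_ulong_loose(const char *s, unsigned long *res)
{
	char *endp;

	assert(s && res);
	*res = strtoul(s, &endp, 0);
	return endp == s ? -EINVAL : 0;
}

/* Hypothetical strict variant approximating kstrtoul() semantics:
 * the whole string must be a number, optionally followed by one '\n'.
 */
static int parse_ulong_strict(const char *s, unsigned long *res)
{
	unsigned long val;
	char *endp;

	assert(s && res);
	/* strtoul() would skip whitespace and accept '-'; reject both */
	if (isspace((unsigned char)s[0]) || s[0] == '-')
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &endp, 0);
	if (endp == s)
		return -EINVAL;
	if (errno == ERANGE)
		return -ERANGE;
	if (*endp == '\n')
		endp++;
	if (*endp != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}
```

With these helpers, "12abc" is accepted as 12 by the loose variant but
rejected with -EINVAL by the strict one, and an out-of-range value gives
-ERANGE instead of a silently clamped result.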

 net/bridge/br_sysfs_if.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 1f57c36a7fc0..cdecc7d1260c 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -318,7 +318,6 @@ static ssize_t brport_store(struct kobject *kobj,
        struct net_bridge_port *p = kobj_to_brport(kobj);
        ssize_t ret = -EINVAL;
        unsigned long val;
-       char *endp;
 
        if (!ns_capable(dev_net(p->dev)->user_ns, CAP_NET_ADMIN))
                return -EPERM;
@@ -339,8 +338,8 @@ static ssize_t brport_store(struct kobject *kobj,
                spin_unlock_bh(&p->br->lock);
                kfree(buf_copy);
        } else if (brport_attr->store) {
-               val = simple_strtoul(buf, &endp, 0);
-               if (endp == buf)
+               ret = kstrtoul(buf, 0, &val);
+               if (ret)
                        goto out_unlock;
                spin_lock_bh(&p->br->lock);
                ret = brport_attr->store(p, val);
-- 
2.54.0


* Re: [PATCH nf-next 0/4] Add IPv4 over IPv6 flowtable SW acceleration
From: Lorenzo Bianconi @ 2026-05-06 17:27 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Felix Fietkau, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, David Ahern,
	Ido Schimmel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shuah Khan
  Cc: linux-arm-kernel, linux-mediatek, netdev, netfilter-devel,
	coreteam, linux-kselftest
In-Reply-To: <20260505-b4-flowtable-sw-accel-ip6ip-v1-0-9ac39ccc9ea9@kernel.org>


> Similar to IPIP and IP6I6 tunnels, introduce sw acceleration for IPv4 over
> IPv6 tunnels in the netfilter flowtable infrastructure.

Please drop this revision: I spotted a bug in the MTU configuration in
nf_flow_offload_forward(). I will post a v2 that fixes the issue and adds
SIT support.

Regards,
Lorenzo

> 
> ---
> Lorenzo Bianconi (4):
>       net: netfilter: Add ether_type to net_device_path_ctx
>       net: netfilter: Add encap_proto to flow_offload_tunnel
>       net: netfilter: Add IPv4 over IPv6 tunnel flowtable acceleration
>       selftests: netfilter: nft_flowtable.sh: Add IPv4 over IPv6 flowtable selftest
> 
>  drivers/net/ethernet/airoha/airoha_ppe.c           |  14 ++-
>  drivers/net/ethernet/mediatek/mtk_ppe_offload.c    |  13 ++-
>  include/linux/netdevice.h                          |   5 +-
>  include/net/netfilter/nf_flow_table.h              |   1 +
>  net/core/dev.c                                     |   6 +-
>  net/ipv4/ipip.c                                    |   1 +
>  net/ipv6/ip6_tunnel.c                              |   6 +-
>  net/netfilter/nf_flow_table_core.c                 |  14 ++-
>  net/netfilter/nf_flow_table_ip.c                   | 129 ++++++++++++++++-----
>  net/netfilter/nf_flow_table_path.c                 |  16 +--
>  .../selftests/net/netfilter/nft_flowtable.sh       |  26 +++++
>  11 files changed, 174 insertions(+), 57 deletions(-)
> ---
> base-commit: c1e5127b577c6b88fa48e532616932ae978528d5
> change-id: 20260505-b4-flowtable-sw-accel-ip6ip-7101034cd147
> 
> Best regards,
> -- 
> Lorenzo Bianconi <lorenzo@kernel.org>
> 



* Re: [PATCH ipsec-next] xfrm: Use regular error handling instead of BUG_ON() in the netlink API.
From: Sabrina Dubroca @ 2026-05-06 17:20 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, devel
In-Reply-To: <aftnl6d59WUM_Dfm@secunet.com>

2026-05-06, 18:08:55 +0200, Steffen Klassert wrote:
> The xfrm netlink API has used BUG_ON() on failures since it was introduced.
> However, all these errors are non-critical and can be handled
> with regular error handling. This fixes machine crashes
> in situations where an emergency brake is not needed.

Nice clean up!

[...]
> @@ -1794,7 +1797,11 @@ static int xfrm_get_sadinfo(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		return -ENOMEM;
>  
>  	err = build_sadinfo(r_skb, net, sportid, seq, *flags);
> -	BUG_ON(err < 0);
> +	if (err < 0) {
> +		kfree_skb(r_skb);
> +		return err;
> +	}
> +

nit: extra blank line

>  
>  	return nlmsg_unicast(xfrm_net_nlsk(net, skb), r_skb, sportid);
>  }

[...]
> @@ -4151,7 +4170,8 @@ static int xfrm_send_report(struct net *net, u8 proto,
>  		return -ENOMEM;
>  
>  	err = build_report(skb, proto, sel, addr);
> -	BUG_ON(err < 0);
> +	if (err < 0)
> +		return err;

This one is missing the kfree_skb(skb).
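
To illustrate the ownership rule behind this comment, here is a small
userspace model, not xfrm code. struct buf, buf_alloc(), buf_free(),
build() and send_and_consume() are hypothetical stand-ins for the skb,
build_report() and the multicast helper. The sketch assumes the send
helper takes ownership of the buffer, so the only path that has to free
explicitly is the build failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb; only ownership matters here. */
struct buf {
	int len;
};

static int live_bufs;	/* allocated and not yet freed */

static struct buf *buf_alloc(void)
{
	struct buf *b = calloc(1, sizeof(*b));

	if (b)
		live_bufs++;
	return b;
}

static void buf_free(struct buf *b)
{
	assert(b);
	free(b);
	live_bufs--;
}

/* Model of the build step: fails on request. */
static int build(struct buf *b, int fail)
{
	if (fail)
		return -EMSGSIZE;
	b->len = 1;
	return 0;
}

/* Model of the send step: assumed to consume the buffer. */
static int send_and_consume(struct buf *b)
{
	buf_free(b);
	return 0;
}

static int send_report(int fail)
{
	struct buf *b = buf_alloc();
	int err;

	if (!b)
		return -ENOMEM;

	err = build(b, fail);
	if (err < 0) {
		buf_free(b);	/* without this, the buffer leaks */
		return err;
	}

	return send_and_consume(b);
}
```

In this model live_bufs returns to zero on both the success and the
failure path; dropping the buf_free() in the error branch leaves it at
one after a failed build.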

>  
>  	return xfrm_nlmsg_multicast(net, skb, 0, XFRMNLGRP_REPORT);
>  }

-- 
Sabrina


* Re: [PATCH v13 6/6] selftests: net: add TLS hardware offload test
From: Sabrina Dubroca @ 2026-05-06 17:07 UTC (permalink / raw)
  To: Rishikesh Jethwani
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, kuba,
	davem, pabeni, edumazet, leon
In-Reply-To: <20260429181016.3164935-7-rjethwani@purestorage.com>

2026-04-29, 12:10:16 -0600, Rishikesh Jethwani wrote:
[...]
> +static int do_client(void)
> +{

Both this and do_server could be refactored into a few nice helpers
(prep the socket up to connect+ulp/accept+ulp, echo handling, rekey,
payload verification on RX).

[...]
> +	for (i = 1; i <= num_iterations; i++) {
> +		int this_size;
> +
> +		if (random_size_max > 0)
> +			this_size = (rand() % random_size_max) + 1;
> +		else
> +			this_size = send_size;
> +
> +		/* In burst mode, use a per-iteration fill pattern so the
> +		 * receiver can detect any plaintext corruption without a
> +		 * round-trip echo.
> +		 */
> +		if (burst_mode) {
> +			memset(buf, i & 0xFF, this_size);
> +		} else {
> +			for (j = 0; j < this_size; j++)
> +				buf[j] = rand() & 0xFF;
> +		}
> +
> +		n = send(csk, buf, this_size, 0);
> +		if (n != this_size) {
> +			printf("FAIL: send failed: %s\n", strerror(errno));
> +			goto out;
> +		}
> +		/* Throttle per-iteration progress lines on long burst runs so
> +		 * stdout over ssh doesn't become the bottleneck.

I'm not sure those lines have enough benefit in burst mode to be worth
printing. (same on the server side)


> +static int do_server(void)
> +{
[...]
> +	while (1) {
[...]
> +		/* Burst mode: verify payload matches the client's fill
> +		 * pattern. TLS record boundaries may differ from send()
> +		 * boundaries, so walk the received buffer in chunks that
> +		 * fit within the current iteration's remaining bytes.
> +		 * Catches decrypt-succeeded-but-plaintext-corrupt bugs
> +		 * that AEAD counters alone would miss.
> +		 */
> +		if (burst_mode) {
> +			int off = 0;
> +
> +			while (off < n) {

This would be a good deal simpler if you passed MSG_WAITALL to the
recvmsg call in burst mode. Then you'd get the full chunk of data for
that iteration.
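
As a sketch of that suggestion, assuming a plain stream socket rather
than the kTLS setup in the test: recv_full() and demo_waitall() are
hypothetical names, and the AF_UNIX socketpair only stands in for the
TCP connection. With MSG_WAITALL the receiver blocks until the whole
per-iteration chunk is there, so the pattern check becomes a single loop
over the buffer. How this interacts with TLS control records such as
KeyUpdate is not modelled here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask for exactly len bytes in one call. MSG_WAITALL can still return
 * short on EOF, error, or a signal, so callers must check the result.
 */
static ssize_t recv_full(int fd, void *buf, size_t len)
{
	return recv(fd, buf, len, MSG_WAITALL);
}

/* Hypothetical demo: a child sends one "iteration" of len bytes in two
 * delayed halves, the parent collects it with a single recv_full().
 * Returns 1 if the whole chunk arrived and carries the fill pattern.
 */
static int demo_waitall(unsigned char pattern, size_t len)
{
	unsigned char buf[4096];
	int sv[2], ok = 0;
	ssize_t n;
	size_t i;
	pid_t pid;

	assert(len >= 2 && len <= sizeof(buf));
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 0;

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return 0;
	}
	if (pid == 0) {
		/* child: sender */
		close(sv[0]);
		memset(buf, pattern, len);
		send(sv[1], buf, len / 2, 0);
		usleep(50000);	/* make the receiver wait for the rest */
		send(sv[1], buf + len / 2, len - len / 2, 0);
		_exit(0);
	}

	/* parent: receiver */
	close(sv[1]);
	n = recv_full(sv[0], buf, len);
	if (n == (ssize_t)len) {
		ok = 1;
		for (i = 0; i < len; i++)
			if (buf[i] != pattern)
				ok = 0;
	}
	close(sv[0]);
	waitpid(pid, NULL, 0);
	return ok;
}
```

The receiver no longer has to track an offset across partial reads; a
short return simply means EOF, an error, or an interrupted call.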

[...]
> +static void print_usage(const char *prog)
> +{
> +	printf("TLS Hardware Offload Two-Node Test\n\n");
> +	printf("Usage:\n");
> +	printf("  %s server [OPTIONS]\n", prog);
> +	printf("  %s client -s <ip> [OPTIONS]\n", prog);
> +	printf("\nOptions:\n");
> +	printf("  -s <ip>       Server IPv4 address (client, required)\n");
> +	printf("  -p <port>     Server port (default: 4433)\n");
> +	printf("  -b <size>     Send buffer size in bytes (default: 16384)\n");
> +	printf("  -r <max>      Use random send buffer sizes (1..<max>)\n");
> +	printf("  -v <version>  TLS version: 1.2 or 1.3 (default: 1.3)\n");
> +	printf("  -c <cipher>   Cipher: 128 or 256 (default: 128)\n");
> +	printf("  -n <N>        Number of send/echo iterations (default: 100)\n");
> +	printf("  -k <N>        Perform N rekeys (client only, TLS 1.3; N < iterations)\n");
> +	printf("  -B            Burst mode: client sends continuously without echo;\n");
> +	printf("                server drains and handles KeyUpdate without responding.\n");
> +	printf("  -Z            Enable zero-copy RX (TLS_RX_EXPECT_NO_PAD);\n");

This is misleading, since zero-copy RX will be enabled by default for 1.2.



> diff --git a/tools/testing/selftests/drivers/net/hw/tls_hw_offload.py b/tools/testing/selftests/drivers/net/hw/tls_hw_offload.py
> new file mode 100755
> index 000000000000..f12da0e66afd
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/hw/tls_hw_offload.py
[...]
> +def verify_tls_counters(stats_before, stats_after, expected_rekeys,
> +                        tls_role, is_dut, burst=False):
> +    """Verify TLS counters on one side of the connection.

Even with the introduction of the check_* helpers, this function still
has a lot of copy-pasted code that differs only by role and test mode.

> +    tls_role: 'client' or 'server' (TLS role this side played).
> +    is_dut: True for the local DUT; requires HW offload counters.
> +    burst: burst mode - only the TLS client rotates its TX key; the TLS
> +           server only follows with an RX rotation on KeyUpdate receipt.
> +    """
> +    errors = 0
> +    role = 'DUT' if is_dut else 'Peer'
> +
> +    # In burst mode only one direction carries TLS traffic per side
> +    # (TLS client sends, TLS server receives). Check HW offload only on
> +    # the active direction(s); require HW on the DUT's active direction.
> +    if burst:
> +        if tls_role == 'client':
> +            errors += check_path(stats_before, stats_after, 'Tx', role,
> +                                 require_hw=is_dut)
> +        else:
> +            errors += check_path(stats_before, stats_after, 'Rx', role,
> +                                 require_hw=is_dut)
> +    else:
> +        errors += check_path(stats_before, stats_after, 'Tx', role,
> +                             require_hw=is_dut)
> +        errors += check_path(stats_before, stats_after, 'Rx', role,
> +                             require_hw=is_dut)

    # in burst mode, client does TX and server only does RX
    # otherwise, both sides do both TX and RX
    with_tx = not burst or tls_role == 'client'
    with_rx = not burst or tls_role != 'client'

    if with_tx:
        errors += check_path(stats_before, stats_after, 'Tx', role,
                             require_hw=is_dut)
    if with_rx:
        errors += check_path(stats_before, stats_after, 'Rx', role,
                             require_hw=is_dut)

> +    if expected_rekeys > 0:
> +        if burst:
> +            if tls_role == 'client':
> +                errors += check_min(stats_before, stats_after,
> +                                    'TlsTxRekeyOk', expected_rekeys, role)
> +                errors += check_zero(stats_before, stats_after,
> +                                     'TlsTxRekeyError', role)

and same for those

> +            else:
> +                errors += check_min(stats_before, stats_after,
> +                                    'TlsRxRekeyOk', expected_rekeys, role)
> +                errors += check_min(stats_before, stats_after,
> +                                    'TlsRxRekeyReceived', expected_rekeys,
> +                                    role)
> +                errors += check_zero(stats_before, stats_after,
> +                                     'TlsRxRekeyError', role)
> +        else:
> +            errors += check_min(stats_before, stats_after,
> +                                'TlsTxRekeyOk', expected_rekeys, role)
> +            errors += check_min(stats_before, stats_after,
> +                                'TlsRxRekeyOk', expected_rekeys, role)
> +            if tls_role == 'server':
> +                errors += check_min(stats_before, stats_after,
> +                                    'TlsRxRekeyReceived', expected_rekeys,
> +                                    role)

Why are you restricting this to the server? The client should get as
many rekeys as it sends.

> +            errors += check_zero(stats_before, stats_after,
> +                                 'TlsTxRekeyError', role)
> +            errors += check_zero(stats_before, stats_after,
> +                                 'TlsRxRekeyError', role)
> +
> +    errors += check_zero(stats_before, stats_after, 'TlsDecryptError', role)
> +
> +    return errors == 0

-- 
Sabrina


* Re: [PATCH v4] Bluetooth: serialize accept_q access
From: Jann Horn @ 2026-05-06 17:04 UTC (permalink / raw)
  To: Ren Wei, luiz.dentz
  Cc: linux-bluetooth, netdev, marcel, davem, edumazet, kuba, pabeni,
	horms, yuantan098, yifanwucs, tomapufckgml, bird, wangjiexun2025
In-Reply-To: <20260506114338.2873496-1-n05ec@lzu.edu.cn>

[-- Attachment #1: Type: text/plain, Size: 10085 bytes --]

On Wed, May 6, 2026 at 1:43 PM Ren Wei <n05ec@lzu.edu.cn> wrote:
> bt_sock_poll() walks the accept queue without synchronization, while
> child teardown can unlink the same socket and drop its last reference.
> The unsynchronized accept queue walk has existed since the initial
> Bluetooth import.
>
> Protect accept_q with a dedicated lock for queue updates and polling.
> Also rework bt_accept_dequeue() to take temporary child references under
> the queue lock before dropping it and locking the child socket.
>
> Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Reported-by: Jann Horn <jannh@google.com>
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>

The patch looks good to me. I have some comments below, but they're
not important - from my perspective, this patch is ready to land in
the tree.

Reviewed-by: Jann Horn <jannh@google.com>

> ---
> Changes in v4:
> - no functional changes
> - clarify that the race dates back to the initial Bluetooth import
> - update trailers
>   I noticed Jann also proposed a fix at
>   https://patchwork.kernel.org/project/bluetooth/patch/20260504-bluetooth-accept-uaf-fix-v1-1-1ca63c0efadd@google.com/,
>   so we're adding his Reported-by tag here. Please let us know if this
>   isn't appropriate.

Thanks for letting me know that my patch was redundant, and for
listing me in Reported-by.
This addresses the race I described.
(You could add the line
"Closes: https://lore.kernel.org/r/20260504-bluetooth-accept-uaf-fix-v1-1-1ca63c0efadd@google.com"
after the "Reported-by: Jann Horn <jannh@google.com>".)

[...]
> @@ -254,45 +258,72 @@ EXPORT_SYMBOL(bt_accept_enqueue);
>   */
>  void bt_accept_unlink(struct sock *sk)
>  {
> +       struct sock *parent = bt_sk(sk)->parent;
> +
>         BT_DBG("sk %p state %d", sk, sk->sk_state);
>
> +       spin_lock_bh(&bt_sk(parent)->accept_q_lock);
>         list_del_init(&bt_sk(sk)->accept_q);
> -       sk_acceptq_removed(bt_sk(sk)->parent);
> +       sk_acceptq_removed(parent);
> +       spin_unlock_bh(&bt_sk(parent)->accept_q_lock);
>         bt_sk(sk)->parent = NULL;
>         sock_put(sk);
>  }
>  EXPORT_SYMBOL(bt_accept_unlink);
>
> +static struct sock *bt_accept_get(struct sock *parent, struct sock *sk)
> +{
> +       struct bt_sock *bt = bt_sk(parent);
> +       struct sock *next = NULL;
> +
> +       /* accept_q is modified from child teardown paths too, so take a
> +        * temporary reference before dropping the queue lock.
> +        */
> +       spin_lock_bh(&bt->accept_q_lock);
> +
> +       if (sk) {
> +               if (bt_sk(sk)->parent != parent)
> +                       goto out;

This check seems redundant? The caller already bailed out if
"bt_sk(sk)->parent != parent", and lock_sock(sk) ensures that
bt_sk(sk)->parent can't change concurrently because bt_accept_unlink()
is also protected by lock_sock() or lock_sock_nested(), as the comment
above bt_accept_unlink() documents.

> +
> +               if (!list_is_last(&bt_sk(sk)->accept_q, &bt->accept_q)) {
> +                       next = &list_next_entry(bt_sk(sk), accept_q)->sk;
> +                       sock_hold(next);
> +               }
> +       } else if (!list_empty(&bt->accept_q)) {
> +               next = &list_first_entry(&bt->accept_q,
> +                                        struct bt_sock, accept_q)->sk;
> +               sock_hold(next);
> +       }
> +
> +out:
> +       spin_unlock_bh(&bt->accept_q_lock);
> +       return next;
> +}

Hmm. This looks a bit complicated to me, and I find it hard to reason
about how accept_q walks are restarted after temporarily dropping the
lock; I think it would be nice if you could instead walk the
->accept_q while holding the accept_q_lock until you identify a socket
with the right ->sk_state, then drop the accept_q_lock and lock the
sock. Something like this diff on top of your patch (completely
untested); I have attached a properly formatted version of this diff
that you can apply with "git apply":

```
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index 9d68dd86023c..26e7c7198522 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -271,50 +271,36 @@ void bt_accept_unlink(struct sock *sk)
 }
 EXPORT_SYMBOL(bt_accept_unlink);

-static struct sock *bt_accept_get(struct sock *parent, struct sock *sk)
-{
-       struct bt_sock *bt = bt_sk(parent);
-       struct sock *next = NULL;
-
-       /* accept_q is modified from child teardown paths too, so take a
-        * temporary reference before dropping the queue lock.
-        */
-       spin_lock_bh(&bt->accept_q_lock);
-
-       if (sk) {
-               if (bt_sk(sk)->parent != parent)
-                       goto out;
-
-               if (!list_is_last(&bt_sk(sk)->accept_q, &bt->accept_q)) {
-                       next = &list_next_entry(bt_sk(sk), accept_q)->sk;
-                       sock_hold(next);
-               }
-       } else if (!list_empty(&bt->accept_q)) {
-               next = &list_first_entry(&bt->accept_q,
-                                        struct bt_sock, accept_q)->sk;
-               sock_hold(next);
-       }
-
-out:
-       spin_unlock_bh(&bt->accept_q_lock);
-       return next;
-}
-
 struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
 {
-       struct sock *sk, *next;
+       struct bt_sock *s;
+       struct sock *sk;

        BT_DBG("parent %p", parent);

 restart:
-       for (sk = bt_accept_get(parent, NULL); sk; sk = next) {
+       spin_lock_bh(&bt_sk(parent)->accept_q_lock);
+       list_for_each_entry(s, &bt_sk(parent)->accept_q, accept_q) {
+               unsigned char state;
+
+               sk = &s->sk;
+
+               /* lockless version of the checks below */
+               state = data_race(READ_ONCE(sk->sk_state));
+               if (state != BT_CLOSED && state != BT_CONNECTED && newsock &&
+                   !test_bit(BT_SK_DEFER_SETUP, &bt_sk(parent)->flags))
+                       continue;
+
                /* Prevent early freeing of sk due to unlink and sock_kill */
+               sock_hold(sk);
+               spin_unlock_bh(&bt_sk(parent)->accept_q_lock);
                lock_sock(sk);
+               /* socket is now locked, redo checks reliably */

                /* Check sk has not already been unlinked via
                 * bt_accept_unlink() due to serialisation caused by sk locking
                 */
-               if (bt_sk(sk)->parent != parent) {
+               if (s->parent != parent) {
                        BT_DBG("sk %p, already unlinked", sk);
                        release_sock(sk);
                        sock_put(sk);
@@ -322,8 +308,6 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
                        goto restart;
                }

-               next = bt_accept_get(parent, sk);
-
                /* sk is safely in the parent list so reduce reference count */
                sock_put(sk);

@@ -331,7 +315,7 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
                if (sk->sk_state == BT_CLOSED) {
                        bt_accept_unlink(sk);
                        release_sock(sk);
-                       continue;
+                       goto restart;
                }

                if (sk->sk_state == BT_CONNECTED || !newsock ||
@@ -341,12 +325,11 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
                                sock_graft(sk, newsock);

                        release_sock(sk);
-                       if (next)
-                               sock_put(next);
                        return sk;
                }

                release_sock(sk);
+               goto restart;
        }

        return NULL;
```

I think this makes the code simpler, and it reduces the line count;
however, I think your approach is okay too, so it would also be fine
to keep your approach if you prefer.
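
The shape of that locked walk, reduced to a userspace sketch with
pthreads. Everything here (struct child, struct parent, enqueue(),
unlink_child(), dequeue_ready()) is hypothetical and only models the
locking order: scan under the queue lock, pin the candidate, drop the
queue lock, take the per-object lock, recheck, and restart the walk
whenever the queue lock was dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { ST_PENDING, ST_READY, ST_CLOSED };

struct parent;

/* Hypothetical stand-in for a child socket on an accept queue. */
struct child {
	struct child *next;	/* protected by parent->q_lock */
	struct parent *parent;	/* protected by lock, NULL once unlinked */
	pthread_mutex_t lock;
	atomic_int refs;
	atomic_int state;
};

struct parent {
	pthread_mutex_t q_lock;
	struct child *head;
};

static void enqueue(struct parent *p, struct child *c, int state)
{
	pthread_mutex_init(&c->lock, NULL);
	atomic_init(&c->refs, 2);	/* one for the owner, one for the queue */
	atomic_init(&c->state, state);
	c->parent = p;

	pthread_mutex_lock(&p->q_lock);
	c->next = p->head;
	p->head = c;
	pthread_mutex_unlock(&p->q_lock);
}

/* Caller holds c->lock. Drops the queue's reference. */
static void unlink_child(struct child *c)
{
	struct parent *p = c->parent;
	struct child **pp;

	assert(p != NULL);
	pthread_mutex_lock(&p->q_lock);
	for (pp = &p->head; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			break;
		}
	}
	c->next = NULL;
	pthread_mutex_unlock(&p->q_lock);

	c->parent = NULL;
	atomic_fetch_sub(&c->refs, 1);
}

static struct child *dequeue_ready(struct parent *p)
{
	struct child *c;
	int state;

restart:
	pthread_mutex_lock(&p->q_lock);
	for (c = p->head; c; c = c->next) {
		/* unlocked peek, redone below under c->lock */
		if (atomic_load(&c->state) == ST_PENDING)
			continue;

		atomic_fetch_add(&c->refs, 1);	/* pin before dropping q_lock */
		pthread_mutex_unlock(&p->q_lock);
		pthread_mutex_lock(&c->lock);

		if (c->parent != p) {		/* unlinked while we waited */
			pthread_mutex_unlock(&c->lock);
			atomic_fetch_sub(&c->refs, 1);
			goto restart;
		}
		atomic_fetch_sub(&c->refs, 1);	/* still queued, unpin */

		state = atomic_load(&c->state);
		if (state != ST_PENDING)
			unlink_child(c);
		pthread_mutex_unlock(&c->lock);
		if (state == ST_READY)
			return c;
		goto restart;	/* q_lock was dropped, walk again */
	}
	pthread_mutex_unlock(&p->q_lock);
	return NULL;
}
```

Because the walk never continues from a stale list position, there is no
need to hold a reference on a "next" entry across the unlocked window.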

[...]
> @@ -518,18 +551,28 @@ EXPORT_SYMBOL(bt_sock_stream_recvmsg);
>
>  static inline __poll_t bt_accept_poll(struct sock *parent)
>  {
> -       struct bt_sock *s, *n;
> +       struct bt_sock *bt = bt_sk(parent);
> +       struct bt_sock *s;
>         struct sock *sk;
> +       __poll_t mask = 0;
> +
> +       spin_lock_bh(&bt->accept_q_lock);
> +       list_for_each_entry(s, &bt->accept_q, accept_q) {
> +               int state;
>
> -       list_for_each_entry_safe(s, n, &bt_sk(parent)->accept_q, accept_q) {
>                 sk = (struct sock *)s;
> -               if (sk->sk_state == BT_CONNECTED ||
> -                   (test_bit(BT_SK_DEFER_SETUP, &bt_sk(parent)->flags) &&
> -                    sk->sk_state == BT_CONNECT2))
> -                       return EPOLLIN | EPOLLRDNORM;
> +               state = READ_ONCE(sk->sk_state);

nitpick: This READ_ONCE() is not synchronized with a corresponding
WRITE_ONCE(); that's not really clean, and it might be appropriate to
mark this with data_race() if this is intentionally racy with
potentially-torn stores. But that's a minor detail.


> +
> +               if (state == BT_CONNECTED ||
> +                   (test_bit(BT_SK_DEFER_SETUP, &bt->flags) &&
> +                    state == BT_CONNECT2)) {
> +                       mask = EPOLLIN | EPOLLRDNORM;
> +                       break;
> +               }
>         }
> +       spin_unlock_bh(&bt->accept_q_lock);
>
> -       return 0;
> +       return mask;
>  }
>
>  __poll_t bt_sock_poll(struct file *file, struct socket *sock,
> --
> 2.34.1
>

[-- Attachment #2: locked-walk.diff --]
[-- Type: application/x-patch, Size: 2914 bytes --]
