Netdev List


* [PATCH v10 07/12] static_call: Add EXPORT_STATIC_CALL_FOR_MODULES()
From: Pawan Gupta @ 2026-04-14  7:07 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com>

There is EXPORT_STATIC_CALL_TRAMP() that hides the static key from all
modules. But there is no equivalent of EXPORT_SYMBOL_FOR_MODULES() to
restrict symbol visibility to only certain modules.

Add EXPORT_STATIC_CALL_FOR_MODULES(name, mods) that wraps both the key and
the trampoline with EXPORT_SYMBOL_FOR_MODULES(), allowing only a limited
set of modules to see and update the static key.

The immediate user is KVM, in the following commit.

checkpatch reports the warnings below for this change. I believe they do
not apply in this case:

  include/linux/static_call.h:219: WARNING: Non-declarative macros with multiple statements should be enclosed in a do - while loop
  include/linux/static_call.h:220: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
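
For illustration only, the shape of the new macro can be modeled in
userspace. This is a sketch, not the kernel implementation:
MOCK_EXPORT_FOR_MODULES(), MOCK_KEY(), MOCK_TRAMP(), struct mock_export and
the key_/tramp_ prefixes are hypothetical stand-ins for
EXPORT_SYMBOL_FOR_MODULES(), STATIC_CALL_KEY() and STATIC_CALL_TRAMP(). It
shows that the macro expands to two file-scope declarations, which is
presumably why the do-while suggestion cannot be followed: a
do { } while (0) block is only valid inside a function body.

```c
#include <string.h>

/* Hypothetical record of one restricted export: symbol name plus allowed modules. */
struct mock_export {
	const char *symbol;
	const char *mods;
};

/* Stand-ins for STATIC_CALL_KEY()/STATIC_CALL_TRAMP(); the prefixes are made up. */
#define MOCK_KEY(name)		key_##name
#define MOCK_TRAMP(name)	tramp_##name

/* Two-level expansion so that MOCK_KEY(name) is expanded before token pasting. */
#define MOCK_EXPORT_FOR_MODULES(sym, mods)	MOCK_EXPORT_RECORD(sym, mods)
#define MOCK_EXPORT_RECORD(sym, mods)					\
	const struct mock_export mock_export_##sym = { #sym, mods }

/* Same shape as the macro in the patch: export the key, then the trampoline. */
#define MOCK_EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
	MOCK_EXPORT_FOR_MODULES(MOCK_KEY(name), mods);			\
	MOCK_EXPORT_FOR_MODULES(MOCK_TRAMP(name), mods)

/* Used at file scope, like the EXPORT_SYMBOL*() family. */
MOCK_EXPORT_STATIC_CALL_FOR_MODULES(demo, "kvm");

/* Returns 1 if both records exist and name the expected module list. */
int demo_exports_match(const char *mods)
{
	return strcmp(mock_export_key_demo.mods, mods) == 0 &&
	       strcmp(mock_export_tramp_demo.mods, mods) == 0;
}
```

The trailing semicolon is supplied by the user of the macro, so the
expansion ends up as two complete declarations.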

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 include/linux/static_call.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea..b610afd1ed55 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -216,6 +216,9 @@ extern long __static_call_return0(void);
 #define EXPORT_STATIC_CALL_GPL(name)					\
 	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
 	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
+#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
+	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
+	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
 
 /* Leave the key unexported, so modules can't change static call targets: */
 #define EXPORT_STATIC_CALL_TRAMP(name)					\
@@ -276,6 +279,9 @@ extern long __static_call_return0(void);
 #define EXPORT_STATIC_CALL_GPL(name)					\
 	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
 	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
+#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
+	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
+	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
 
 /* Leave the key unexported, so modules can't change static call targets: */
 #define EXPORT_STATIC_CALL_TRAMP(name)					\
@@ -346,6 +352,8 @@ static inline int static_call_text_reserved(void *start, void *end)
 
 #define EXPORT_STATIC_CALL(name)	EXPORT_SYMBOL(STATIC_CALL_KEY(name))
 #define EXPORT_STATIC_CALL_GPL(name)	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name))
+#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
+	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods)
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
 

-- 
2.34.1




* [PATCH v10 04/12] x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user
From: Pawan Gupta @ 2026-04-14  7:06 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com>

With the upcoming changes, x86_ibpb_exit_to_user will also be used when the
BHB clearing sequence is used. Rename it to cover both cases.

No functional change.

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Jon Kohler <jon@nutanix.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/include/asm/entry-common.h  | 6 +++---
 arch/x86/include/asm/nospec-branch.h | 2 +-
 arch/x86/kernel/cpu/bugs.c           | 4 ++--
 arch/x86/kvm/x86.c                   | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9..c45858db16c9 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -94,11 +94,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 	 */
 	choose_random_kstack_offset(rdtsc());
 
-	/* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
+	/* Avoid unnecessary reads of 'x86_predictor_flush_exit_to_user' */
 	if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
-	    this_cpu_read(x86_ibpb_exit_to_user)) {
+	    this_cpu_read(x86_predictor_flush_exit_to_user)) {
 		indirect_branch_prediction_barrier();
-		this_cpu_write(x86_ibpb_exit_to_user, false);
+		this_cpu_write(x86_predictor_flush_exit_to_user, false);
 	}
 }
 #define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 157eb69c7f0f..0381db59c39d 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -533,7 +533,7 @@ void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature)
 		: "memory");
 }
 
-DECLARE_PER_CPU(bool, x86_ibpb_exit_to_user);
+DECLARE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
 
 static inline void indirect_branch_prediction_barrier(void)
 {
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 2cb4a96247d8..002bf4adccc3 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -65,8 +65,8 @@ EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
  * be needed to before running userspace. That IBPB will flush the branch
  * predictor content.
  */
-DEFINE_PER_CPU(bool, x86_ibpb_exit_to_user);
-EXPORT_PER_CPU_SYMBOL_GPL(x86_ibpb_exit_to_user);
+DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
+EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
 
 u64 x86_pred_cmd __ro_after_init = PRED_CMD_IBPB;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd1c4a36b593..45d7cfedc507 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11464,7 +11464,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * may migrate to.
 	 */
 	if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER))
-		this_cpu_write(x86_ibpb_exit_to_user, true);
+		this_cpu_write(x86_predictor_flush_exit_to_user, true);
 
 	/*
 	 * Consume any pending interrupts, including the possible source of

-- 
2.34.1




* [PATCH v10 03/12] x86/bhi: Rename clear_bhb_loop() to clear_bhb_loop_nofence()
From: Pawan Gupta @ 2026-04-14  7:06 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com>

Rename clear_bhb_loop() to clear_bhb_loop_nofence() to reflect the recent
change that moved the LFENCE to the caller side.

Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Tested-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S            | 8 ++++----
 arch/x86/include/asm/nospec-branch.h | 6 +++---
 arch/x86/net/bpf_jit_comp.c          | 2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index bbd4b1c7ec04..1f56d086d312 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1532,7 +1532,7 @@ SYM_CODE_END(rewind_stack_and_make_dead)
  * Note, callers should use a speculation barrier like LFENCE immediately after
  * a call to this function to ensure BHB is cleared before indirect branches.
  */
-SYM_FUNC_START(clear_bhb_loop)
+SYM_FUNC_START(clear_bhb_loop_nofence)
 	ANNOTATE_NOENDBR
 	push	%rbp
 	mov	%rsp, %rbp
@@ -1570,6 +1570,6 @@ SYM_FUNC_START(clear_bhb_loop)
 5:
 	pop	%rbp
 	RET
-SYM_FUNC_END(clear_bhb_loop)
-EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
-STACK_FRAME_NON_STANDARD(clear_bhb_loop)
+SYM_FUNC_END(clear_bhb_loop_nofence)
+EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop_nofence)
+STACK_FRAME_NON_STANDARD(clear_bhb_loop_nofence)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 87b83ae7c97f..157eb69c7f0f 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -331,11 +331,11 @@
 
 #ifdef CONFIG_X86_64
 .macro CLEAR_BRANCH_HISTORY
-	ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_LOOP
+	ALTERNATIVE "", "call clear_bhb_loop_nofence; lfence", X86_FEATURE_CLEAR_BHB_LOOP
 .endm
 
 .macro CLEAR_BRANCH_HISTORY_VMEXIT
-	ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_VMEXIT
+	ALTERNATIVE "", "call clear_bhb_loop_nofence; lfence", X86_FEATURE_CLEAR_BHB_VMEXIT
 .endm
 #else
 #define CLEAR_BRANCH_HISTORY
@@ -389,7 +389,7 @@ extern void entry_untrain_ret(void);
 extern void write_ibpb(void);
 
 #ifdef CONFIG_X86_64
-extern void clear_bhb_loop(void);
+extern void clear_bhb_loop_nofence(void);
 #endif
 
 extern void (*x86_return_thunk)(void);
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 63d6c9fa5e80..f40e88f87273 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1619,7 +1619,7 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
 		EMIT1(0x51); /* push rcx */
 		ip += 2;
 
-		func = (u8 *)clear_bhb_loop;
+		func = (u8 *)clear_bhb_loop_nofence;
 		ip += x86_call_depth_emit_accounting(&prog, func, ip);
 
 		if (emit_call(&prog, func, ip))

-- 
2.34.1




* [PATCH v10 02/12] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-04-14  7:05 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com>

As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. This
was not an issue because these CPUs use the BHI_DIS_S hardware mitigation
in the kernel.

Now with VMSCAPE (BHI variant) it is also required to isolate branch
history between guests and userspace. Since BHI_DIS_S only protects the
kernel, the newer CPUs also use IBPB.

A cheaper alternative to the current IBPB mitigation is clear_bhb_loop().
But it currently does not clear enough BHB entries to be effective on newer
CPUs with larger BHB. At boot, dynamically set the loop count of
clear_bhb_loop() such that it is effective on newer CPUs too.

Introduce global loop counts, initializing them with appropriate values
based on the hardware feature X86_FEATURE_BHI_CTRL.
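
As a rough userspace sketch of the selection described above (not the
kernel code): struct bhb_seq_counts, bhb_select_counts() and
bhb_inner_iterations() are hypothetical names, and has_bhi_ctrl stands in
for cpu_feature_enabled(X86_FEATURE_BHI_CTRL). The counts 5/5 and 12/7 are
the ones used in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the two globals added by the patch (outer and inner loop count). */
struct bhb_seq_counts {
	uint8_t outer;
	uint8_t inner;
};

/* has_bhi_ctrl stands in for cpu_feature_enabled(X86_FEATURE_BHI_CTRL). */
struct bhb_seq_counts bhb_select_counts(bool has_bhi_ctrl)
{
	/* Default to the short sequence. */
	struct bhb_seq_counts c = { .outer = 5, .inner = 5 };

	if (has_bhi_ctrl) {
		/* Long sequence for CPUs with a larger BHB. */
		c.outer = 12;
		c.inner = 7;
	}
	return c;
}

/* Each outer iteration runs the inner loop to completion once. */
unsigned int bhb_inner_iterations(struct bhb_seq_counts c)
{
	return (unsigned int)c.outer * c.inner;
}
```

With these values the inner loop body runs 25 times for the short sequence
and 84 times for the long one.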

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S            |  8 +++++---
 arch/x86/include/asm/nospec-branch.h |  2 ++
 arch/x86/kernel/cpu/bugs.c           | 13 +++++++++++++
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a180a36ca0e..bbd4b1c7ec04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1536,7 +1536,9 @@ SYM_FUNC_START(clear_bhb_loop)
 	ANNOTATE_NOENDBR
 	push	%rbp
 	mov	%rsp, %rbp
-	movl	$5, %ecx
+
+	movzbl    bhb_seq_outer_loop(%rip), %ecx
+
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call	1f
 	jmp	5f
@@ -1556,8 +1558,8 @@ SYM_FUNC_START(clear_bhb_loop)
 	 * This should be ideally be: .skip 32 - (.Lret2 - 2f), 0xcc
 	 * but some Clang versions (e.g. 18) don't like this.
 	 */
-	.skip 32 - 18, 0xcc
-2:	movl	$5, %eax
+	.skip 32 - 20, 0xcc
+2:	movzbl  bhb_seq_inner_loop(%rip), %eax
 3:	jmp	4f
 	nop
 4:	sub	$1, %eax
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 70b377fcbc1c..87b83ae7c97f 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -548,6 +548,8 @@ DECLARE_PER_CPU(u64, x86_spec_ctrl_current);
 extern void update_spec_ctrl_cond(u64 val);
 extern u64 spec_ctrl_current(void);
 
+extern u8 bhb_seq_inner_loop, bhb_seq_outer_loop;
+
 /*
  * With retpoline, we must use IBRS to restrict branch prediction
  * before calling into firmware.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 83f51cab0b1e..2cb4a96247d8 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -2047,6 +2047,10 @@ enum bhi_mitigations {
 static enum bhi_mitigations bhi_mitigation __ro_after_init =
 	IS_ENABLED(CONFIG_MITIGATION_SPECTRE_BHI) ? BHI_MITIGATION_AUTO : BHI_MITIGATION_OFF;
 
+/* Default to short BHB sequence values */
+u8 bhb_seq_outer_loop __ro_after_init = 5;
+u8 bhb_seq_inner_loop __ro_after_init = 5;
+
 static int __init spectre_bhi_parse_cmdline(char *str)
 {
 	if (!str)
@@ -3242,6 +3246,15 @@ void __init cpu_select_mitigations(void)
 		x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
 	}
 
+	/*
+	 * Switch to long BHB clear sequence on newer CPUs (with BHI_CTRL
+	 * support), see Intel's BHI guidance.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
+		bhb_seq_outer_loop = 12;
+		bhb_seq_inner_loop = 7;
+	}
+
 	x86_arch_cap_msr = x86_read_arch_cap_msr();
 
 	cpu_print_attack_vectors();

-- 
2.34.1




* [PATCH v10 01/12] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
From: Pawan Gupta @ 2026-04-14  7:05 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc
In-Reply-To: <20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com>

Currently, the BHB clearing sequence is followed by an LFENCE to prevent
premature transient execution of subsequent indirect branches. However,
the LFENCE barrier can be unnecessary in certain cases, for example when
the kernel is using the BHI_DIS_S mitigation and BHB clearing is only
needed for userspace. In such cases, the LFENCE is redundant because ring
transitions would provide the necessary serialization.

Below is a quick recap of BHI mitigation options:

On Alder Lake and newer

    BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
    performance overhead.

    Long loop: Alternatively, a longer version of the BHB clearing sequence
    can be used to mitigate BHI. It can also be used to mitigate the BHI
    variant of VMSCAPE. This is not yet implemented in Linux.

On older CPUs

    Short loop: Clears BHB at kernel entry and VMexit. The "Long loop" is
    effective on older CPUs as well, but should be avoided because of
    unnecessary overhead.

On Alder Lake and newer CPUs, eIBRS isolates the indirect targets between
guest and host. But when affected by the BHI variant of VMSCAPE, a guest's
branch history may still influence indirect branches in userspace. This
also means the big hammer IBPB could be replaced with a cheaper option that
clears the BHB at exit-to-userspace after a VMexit.

In preparation for adding the support for the BHB sequence (without LFENCE)
on newer CPUs, move the LFENCE to the caller side after clear_bhb_loop() is
executed. Allow callers to decide whether they need the LFENCE or not. This
adds a few extra bytes to the call sites, but it obviates the need for
multiple variants of clear_bhb_loop().
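
A small userspace model of the caller-side decision, assuming GCC or Clang
inline asm: mock_clear_bhb(), mock_speculation_barrier(),
mock_kernel_entry() and mock_return_to_user() are hypothetical names, and
the counters exist only so the behavior can be observed. It is a sketch of
the idea, not of the real entry code.

```c
/* Counters so the model can be observed; not part of any kernel API. */
unsigned int bhb_clears;
unsigned int fences;

/* Stand-in for clear_bhb_loop(): after this patch it no longer ends in LFENCE. */
void mock_clear_bhb(void)
{
	bhb_clears++;
}

/* Speculation barrier: LFENCE on x86-64, a plain compiler barrier elsewhere. */
void mock_speculation_barrier(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ volatile("lfence" ::: "memory");
#elif defined(__GNUC__)
	__asm__ volatile("" ::: "memory");
#endif
	fences++;
}

/* Caller that keeps running kernel code afterwards: fence after clearing. */
void mock_kernel_entry(void)
{
	mock_clear_bhb();
	mock_speculation_barrier();
}

/* Caller about to return to userspace: relies on the ring transition instead. */
void mock_return_to_user(void)
{
	mock_clear_bhb();
}
```

Both callers share one clearing routine. Only the caller that needs the
barrier pays for it.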

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S            | 5 ++++-
 arch/x86/include/asm/nospec-branch.h | 4 ++--
 arch/x86/net/bpf_jit_comp.c          | 2 ++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..3a180a36ca0e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1528,6 +1528,9 @@ SYM_CODE_END(rewind_stack_and_make_dead)
  * refactored in the future if needed. The .skips are for safety, to ensure
  * that all RETs are in the second half of a cacheline to mitigate Indirect
  * Target Selection, rather than taking the slowpath via its_return_thunk.
+ *
+ * Note, callers should use a speculation barrier like LFENCE immediately after
+ * a call to this function to ensure BHB is cleared before indirect branches.
  */
 SYM_FUNC_START(clear_bhb_loop)
 	ANNOTATE_NOENDBR
@@ -1562,7 +1565,7 @@ SYM_FUNC_START(clear_bhb_loop)
 	sub	$1, %ecx
 	jnz	1b
 .Lret2:	RET
-5:	lfence
+5:
 	pop	%rbp
 	RET
 SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 4f4b5e8a1574..70b377fcbc1c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -331,11 +331,11 @@
 
 #ifdef CONFIG_X86_64
 .macro CLEAR_BRANCH_HISTORY
-	ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_LOOP
+	ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_LOOP
 .endm
 
 .macro CLEAR_BRANCH_HISTORY_VMEXIT
-	ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_VMEXIT
+	ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_VMEXIT
 .endm
 #else
 #define CLEAR_BRANCH_HISTORY
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e9b78040d703..63d6c9fa5e80 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1624,6 +1624,8 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
 
 		if (emit_call(&prog, func, ip))
 			return -EINVAL;
+		/* Don't speculate past this until BHB is cleared */
+		EMIT_LFENCE();
 		EMIT1(0x59); /* pop rcx */
 		EMIT1(0x58); /* pop rax */
 	}

-- 
2.34.1




* RE: [PATCH] tipc: Ensure NUL-termination of remote algorithm name
From: Tung Quang Nguyen @ 2026-04-14  7:05 UTC (permalink / raw)
  To: Dudu Lu; +Cc: jmaloy@redhat.com, netdev@vger.kernel.org
In-Reply-To: <20260413085852.76786-1-phx0fer@gmail.com>

>Subject: [PATCH] tipc: Ensure NUL-termination of remote algorithm name
>
>In tipc_crypto_key_rcv(), the algorithm name is copied from the incoming
>message using memcpy with a fixed size of TIPC_AEAD_ALG_NAME (32 bytes).
>If the remote peer sends a name that fills all 32 bytes without a NUL
>terminator, the alg_name field will not be NUL-terminated. This string is later
>passed to crypto_alloc_aead() which expects a NUL-terminated string,
>potentially causing an out-of-bounds read when the crypto subsystem
>searches for the algorithm by name.
TIPC only supports one algorithm name, "gcm(aes)", which is 8 bytes long.
So there is no "name that fills all 32 bytes" as you mentioned.
>
>Fix by explicitly NUL-terminating the last byte of alg_name after the memcpy.
>
>Fixes: 1ef6f7c9390f ("tipc: add automatic session key exchange")
>Signed-off-by: Dudu Lu <phx0fer@gmail.com>
>---
> net/tipc/crypto.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c
>index d3046a39ff72..ac072356bf0c 100644
>--- a/net/tipc/crypto.c
>+++ b/net/tipc/crypto.c
>@@ -2325,6 +2325,7 @@ static bool tipc_crypto_key_rcv(struct tipc_crypto *rx, struct tipc_msg *hdr)
> 	/* Copy key from msg data */
> 	skey->keylen = keylen;
> 	memcpy(skey->alg_name, data, TIPC_AEAD_ALG_NAME);
>+	skey->alg_name[TIPC_AEAD_ALG_NAME - 1] = '\0';
This is not needed as explained above.
> 	memcpy(skey->key, data + TIPC_AEAD_ALG_NAME + sizeof(__be32),
> 	       skey->keylen);
>
>--
>2.39.3 (Apple Git-145)
>
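
For readers following the thread, the pattern under discussion can be
sketched in isolation. This is a generic userspace illustration, not TIPC
code: struct mock_key, mock_copy_alg_name() and MOCK_ALG_NAME_LEN are
hypothetical, with the length set to the 32 bytes quoted for
TIPC_AEAD_ALG_NAME. It does not settle whether the extra store is needed
in TIPC.

```c
#include <string.h>

#define MOCK_ALG_NAME_LEN 32	/* size quoted for TIPC_AEAD_ALG_NAME */

struct mock_key {
	char alg_name[MOCK_ALG_NAME_LEN];
};

/* Copy a fixed-size name field from received data and force termination. */
void mock_copy_alg_name(struct mock_key *key, const unsigned char *data)
{
	memcpy(key->alg_name, data, MOCK_ALG_NAME_LEN);
	/* Guarantees a NUL even if the sender filled all 32 bytes. */
	key->alg_name[MOCK_ALG_NAME_LEN - 1] = '\0';
}
```

A well-formed name such as "gcm(aes)" is unaffected by the extra store,
while a name that fills the whole field is truncated to 31 characters.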



* [PATCH v10 00/12] VMSCAPE optimization for BHI variant
From: Pawan Gupta @ 2026-04-14  7:05 UTC (permalink / raw)
  To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
	David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Paolo Bonzini, Jonathan Corbet
  Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
	linux-doc

v10:
- Add patches to define EXPORT_STATIC_CALL_FOR_MODULES() and
  EXPORT_STATIC_CALL_FOR_KVM(), so that vmscape_predictor_flush static key
  is only accessible to KVM and not to other kernel modules. (PeterZ)
  (Borisov earlier objected to exporting the static key to all modules, but
  now the static key is only exported to KVM. I guess that resolves the
  concern.)
- Avoid an explicit call to vmscape_mitigation_enabled() and instead use
  static_call_query() in VMexit hot path. (Sean)
- Drop vmscape_mitigation_enabled(), as it is no longer needed.
- Rebased to v7.0

v9: https://lore.kernel.org/r/20260402-vmscape-bhb-v9-0-94d16bc29774@linux.intel.com
- Use global variables for BHB loop counters instead of ALTERNATIVE-based
  approach. (Dave & others)
- Use 32-bit registers (%eax/%ecx) for loop counters, loaded via movzbl
  from 8-bit globals. 8-bit registers (e.g. %ah in the inner loop) caused
  performance regression on certain CPUs due to partial-register stalls. (David Laight)
- Let BPF save/restore %rax/%rcx as in the original implementation, since
  it is the only caller that needs these registers preserved across the
  BHB clearing sequence.
- Drop Reviewed-by from patch 2/10 as the implementation changed significantly.
- Apply Tested-by from Jon Kohler to the series (except patch 2/10).
- Fix commit message grammar. (Borislav)
- Rebased to v7.0-rc6.
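
The switch to movzbl also explains why the padding in patch 02 changes from
".skip 32 - 18" to ".skip 32 - 20". A worked sketch of that arithmetic
follows. The instruction lengths are assumptions based on the usual short
x86-64 encodings, the "jnz 3b" that closes the inner loop is assumed
because it is not visible in the quoted hunks, and the function names are
made up for this example.

```c
/* Assumed instruction lengths in bytes (usual short x86-64 encodings). */
enum {
	LEN_MOVL_IMM32 = 5,	/* movl $imm32, %eax */
	LEN_MOVZBL_RIP = 7,	/* movzbl sym(%rip), %eax */
	LEN_JMP_SHORT  = 2,	/* jmp 4f */
	LEN_NOP        = 1,	/* nop */
	LEN_SUB_IMM8   = 3,	/* sub $1, %eax or %ecx */
	LEN_JNZ_SHORT  = 2,	/* jnz 3b or jnz 1b */
	HALF_CACHELINE = 32,
};

/* Bytes from label 2 up to, but not including, the RET at .Lret2. */
int bhb_inner_tail_len(int load_len)
{
	return load_len + LEN_JMP_SHORT + LEN_NOP +
	       LEN_SUB_IMM8 + LEN_JNZ_SHORT +	/* inner loop counter */
	       LEN_SUB_IMM8 + LEN_JNZ_SHORT;	/* outer loop counter */
}

/* Padding emitted by ".skip 32 - tail, 0xcc" in front of label 2. */
int bhb_skip_bytes(int load_len)
{
	return HALF_CACHELINE - bhb_inner_tail_len(load_len);
}
```

Under these assumptions the tail grows from 18 to 20 bytes, so the padding
shrinks from 14 to 12 bytes to compensate.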

v8: https://lore.kernel.org/r/20260324-vmscape-bhb-v8-0-68bb524b3ab9@linux.intel.com
- Use helper in KVM to convey the mitigation status. (PeterZ/Borisov)
- Fix the documentation for default vmscape mitigation. (BPF bot)
- Remove the stray lines in bugs.c. (BPF bot)
- Updated commit messages and comments.
- Rebased to v7.0-rc5.

v7: https://lore.kernel.org/r/20260319-vmscape-bhb-v7-0-b76a777a98af@linux.intel.com
- s/This allows/Allow/ and s/This does adds/This adds/ in patch 1/10 commit
  message (Borislav).
- Minimize register usage in BHB clearing seq. (David Laight)
  - Instead of separate ecx/eax counters, use al/ah.
  - Adjust the alignment of RET due to register size change.
  - save/restore rax in the seq itself.
  - Remove the save/restore of rax/rcx for BPF callers.
- Rename clear_bhb_loop() to clear_bhb_loop_nofence() to make it
  obvious that the LFENCE is not part of the sequence (Borislav).
- Fix Kconfig: s/select/depends on/ HAVE_STATIC_CALL (PeterZ).
- Rebased to v7.0-rc4.

v6: https://lore.kernel.org/r/20251201-vmscape-bhb-v6-0-d610dd515714@linux.intel.com
- Remove semicolon at the end of asm in ALTERNATIVE (Uros).
- Fix build warning in vmscape_select_mitigation() (LKP).
- Rebased to v6.18.

v5: https://lore.kernel.org/r/20251126-vmscape-bhb-v5-2-02d66e423b00@linux.intel.com
- For BHI seq, limit runtime-patching to loop counts only (Dave).
  Dropped 2 patches that moved the BHB seq to a macro.
- Remove redundant switch cases in vmscape_select_mitigation() (Nikolay).
- Improve commit message (Nikolay).
- Collected tags.

v4: https://lore.kernel.org/r/20251119-vmscape-bhb-v4-0-1adad4e69ddc@linux.intel.com
- Move LFENCE to the callsite, out of clear_bhb_loop(). (Dave)
- Make clear_bhb_loop() work for larger BHB. (Dave)
  This now uses hardware enumeration to determine the BHB size to clear.
- Use write_ibpb() instead of indirect_branch_prediction_barrier() when
  IBPB is known to be available. (Dave)
- Use static_call() to simplify mitigation at exit-to-userspace. (Dave)
- Refactor vmscape_select_mitigation(). (Dave)
- Fix vmscape=on which was wrongly behaving as AUTO. (Dave)
- Split the patches. (Dave)
  - Patch 1-4 prepares for making the sequence flexible for VMSCAPE use.
  - Patch 5 trivial rename of variable.
  - Patch 6-8 prepares for deploying BHB mitigation for VMSCAPE.
  - Patch 9 deploys the mitigation.
  - Patch 10-11 fixes ON Vs AUTO mode.

v3: https://lore.kernel.org/r/20251027-vmscape-bhb-v3-0-5793c2534e93@linux.intel.com
- s/x86_pred_flush_pending/x86_predictor_flush_exit_to_user/ (Sean).
- Removed IBPB & BHB-clear mutual exclusion at exit-to-userspace.
- Collected tags.

v2: https://lore.kernel.org/r/20251015-vmscape-bhb-v2-0-91cbdd9c3a96@linux.intel.com
- Added check for IBPB feature in vmscape_select_mitigation(). (David)
- s/vmscape=auto/vmscape=on/ (David)
- Added patch to remove LFENCE from VMSCAPE BHB-clear sequence.
- Rebased to v6.18-rc1.

v1: https://lore.kernel.org/r/20250924-vmscape-bhb-v1-0-da51f0e1934d@linux.intel.com

Hi All,

These patches aim to improve the performance of a recent mitigation for
the VMSCAPE[1] vulnerability. This improvement is relevant for the BHI
variant of VMSCAPE, which affects Alder Lake and newer processors.

The current mitigation approach uses IBPB on kvm-exit-to-userspace for the
entire range of affected CPUs. This is overkill for CPUs that are only
affected by the BHI variant. On such CPUs clearing the branch history is
sufficient for VMSCAPE, and also more apt as the underlying issue is due to
poisoned branch history.
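
The flow can be modeled roughly in userspace as below. This is a sketch
under assumptions, not the series' implementation: a plain function pointer
stands in for the static call, mock_flush_pending stands in for the per-CPU
x86_predictor_flush_exit_to_user flag, and all mock_* names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

unsigned int ibpb_count;
unsigned int bhb_clear_count;

void mock_ibpb(void)      { ibpb_count++; }
void mock_bhb_clear(void) { bhb_clear_count++; }

/* Function pointer standing in for the static call; NULL means mitigation off. */
void (*mock_predictor_flush)(void);

/* Models the per-CPU flag that is set when a guest has run on this CPU. */
bool mock_flush_pending;

/* Chosen once: BHB clear for CPUs hit only by the BHI variant, else IBPB. */
void mock_select_mitigation(bool bhi_variant_only)
{
	mock_predictor_flush = bhi_variant_only ? mock_bhb_clear : mock_ibpb;
}

/* Entering the guest marks the predictor state as needing a flush. */
void mock_vcpu_enter_guest(void)
{
	if (mock_predictor_flush)
		mock_flush_pending = true;
}

/* The flush runs at most once per guest run, on the way back to userspace. */
void mock_exit_to_user_prepare(void)
{
	if (mock_predictor_flush && mock_flush_pending) {
		mock_predictor_flush();
		mock_flush_pending = false;
	}
}
```

Exits to userspace that were not preceded by a guest run skip the flush,
which keeps the cost confined to the KVM path.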

Below is the iPerf data for transfer between guest and host, comparing IBPB
and BHB-clear mitigation. BHB-clear shows performance improvement over IBPB
in most cases.

Platform: Emerald Rapids
Baseline: vmscape=off
Target: IBPB at VMexit-to-userspace Vs the new BHB-clear at
	VMexit-to-userspace mitigation (both compared against baseline).

(pN = N parallel connections)

| iPerf user-net | IBPB    | BHB Clear |
|----------------|---------|-----------|
| UDP 1-vCPU_p1  | -12.5%  |   1.3%    |
| TCP 1-vCPU_p1  | -10.4%  |  -1.5%    |
| TCP 1-vCPU_p1  | -7.5%   |  -3.0%    |
| UDP 4-vCPU_p16 | -3.7%   |  -3.7%    |
| TCP 4-vCPU_p4  | -2.9%   |  -1.4%    |
| UDP 4-vCPU_p4  | -0.6%   |   0.0%    |
| TCP 4-vCPU_p4  |  3.5%   |   0.0%    |

| iPerf bridge-net | IBPB    | BHB Clear |
|------------------|---------|-----------|
| UDP 1-vCPU_p1    | -9.4%   |  -0.4%    |
| TCP 1-vCPU_p1    | -3.9%   |  -0.5%    |
| UDP 4-vCPU_p16   | -2.2%   |  -3.8%    |
| TCP 4-vCPU_p4    | -1.0%   |  -1.0%    |
| TCP 4-vCPU_p4    |  0.5%   |   0.5%    |
| UDP 4-vCPU_p4    |  0.0%   |   0.9%    |
| TCP 1-vCPU_p1    |  0.0%   |   0.9%    |

| iPerf vhost-net | IBPB    | BHB Clear |
|-----------------|---------|-----------|
| UDP 1-vCPU_p1   | -4.3%   |   1.0%    |
| TCP 1-vCPU_p1   | -3.8%   |  -0.5%    |
| TCP 1-vCPU_p1   | -2.7%   |  -0.7%    |
| UDP 4-vCPU_p16  | -0.7%   |  -2.2%    |
| TCP 4-vCPU_p4   | -0.4%   |   0.8%    |
| UDP 4-vCPU_p4   |  0.4%   |  -0.7%    |
| TCP 4-vCPU_p4   |  0.0%   |   0.6%    |

[1] https://comsec.ethz.ch/research/microarch/vmscape-exposing-and-exploiting-incomplete-branch-predictor-isolation-in-cloud-environments/

---
Pawan Gupta (12):
      x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
      x86/bhi: Make clear_bhb_loop() effective on newer CPUs
      x86/bhi: Rename clear_bhb_loop() to clear_bhb_loop_nofence()
      x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user
      x86/vmscape: Move mitigation selection to a switch()
      x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier()
      static_call: Add EXPORT_STATIC_CALL_FOR_MODULES()
      kvm: Define EXPORT_STATIC_CALL_FOR_KVM()
      x86/vmscape: Use static_call() for predictor flush
      x86/vmscape: Deploy BHB clearing mitigation
      x86/vmscape: Resolve conflict between attack-vectors and vmscape=force
      x86/vmscape: Add cmdline vmscape=on to override attack vector controls

 Documentation/admin-guide/hw-vuln/vmscape.rst   | 15 ++++-
 Documentation/admin-guide/kernel-parameters.txt |  6 +-
 arch/x86/Kconfig                                |  1 +
 arch/x86/entry/entry_64.S                       | 21 ++++---
 arch/x86/include/asm/cpufeatures.h              |  2 +-
 arch/x86/include/asm/entry-common.h             | 13 ++--
 arch/x86/include/asm/kvm_types.h                |  1 +
 arch/x86/include/asm/nospec-branch.h            | 15 +++--
 arch/x86/kernel/cpu/bugs.c                      | 84 +++++++++++++++++++++----
 arch/x86/kvm/x86.c                              |  4 +-
 arch/x86/net/bpf_jit_comp.c                     |  4 +-
 include/linux/kvm_types.h                       | 13 +++-
 include/linux/static_call.h                     |  8 +++
 13 files changed, 150 insertions(+), 37 deletions(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: 20250916-vmscape-bhb-d7d469977f2f

Best regards,
--  
Thanks,
Pawan




* Re: [Intel-wired-lan] [PATCH net] idpf: fix double free and use-after-free in aux device error paths
From: Paul Menzel @ 2026-04-14  6:54 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: intel-wired-lan, netdev, linux-kernel, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, stable
In-Reply-To: <2026041116-retail-bagginess-250f@gregkh>

Dear Greg,


Thank you for the patch.

On 11.04.26 at 12:12, Greg Kroah-Hartman wrote:
> When auxiliary_device_add() fails in idpf_plug_vport_aux_dev() or
> idpf_plug_core_aux_dev(), the err_aux_dev_add label calls
> auxiliary_device_uninit() and falls through to err_aux_dev_init.  The
> uninit call will trigger put_device(), which invokes the release
> callback (idpf_vport_adev_release / idpf_core_adev_release) that frees
> iadev.  The fall-through then reads adev->id from the freed iadev for
> ida_free() and double-frees iadev with kfree().
> 
> Free the IDA slot and clear the back-pointer before uninit, while adev
> is still valid, then return immediately.
> 
> Commit 65637c3a1811 65637c3a1811 ("idpf: fix UAF in RDMA core aux dev

The commit hash is pasted twice.

> deinitialization") fixed the same use-after-free in the matching unplug
> path in this file but missed both probe error paths.
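
The ordering issue described in the quoted text can be modeled in userspace
as follows. This is a simplified sketch, not the driver code: every mock_*
type and function is hypothetical, the refcount is reduced to a plain
integer, and the id value 7 is arbitrary. It only shows the fixed error
path, where the id is read and the back-pointer cleared before the final
put frees the container.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_adev {
	int id;
	int refcount;
	void (*release)(struct mock_adev *adev);
};

struct mock_iadev {		/* container that the release callback frees */
	struct mock_adev adev;
};

struct mock_owner {		/* holds the back-pointer to the device */
	struct mock_adev *adev;
};

int ida_free_calls, release_calls, last_freed_id;

static void mock_ida_free(int id)
{
	ida_free_calls++;
	last_freed_id = id;
}

/* Release callback: frees the container, so adev is invalid afterwards. */
static void mock_release(struct mock_adev *adev)
{
	release_calls++;
	free((char *)adev - offsetof(struct mock_iadev, adev));
}

/* Models the uninit step: drop the reference taken at init time. */
static void mock_uninit(struct mock_adev *adev)
{
	if (--adev->refcount == 0)
		adev->release(adev);
}

int mock_plug_aux_dev(struct mock_owner *owner, bool add_fails)
{
	struct mock_iadev *iadev = calloc(1, sizeof(*iadev));
	struct mock_adev *adev;

	if (!iadev)
		return -1;
	adev = &iadev->adev;
	owner->adev = adev;
	adev->id = 7;			/* pretend ID allocation */
	adev->refcount = 1;		/* pretend init succeeded */
	adev->release = mock_release;

	if (add_fails)
		goto err_aux_dev_add;
	return 0;

err_aux_dev_add:
	mock_ida_free(adev->id);	/* read the id while adev is still valid */
	owner->adev = NULL;		/* clear the back-pointer */
	mock_uninit(adev);		/* last put: release frees the container */
	return -1;			/* no fall-through into further cleanup */
}
```

On the failure path the id is released once and the container is freed
once. On success the device stays allocated until a matching unplug step,
which this sketch does not model.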
> 
> Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: stable <stable@kernel.org>
> Fixes: be91128c579c ("idpf: implement RDMA vport auxiliary dev create, init, and destroy")
> Fixes: f4312e6bfa2a ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
> Assisted-by: gregkh_clanker_t1000
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
> Note, these cleanup paths are messy, but I couldn't see a simpler way
> without a lot more rework, so I chose the simple way :)
> 
>   drivers/net/ethernet/intel/idpf/idpf_idc.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_idc.c b/drivers/net/ethernet/intel/idpf/idpf_idc.c
> index 7e4f4ac92653..b7d6b08fc89e 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_idc.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_idc.c
> @@ -90,7 +90,10 @@ static int idpf_plug_vport_aux_dev(struct iidc_rdma_core_dev_info *cdev_info,
>   	return 0;
>   
>   err_aux_dev_add:
> +	ida_free(&idpf_idc_ida, adev->id);
> +	vdev_info->adev = NULL;
>   	auxiliary_device_uninit(adev);
> +	return ret;
>   err_aux_dev_init:
>   	ida_free(&idpf_idc_ida, adev->id);
>   err_ida_alloc:
> @@ -228,7 +231,10 @@ static int idpf_plug_core_aux_dev(struct iidc_rdma_core_dev_info *cdev_info)
>   	return 0;
>   
>   err_aux_dev_add:
> +	ida_free(&idpf_idc_ida, adev->id);
> +	cdev_info->adev = NULL;
>   	auxiliary_device_uninit(adev);
> +	return ret;
>   err_aux_dev_init:
>   	ida_free(&idpf_idc_ida, adev->id);
>   err_ida_alloc:

Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>

gemini/gemini-3.1-pro-preview has two comments [1]. Maybe the driver 
developers could judge their relevance.


Kind regards,

Paul


[1]: https://sashiko.dev/#/patchset/2026041116-retail-bagginess-250f%40gregkh
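
For readers following the ordering argument in the patch description, here
is a minimal userspace sketch of it. All names (fake_adev, fake_ida_free,
fake_uninit, plug_with_failed_add) are hypothetical stand-ins, not the idpf
or auxiliary bus API. The only thing it models is that dropping the last
reference runs the release callback, which frees the object, so the ID and
the back-pointer have to be handled before that call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the auxiliary device; not the idpf type. */
struct fake_adev {
	int id;
	int refcount;
	void (*release)(struct fake_adev *adev);
};

static int ids_freed;		/* how many times an ID was returned */
static int objects_freed;	/* how many times the object was released */

static void fake_ida_free(int id)
{
	(void)id;
	ids_freed++;
}

static void fake_release(struct fake_adev *adev)
{
	objects_freed++;
	free(adev);
}

/* Models the uninit step: dropping the last ref runs release(). */
static void fake_uninit(struct fake_adev *adev)
{
	if (--adev->refcount == 0)
		adev->release(adev);
}

/* Error path for a failed "add" step, ordered as in the patch. */
static int plug_with_failed_add(struct fake_adev **backptr)
{
	struct fake_adev *adev = calloc(1, sizeof(*adev));
	int ret = -1;		/* pretend the add step failed */

	if (!adev)
		return -2;
	adev->id = 7;
	adev->refcount = 1;
	adev->release = fake_release;
	*backptr = adev;

	/* Use adev->id and clear the back-pointer while adev is valid. */
	fake_ida_free(adev->id);
	*backptr = NULL;
	fake_uninit(adev);	/* adev is freed here; do not touch it again */
	return ret;		/* return instead of falling through */
}
```

With this ordering the ID is returned once and the object is freed once. A
fall-through after fake_uninit() would read adev->id from freed memory.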

^ permalink raw reply

* [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-14  6:50 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev

Similar to airoha_qdma_cleanup_rx_queue(), reset the DMA TX descriptors in
airoha_qdma_cleanup_tx_queue(). Moreover, set TX_DMA_IDX to TX_CPU_IDX to
notify the NIC that the QDMA TX ring is empty.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in v2:
- Move q->ndesc initialization to the end of airoha_qdma_init_tx_queue() in
  order to avoid any possible NULL pointer dereference in
  airoha_qdma_cleanup_tx_queue()
- Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
- Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
---
 drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 9e995094c32a..3c1a2bc68c42 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
 	dma_addr_t dma_addr;
 
 	spin_lock_init(&q->lock);
-	q->ndesc = size;
 	q->qdma = qdma;
 	q->free_thr = 1 + MAX_SKB_FRAGS;
 	INIT_LIST_HEAD(&q->tx_list);
 
-	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
+	q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
 				GFP_KERNEL);
 	if (!q->entry)
 		return -ENOMEM;
 
-	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
+	q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
 				      &dma_addr, GFP_KERNEL);
 	if (!q->desc)
 		return -ENOMEM;
 
-	for (i = 0; i < q->ndesc; i++) {
+	for (i = 0; i < size; i++) {
 		u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
 
 		list_add_tail(&q->entry[i].list, &q->tx_list);
 		WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
 	}
+	q->ndesc = size;
 
 	/* xmit ring drop default setting */
 	airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
@@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
 
 static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
 {
-	struct airoha_eth *eth = q->qdma->eth;
-	int i;
+	struct airoha_qdma *qdma = q->qdma;
+	struct airoha_eth *eth = qdma->eth;
+	int i, qid = q - &qdma->q_tx[0];
+	struct airoha_queue_entry *e;
+	u16 index = 0;
 
 	spin_lock_bh(&q->lock);
 	for (i = 0; i < q->ndesc; i++) {
-		struct airoha_queue_entry *e = &q->entry[i];
+		struct airoha_qdma_desc *desc = &q->desc[i];
 
+		e = &q->entry[i];
 		if (!e->dma_addr)
 			continue;
 
@@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
 		e->dma_addr = 0;
 		e->skb = NULL;
 		list_add_tail(&e->list, &q->tx_list);
+
+		/* Reset DMA descriptor */
+		WRITE_ONCE(desc->ctrl, 0);
+		WRITE_ONCE(desc->addr, 0);
+		WRITE_ONCE(desc->data, 0);
+		WRITE_ONCE(desc->msg0, 0);
+		WRITE_ONCE(desc->msg1, 0);
+		WRITE_ONCE(desc->msg2, 0);
+
 		q->queued--;
 	}
+
+	if (!list_empty(&q->tx_list)) {
+		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
+				     list);
+		index = e - q->entry;
+	}
+	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
+	 * empty.
+	 */
+	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
+			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
+	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
+			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
+
 	spin_unlock_bh(&q->lock);
 }
 

---
base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>


^ permalink raw reply related

* Re: [PATCH net] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-14  6:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: andrew+netdev, davem, edumazet, pabeni, linux-arm-kernel,
	linux-mediatek, netdev
In-Reply-To: <20260413225709.3648413-1-kuba@kernel.org>

> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
> 
> This commit resets DMA TX descriptors in airoha_qdma_cleanup_tx_queue
> and synchronizes TX_DMA_IDX to TX_CPU_IDX to notify the hardware that
> the ring is empty.
> 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9285a68f435fe..963ab7b8d166b 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -1044,13 +1044,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> [ ... ]
> >  	spin_lock_bh(&q->lock);
> >  	for (i = 0; i < q->ndesc; i++) {
> > -		struct airoha_queue_entry *e = &q->entry[i];
> > +		struct airoha_qdma_desc *desc = &q->desc[i];
> >  
> > +		e = &q->entry[i];
> 
> Can this result in a NULL pointer dereference?
> 
> If airoha_qdma_init_tx_queue() sets q->ndesc before allocating q->entry and
> q->desc, and devm_kzalloc() fails, the error path jumps to
> airoha_hw_cleanup().
> 
> Because q->ndesc > 0, this cleanup loop would execute and dereference
> q->entry[i] and q->desc[i], which may not have been allocated.

ack, I will fix it in v2.
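
A minimal userspace sketch of the ordering issue raised above, assuming a
cleanup loop that is bounded only by the element count. The names
(demo_queue, demo_queue_init, demo_queue_cleanup) and the fail parameter are
hypothetical and exist only for illustration; this is not the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_queue {
	int *entry;	/* stands in for q->entry */
	size_t ndesc;	/* number of valid entries */
};

/* fail != 0 simulates an allocation failure. */
static int demo_queue_init(struct demo_queue *q, size_t size, int fail)
{
	q->entry = fail ? NULL : calloc(size, sizeof(*q->entry));
	if (!q->entry)
		return -1;	/* q->ndesc is still 0 here */

	q->ndesc = size;	/* publish the count only after success */
	return 0;
}

/* Returns the number of entries visited; safe after a failed init. */
static size_t demo_queue_cleanup(struct demo_queue *q)
{
	size_t i, visited = 0;

	for (i = 0; i < q->ndesc; i++) {
		q->entry[i] = 0;
		visited++;
	}
	free(q->entry);
	q->entry = NULL;
	q->ndesc = 0;
	return visited;
}
```

Because the count stays at zero until the allocation has succeeded, the
cleanup loop never indexes a NULL array after a failed init.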

> 
> [ ... ]
> > +		/* Reset DMA descriptor */
> > +		WRITE_ONCE(desc->ctrl, 0);
> > +		WRITE_ONCE(desc->addr, 0);
> > +		WRITE_ONCE(desc->data, 0);
> > +		WRITE_ONCE(desc->msg0, 0);
> > +		WRITE_ONCE(desc->msg1, 0);
> > +		WRITE_ONCE(desc->msg2, 0);
> 
> Is it safe to modify these DMA descriptors and reset the ring indices below
> without waiting for the hardware DMA engine to become idle?
> 
> In airoha_dev_stop(), the driver clears GLOBAL_CFG_TX_DMA_EN_MASK to request
> the hardware DMA engine to stop, but it does not appear to poll
> GLOBAL_CFG_TX_DMA_BUSY_MASK to ensure it has actually stopped.
> 
> Modifying the DMA descriptors and internal ring indices while the hardware
> is still actively processing in-flight descriptors could potentially lead
> to corrupted DMA fetches or IOMMU faults.

This is already fixed here:
https://lore.kernel.org/netdev/20260411-airoha-fix-ndo_stop-v1-1-caddaa181739@kernel.org/

> 
> >  		q->queued--;
> >  	}
> > +
> > +	e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > +			     list);
> > +	index = e - q->entry;
> > +	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > +	 * empty.
> > +	 */
> > +	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > +			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > +	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > +			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> 
> Could list_first_entry() be called on an empty list here?
> 
> If q->tx_list is empty (for example, if allocation failed in
> airoha_qdma_init_tx_queue or if no entries had a valid dma_addr), calling
> list_first_entry() returns an invalid pointer.
> 
> The subsequent pointer arithmetic would produce a garbage index value,
> which is then written to the REG_TX_CPU_IDX and REG_TX_DMA_IDX registers.

ack, I will fix it in v2.
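
The empty-list case can be sketched in userspace as follows. The list type
and helpers (demo_list, demo_list_empty, demo_first_index) are hypothetical
stand-ins written for this example, not the kernel's list.h. The point is
only that an entry pointer must not be derived from the bare list head.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for a circular doubly linked list head. */
struct demo_list {
	struct demo_list *next, *prev;
};

struct demo_entry {
	int payload;
	struct demo_list list;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h;
	h->prev = h;
}

static int demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Index of the first listed entry, or 0 when the list is empty. */
static unsigned int demo_first_index(struct demo_list *head,
				     struct demo_entry *base)
{
	struct demo_entry *e;

	if (demo_list_empty(head))
		return 0;	/* never derive an entry from the bare head */

	e = (struct demo_entry *)((char *)head->next -
				  offsetof(struct demo_entry, list));
	return (unsigned int)(e - base);
}
```

Without the emptiness check, the pointer arithmetic would be applied to the
head itself and the resulting index would be meaningless.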

Regards,
Lorenzo

> -- 
> pw-bot: cr

^ permalink raw reply

* [PATCH v5.10] netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR
From: Keerthana K @ 2026-04-14  6:32 UTC (permalink / raw)
  To: stable, gregkh
  Cc: pablo, kadlec, fw, davem, edumazet, kuba, pabeni, netfilter-devel,
	coreteam, netdev, linux-kernel, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu,
	Stefano Brivio, Mukul Sikka, Brennan Lamoreaux, Keerthana K

From: Florian Westphal <fw@strlen.de>

commit 07ace0bbe03b3d8e85869af1dec5e4087b1d57b8 upstream

pipapo relies on kmalloc(0) returning ZERO_SIZE_PTR (i.e., not NULL
but pointer is invalid).

Rework this to not call slab allocator when we'd request a 0-byte
allocation.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mukul Sikka <mukul.sikka@broadcom.com>
Signed-off-by: Brennan Lamoreaux <brennan.lamoreaux@broadcom.com>
[Keerthana: In older stable branches (v6.6 and earlier), the allocation logic in
pipapo_clone() still relies on `src->rules` rather than `src->rules_alloc`
(introduced in v6.9 via 9f439bd6ef4f). Consequently, the previously
backported INT_MAX clamping check uses `src->rules`. This patch correctly
moves that `src->rules > (INT_MAX / ...)` check inside the new
`if (src->rules > 0)` block]
Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
---
 net/netfilter/nft_set_pipapo.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index a4fdd1587bb3..83606dfde033 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -524,6 +524,9 @@ static struct nft_pipapo_elem *pipapo_get(const struct net *net,
 	struct nft_pipapo_field *f;
 	int i;
 
+	if (m->bsize_max == 0)
+		return ret;
+
 	res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
 	if (!res_map) {
 		ret = ERR_PTR(-ENOMEM);
@@ -1363,14 +1366,20 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 		       src->bsize * sizeof(*dst->lt) *
 		       src->groups * NFT_PIPAPO_BUCKETS(src->bb));
 
-		if (src->rules > (INT_MAX / sizeof(*src->mt)))
-			goto out_mt;
+		if (src->rules > 0) {
+			if (src->rules > (INT_MAX / sizeof(*src->mt)))
+				goto out_mt;
 
-		dst->mt = kvmalloc(src->rules * sizeof(*src->mt), GFP_KERNEL);
-		if (!dst->mt)
-			goto out_mt;
+			dst->mt = kvmalloc_array(src->rules, sizeof(*src->mt),
+						 GFP_KERNEL);
+			if (!dst->mt)
+				goto out_mt;
+
+			memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
+		} else {
+			dst->mt = NULL;
+		}
 
-		memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
 		src++;
 		dst++;
 	}
-- 
2.43.7
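
As an illustration of the pattern the commit message describes (do not call
the allocator for a 0-byte request, use NULL instead), here is a minimal
userspace sketch. demo_clone() is a hypothetical helper using malloc(), not
the pipapo code; the elem_size == 0 guard is an extra assumption added to
keep the division safe.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Clone an array of n elements. Returns 0 on success, -1 on failure.
 * With n == 0 the allocator is not called and *dst is set to NULL.
 */
static int demo_clone(void **dst, const void *src, size_t n, size_t elem_size)
{
	if (n == 0) {
		*dst = NULL;
		return 0;
	}
	if (elem_size == 0 || n > (size_t)INT_MAX / elem_size)
		return -1;	/* same style of clamp as in the patch */

	*dst = malloc(n * elem_size);
	if (!*dst)
		return -1;
	memcpy(*dst, src, n * elem_size);
	return 0;
}
```

Callers then treat a NULL pointer with a zero count as a valid empty array
and never dereference it.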


^ permalink raw reply related

* [PATCH v2 v5.15-v6.1] netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR
From: Keerthana K @ 2026-04-14  6:31 UTC (permalink / raw)
  To: stable, gregkh
  Cc: pablo, kadlec, fw, davem, edumazet, kuba, pabeni, netfilter-devel,
	coreteam, netdev, linux-kernel, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu,
	Stefano Brivio, Mukul Sikka, Brennan Lamoreaux, Keerthana K

From: Florian Westphal <fw@strlen.de>

commit 07ace0bbe03b3d8e85869af1dec5e4087b1d57b8 upstream

pipapo relies on kmalloc(0) returning ZERO_SIZE_PTR (i.e., not NULL
but pointer is invalid).

Rework this to not call slab allocator when we'd request a 0-byte
allocation.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mukul Sikka <mukul.sikka@broadcom.com>
Signed-off-by: Brennan Lamoreaux <brennan.lamoreaux@broadcom.com>
[Keerthana: In older stable branches (v6.6 and earlier), the allocation logic in
pipapo_clone() still relies on `src->rules` rather than `src->rules_alloc`
(introduced in v6.9 via 9f439bd6ef4f). Consequently, the previously
backported INT_MAX clamping check uses `src->rules`. This patch correctly
moves that `src->rules > (INT_MAX / ...)` check inside the new
`if (src->rules > 0)` block]
Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
---
Changes in v2:
- Fixed patch apply failure

v1: https://lore.kernel.org/all/20260413043247.3327855-1-keerthana.kalyanasundaram@broadcom.com/

 net/netfilter/nft_set_pipapo.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 863162c82330..2072c89a467d 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -525,6 +525,8 @@ static struct nft_pipapo_elem *pipapo_get(const struct net *net,
 	int i;
 
 	m = priv->clone;
+	if (m->bsize_max == 0)
+		return ret;
 
 	res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
 	if (!res_map) {
@@ -1365,14 +1367,20 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 		       src->bsize * sizeof(*dst->lt) *
 		       src->groups * NFT_PIPAPO_BUCKETS(src->bb));
 
-		if (src->rules > (INT_MAX / sizeof(*src->mt)))
-			goto out_mt;
+		if (src->rules > 0) {
+			if (src->rules > (INT_MAX / sizeof(*src->mt)))
+				goto out_mt;
+
+			dst->mt = kvmalloc_array(src->rules, sizeof(*src->mt),
+						 GFP_KERNEL);
+			if (!dst->mt)
+				goto out_mt;
 
-		dst->mt = kvmalloc(src->rules * sizeof(*src->mt), GFP_KERNEL);
-		if (!dst->mt)
-			goto out_mt;
+			memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
+		} else {
+			dst->mt = NULL;
+		}
 
-		memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
 		src++;
 		dst++;
 	}
-- 
2.43.7


^ permalink raw reply related

* Re: [PATCH v2] rose: fix OOB reads on short CLEAR REQUEST frames
From: Eric Dumazet @ 2026-04-14  6:11 UTC (permalink / raw)
  To: Ashutosh Desai
  Cc: netdev, linux-hams, davem, kuba, pabeni, horms, linux-kernel
In-Reply-To: <177614667427.3606651.8700070406932922261@gmail.com>

On Mon, Apr 13, 2026 at 11:04 PM Ashutosh Desai
<ashutoshdesai993@gmail.com> wrote:
>
> rose_process_rx_frame() calls rose_decode() which reads skb->data[2]
> without any prior length check. For CLEAR REQUEST frames the state
> machines then read skb->data[3] and skb->data[4] as the cause and
> diagnostic bytes.
>
> A crafted 3-byte ROSE CLEAR REQUEST frame passes the minimum length
> gate in rose_route_frame() and reaches rose_process_rx_frame(), where
> rose_decode() reads one byte past the header and the state machines
> read two bytes past the valid buffer.
>
> Add a pskb_may_pull(skb, 3) check before rose_decode() to cover its
> skb->data[2] access, and a pskb_may_pull(skb, 5) check afterwards for
> the CLEAR REQUEST path to cover the cause and diagnostic reads.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
> ---
> V1 -> V2: switch skb->len check to pskb_may_pull; also add
>           pskb_may_pull(skb, 3) before rose_decode() to cover its
>           skb->data[2] access
>
> v1: https://lore.kernel.org/netdev/20260409013246.2051746-1-ashutoshdesai993@gmail.com/
>
>  net/rose/rose_in.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/net/rose/rose_in.c b/net/rose/rose_in.c
> index 0276b393f0e5..b9f01a11e2df 100644
> --- a/net/rose/rose_in.c
> +++ b/net/rose/rose_in.c
> @@ -269,8 +269,18 @@ int rose_process_rx_frame(struct sock *sk, struct sk_buff *skb)
>  if (rose->state == ROSE_STATE_0)
>   0;
>
> +if (!pskb_may_pull(skb, 3)) {
> +kfree_skb(skb);
> +return 0;
> +}
> +
>  frametype = rose_decode(skb, &ns, &nr, &q, &d, &m);
>
> +if (frametype == ROSE_CLEAR_REQUEST && !pskb_may_pull(skb, 5)) {
> +kfree_skb(skb);
> +return 0;
> +}
> +
>  switch (rose->state) {
>  case ROSE_STATE_1:
>  ueued = rose_state1_machine(sk, skb, frametype);
> --
> 2.34.1

The callers of rose_process_rx_frame() already call kfree_skb(skb) when it
returns 0, so your patch would add double-frees.
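
A minimal userspace sketch of that ownership convention, with hypothetical
names (demo_buf, demo_process, demo_caller) rather than the rose code. It
assumes that a nonzero return means the callee has taken the buffer, which
is an assumption of this sketch. A return of 0 leaves the buffer with the
caller, so the callee must not free it on that path.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_buf {
	unsigned char *data;
	size_t len;
};

static int demo_frees;	/* counts releases so the test can check for one */

static void demo_free(struct demo_buf *b)
{
	demo_frees++;
	free(b->data);
	b->data = NULL;
}

/*
 * Returns 1 if the frame was consumed, 0 if the caller still owns it.
 * A short frame is rejected without freeing it here.
 */
static int demo_process(struct demo_buf *b)
{
	if (b->len < 5)
		return 0;	/* caller frees; freeing here would double-free */
	demo_free(b);		/* consumed (assumed meaning of nonzero) */
	return 1;
}

/* Caller following the convention described in the review. */
static void demo_caller(struct demo_buf *b)
{
	if (demo_process(b) == 0)
		demo_free(b);
}
```

In this model a short frame is released exactly once, by the caller.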

Your patch is white-space mangled.

Please take a look at Documentation/process/maintainer-netdev.rst

Preparing changes
-----------------

Attention to detail is important.  Re-read your own work as if you were the
reviewer.  You can start with using ``checkpatch.pl``, perhaps even with
the ``--strict`` flag.  But do not be mindlessly robotic in doing so.
If your change is a bug fix, make sure your commit log indicates the
end-user visible symptom, the underlying reason as to why it happens,
and then if necessary, explain why the fix proposed is the best way to
get things done.  Don't mangle whitespace, and as is common, don't
mis-indent function arguments that span multiple lines.  If it is your
first patch, mail it to yourself so you can test apply it to an
unpatched tree to confirm infrastructure didn't mangle it.

Finally, go back and read
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
to be sure you are not repeating some common mistake documented there.

Also:

Indicating target tree
~~~~~~~~~~~~~~~~~~~~~~

To help maintainers and CI bots you should explicitly mark which tree
your patch is targeting. Assuming that you use git, use the prefix
flag::

  git format-patch --subject-prefix='PATCH net-next' start..finish

Use ``net`` instead of ``net-next`` (always lower case) in the above for
bug-fix ``net`` content.

Please

pw-bot: cr

^ permalink raw reply

* [PATCH v2] rose: fix OOB reads on short CLEAR REQUEST frames
From: Ashutosh Desai @ 2026-04-14  6:04 UTC (permalink / raw)
  To: netdev; +Cc: linux-hams, davem, edumazet, kuba, pabeni, horms, linux-kernel

rose_process_rx_frame() calls rose_decode() which reads skb->data[2]
without any prior length check. For CLEAR REQUEST frames the state
machines then read skb->data[3] and skb->data[4] as the cause and
diagnostic bytes.

A crafted 3-byte ROSE CLEAR REQUEST frame passes the minimum length
gate in rose_route_frame() and reaches rose_process_rx_frame(), where
rose_decode() reads one byte past the header and the state machines
read two bytes past the valid buffer.

Add a pskb_may_pull(skb, 3) check before rose_decode() to cover its
skb->data[2] access, and a pskb_may_pull(skb, 5) check afterwards for
the CLEAR REQUEST path to cover the cause and diagnostic reads.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
---
V1 -> V2: switch skb->len check to pskb_may_pull; also add
          pskb_may_pull(skb, 3) before rose_decode() to cover its
          skb->data[2] access

v1: https://lore.kernel.org/netdev/20260409013246.2051746-1-ashutoshdesai993@gmail.com/

 net/rose/rose_in.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/rose/rose_in.c b/net/rose/rose_in.c
index 0276b393f0e5..b9f01a11e2df 100644
--- a/net/rose/rose_in.c
+++ b/net/rose/rose_in.c
@@ -269,8 +269,18 @@ int rose_process_rx_frame(struct sock *sk, struct sk_buff *skb)
 if (rose->state == ROSE_STATE_0)
  0;

+if (!pskb_may_pull(skb, 3)) {
+kfree_skb(skb);
+return 0;
+}
+
 frametype = rose_decode(skb, &ns, &nr, &q, &d, &m);

+if (frametype == ROSE_CLEAR_REQUEST && !pskb_may_pull(skb, 5)) {
+kfree_skb(skb);
+return 0;
+}
+
 switch (rose->state) {
 case ROSE_STATE_1:
 ueued = rose_state1_machine(sk, skb, frametype);
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH net-next v2 3/3] selftests/net: Add additional test coverage in nk_qlease
From: Nikolay Aleksandrov @ 2026-04-14  5:59 UTC (permalink / raw)
  To: Daniel Borkmann, netdev; +Cc: kuba, dw, pabeni
In-Reply-To: <20260413220809.604592-4-daniel@iogearbox.net>

On 4/14/26 01:08, Daniel Borkmann wrote:
> Add further netkit queue-lease coverage for netns lifecycle of the guest
> and physical halves, channel resize across active leases, single-device
> and multi-lessee scenarios, L3 mode operation, lease capacity exhaustion,
> and corner-cases of e.g. queue-create rejection paths. Also make the tests
> more robust by removing the time.sleep(0.1) after netns deletion and turn
> them into a wait loop.
> 
> Full test run:
> 
>   # ./nk_qlease.py
>   TAP version 13
>   1..45
>   ok 1 nk_qlease.test_remove_phys
>   ok 2 nk_qlease.test_double_lease
>   ok 3 nk_qlease.test_virtual_lessor
>   ok 4 nk_qlease.test_phys_lessee
>   ok 5 nk_qlease.test_different_lessors
>   ok 6 nk_qlease.test_queue_out_of_range
>   ok 7 nk_qlease.test_resize_leased
>   ok 8 nk_qlease.test_self_lease
>   ok 9 nk_qlease.test_create_tx_type
>   ok 10 nk_qlease.test_create_primary
>   ok 11 nk_qlease.test_create_limit
>   ok 12 nk_qlease.test_link_flap_phys
>   ok 13 nk_qlease.test_queue_get_virtual
>   ok 14 nk_qlease.test_remove_virt_first
>   ok 15 nk_qlease.test_multiple_leases
>   ok 16 nk_qlease.test_lease_queue_tx_type
>   ok 17 nk_qlease.test_invalid_netns
>   ok 18 nk_qlease.test_invalid_phys_ifindex
>   ok 19 nk_qlease.test_multi_netkit_remove_phys
>   ok 20 nk_qlease.test_single_remove_phys
>   ok 21 nk_qlease.test_link_flap_virt
>   ok 22 nk_qlease.test_phys_queue_no_lease
>   ok 23 nk_qlease.test_same_ns_lease
>   ok 24 nk_qlease.test_resize_after_unlease
>   ok 25 nk_qlease.test_lease_queue_zero
>   ok 26 nk_qlease.test_release_and_reuse
>   ok 27 nk_qlease.test_veth_queue_create
>   ok 28 nk_qlease.test_two_netkits_same_queue
>   ok 29 nk_qlease.test_l3_mode_lease
>   ok 30 nk_qlease.test_single_double_lease
>   ok 31 nk_qlease.test_single_different_lessors
>   ok 32 nk_qlease.test_cross_ns_netns_id
>   ok 33 nk_qlease.test_delete_guest_netns
>   ok 34 nk_qlease.test_move_guest_netns
>   ok 35 nk_qlease.test_resize_phys_no_reduction
>   ok 36 nk_qlease.test_delete_one_netkit_of_two
>   ok 37 nk_qlease.test_bind_rx_leased_phys_queue
>   ok 38 nk_qlease.test_resize_phys_shrink_past_leased
>   ok 39 nk_qlease.test_resize_virt_not_supported
>   ok 40 nk_qlease.test_lease_devices_down
>   ok 41 nk_qlease.test_lease_capacity_exhaustion
>   ok 42 nk_qlease.test_resize_phys_up
>   ok 43 nk_qlease.test_multi_ns_lease
>   ok 44 nk_qlease.test_multi_ns_delete_one
>   ok 45 nk_qlease.test_move_phys_netns
>   # Totals: pass:45 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/testing/selftests/net/nk_qlease.py | 951 ++++++++++++++++++++++-
>  1 file changed, 946 insertions(+), 5 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply

* Re: [PATCH net-next v2 2/3] selftests/net: Split netdevsim tests from HW tests in nk_qlease
From: Nikolay Aleksandrov @ 2026-04-14  5:58 UTC (permalink / raw)
  To: Daniel Borkmann, netdev; +Cc: kuba, dw, pabeni
In-Reply-To: <20260413220809.604592-3-daniel@iogearbox.net>

On 4/14/26 01:08, Daniel Borkmann wrote:
> As pointed out in 3d2c3d2eea9a ("selftests: net: py: explicitly forbid
> multiple ksft_run() calls"), ksft_run() cannot be called multiple times.
> 
> Move the netdevsim-based queue lease tests to selftests/net/ so that
> each file has exactly one ksft_run() call.
> 
> The HW tests (io_uring ZC RX, queue attrs, XDP with MP, destroy) remain
> in selftests/drivers/net/hw/.
> 
> Fixes: 65d657d80684 ("selftests/net: Add queue leasing tests with netkit")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://lore.kernel.org/netdev/20260409181950.7e099b6c@kernel.org
> ---
>  .../selftests/drivers/net/hw/nk_qlease.py     | 1142 ----------------
>  tools/testing/selftests/net/Makefile          |    1 +
>  tools/testing/selftests/net/nk_qlease.py      | 1168 +++++++++++++++++
>  3 files changed, 1169 insertions(+), 1142 deletions(-)
>  create mode 100755 tools/testing/selftests/net/nk_qlease.py
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply

* Re: [PATCH net-next v2 1/3] tools/ynl: Make YnlFamily closeable as a context manager
From: Nikolay Aleksandrov @ 2026-04-14  5:57 UTC (permalink / raw)
  To: Daniel Borkmann, netdev; +Cc: kuba, dw, pabeni
In-Reply-To: <20260413220809.604592-2-daniel@iogearbox.net>

On 4/14/26 01:08, Daniel Borkmann wrote:
> YnlFamily opens an AF_NETLINK socket in __init__ but has no way
> to release it other than leaving it to the GC. YnlFamily holds a
> self reference cycle through SpecFamily's self.family = self
> in its super().__init__() call, so refcount GC cannot reclaim
> it and the socket stays open until the cyclic GC runs.
> 
> If a test creates a guest netns, instantiates a YnlFamily inside
> it via NetNSEnter(), performs some test case work via Ynl, and
> then deletes the netns, then the 'ip netns del' only drops the
> mount binding and cleanup_net in the kernel never runs, so any
> subsequent test case assertions that objects got cleaned up would
> fail given this only gets triggered later via cyclic GC run.
> 
> Add an explicit close() that closes the netlink socket and wire
> up the __enter__/__exit__ so callers can scope the instance
> deterministically via 'with YnlFamily(...) as ynl: ...'.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/net/ynl/pyynl/lib/ynl.py | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/tools/net/ynl/pyynl/lib/ynl.py b/tools/net/ynl/pyynl/lib/ynl.py
> index 9c078599cea0..f63c6f828735 100644
> --- a/tools/net/ynl/pyynl/lib/ynl.py
> +++ b/tools/net/ynl/pyynl/lib/ynl.py
> @@ -731,6 +731,16 @@ class YnlFamily(SpecFamily):
>              bound_f = functools.partial(self._op, op_name)
>              setattr(self, op.ident_name, bound_f)
>  
> +    def close(self):
> +        if self.sock is not None:
> +            self.sock.close()
> +            self.sock = None
> +
> +    def __enter__(self):
> +        return self
> +
> +    def __exit__(self, exc_type, exc, tb):
> +        self.close()
>  
>      def ntf_subscribe(self, mcast_name):
>          mcast_id = self.nlproto.get_mcast_id(mcast_name, self.mcast_groups)

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply

* Re: [PATCH net-next v2 2/2] selftests/bpf: verify syncookie statistics in tcp_custom_syncookie
From: Kuniyuki Iwashima @ 2026-04-14  5:50 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: netdev, Eric Dumazet, Neal Cardwell, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, linux-kernel, bpf, linux-kselftest
In-Reply-To: <20260411013211.225834-2-jiayuan.chen@linux.dev>

On Fri, Apr 10, 2026 at 6:32 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> Add read_tcpext_snmp() helper to network_helpers which reads a
> TcpExt SNMP counter via nstat, and use it in the tcp_custom_syncookie
> test to verify that LINUX_MIB_SYNCOOKIESRECV is incremented and
> LINUX_MIB_SYNCOOKIESFAILED stays unchanged across a successful
> BPF custom syncookie validation.
>
> The delta is captured between start_server() and accept(), which
> covers the full SYN/ACK/cookie-check path for one connection.
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  tools/testing/selftests/bpf/network_helpers.c | 22 +++++++++++++++++++
>  tools/testing/selftests/bpf/network_helpers.h |  1 +
>  .../bpf/prog_tests/tcp_custom_syncookie.c     | 20 +++++++++++++++++

As you touch bpf selftest helper files, please rebase on bpf-next
to avoid possible conflicts and tag bpf-next in the Subject.

Change itself looks good.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

one nit below.


>  3 files changed, 43 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> index b82f572641b7..3388dd5112b6 100644
> --- a/tools/testing/selftests/bpf/network_helpers.c
> +++ b/tools/testing/selftests/bpf/network_helpers.c
> @@ -621,6 +621,28 @@ int get_socket_local_port(int sock_fd)
>         return -1;
>  }
>
> +int read_tcpext_snmp(const char *name, unsigned long *val)
> +{
> +       char cmd[128], buf[128];
> +       int ret = 0;
> +       FILE *f;
> +
> +       snprintf(cmd, sizeof(cmd),
> +                "nstat -az TcpExt%s | awk '/TcpExt/ {print $2}'", name);
> +       f = popen(cmd, "r");
> +       if (!f)
> +               return -errno;
> +
> +       if (!fgets(buf, sizeof(buf), f)) {
> +               ret = ferror(f) ? -errno : -ENODATA;
> +               goto out;
> +       }
> +       *val = strtoul(buf, NULL, 10);
> +out:
> +       pclose(f);
> +       return ret;
> +}
> +
>  int get_hw_ring_size(char *ifname, struct ethtool_ringparam *ring_param)
>  {
>         struct ifreq ifr = {0};
> diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
> index 79a010c88e11..c53cd781df6e 100644
> --- a/tools/testing/selftests/bpf/network_helpers.h
> +++ b/tools/testing/selftests/bpf/network_helpers.h
> @@ -84,6 +84,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
>                   struct sockaddr_storage *addr, socklen_t *len);
>  char *ping_command(int family);
>  int get_socket_local_port(int sock_fd);
> +int read_tcpext_snmp(const char *name, unsigned long *val);
>  int get_hw_ring_size(char *ifname, struct ethtool_ringparam *ring_param);
>  int set_hw_ring_size(char *ifname, struct ethtool_ringparam *ring_param);
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
> index eaf441dc7e79..6adfb4b892f8 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
> @@ -91,12 +91,21 @@ static void transfer_message(int sender, int receiver)
>
>  static void create_connection(struct test_tcp_custom_syncookie_case *test_case)
>  {
> +       unsigned long recv_before, recv_after;
> +       unsigned long failed_before, failed_after;

While at it, please keep reverse xmas tree order


>         int server, client, child;
>
>         server = start_server(test_case->family, test_case->type, test_case->addr, 0, 0);
>         if (!ASSERT_NEQ(server, -1, "start_server"))
>                 return;
>
> +       if (!ASSERT_OK(read_tcpext_snmp("SyncookiesRecv", &recv_before),
> +                      "read SyncookiesRecv before"))
> +               goto close_server;
> +       if (!ASSERT_OK(read_tcpext_snmp("SyncookiesFailed", &failed_before),
> +                      "read SyncookiesFailed before"))
> +               goto close_server;
> +
>         client = connect_to_fd(server, 0);
>         if (!ASSERT_NEQ(client, -1, "connect_to_fd"))
>                 goto close_server;
> @@ -105,9 +114,20 @@ static void create_connection(struct test_tcp_custom_syncookie_case *test_case)
>         if (!ASSERT_NEQ(child, -1, "accept"))
>                 goto close_client;
>
> +       if (!ASSERT_OK(read_tcpext_snmp("SyncookiesRecv", &recv_after),
> +                      "read SyncookiesRecv after"))
> +               goto close_child;
> +       if (!ASSERT_OK(read_tcpext_snmp("SyncookiesFailed", &failed_after),
> +                      "read SyncookiesFailed after"))
> +               goto close_child;
> +
> +       ASSERT_EQ(recv_after - recv_before, 1, "SyncookiesRecv delta");
> +       ASSERT_EQ(failed_after - failed_before, 0, "SyncookiesFailed delta");
> +
>         transfer_message(client, child);
>         transfer_message(child, client);
>
> +close_child:
>         close(child);
>  close_client:
>         close(client);
> --
> 2.43.0
>


* Re: [PATCH net-next v2 1/2] net: add missing syncookie statistics for BPF custom syncookies
From: Kuniyuki Iwashima @ 2026-04-14  5:38 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: netdev, Eric Dumazet, Neal Cardwell, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Andrii Nakryiko, Eduard Zingerman, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, linux-kernel, bpf, linux-kselftest
In-Reply-To: <20260411013211.225834-1-jiayuan.chen@linux.dev>

On Fri, Apr 10, 2026 at 6:32 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> 1. Replace IS_ENABLED(CONFIG_BPF) with CONFIG_BPF_SYSCALL for
>    cookie_bpf_ok() and cookie_bpf_check(). CONFIG_BPF is selected by
>    CONFIG_NET unconditionally, so IS_ENABLED(CONFIG_BPF) is always
>    true and provides no real guard. CONFIG_BPF_SYSCALL is the correct
>    config for BPF program functionality.
>
> 2. Remove the CONFIG_BPF_SYSCALL guard around struct bpf_tcp_req_attrs.
>    This struct is referenced by bpf_sk_assign_tcp_reqsk() in
>    net/core/filter.c which is compiled unconditionally, so wrapping
>    the definition in a config guard could cause build failures when
>    CONFIG_BPF_SYSCALL=n.
>
> 3. Fix mismatched declaration of cookie_bpf_check() between the
>    CONFIG_BPF_SYSCALL and stub paths: the real definition takes
>    'struct net *net' but the declaration in the header did not.
>    Add the net parameter to the declaration and all call sites.
>
> 4. Add missing LINUX_MIB_SYNCOOKIESRECV and LINUX_MIB_SYNCOOKIESFAILED
>    statistics in cookie_bpf_check(), so that BPF custom syncookie
>    validation is accounted for in SNMP counters just like the
>    non-BPF path.
>
> Compile-tested with CONFIG_BPF_SYSCALL=y and CONFIG_BPF_SYSCALL
> not set.
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
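
Points 3 and 4 of the quoted commit message can be illustrated with a small user-space sketch. It is not the kernel code: DEMO_CONFIG_BPF_SYSCALL, struct demo_net, struct demo_req and demo_cookie_check() are hypothetical stand-ins, and the counter placement (one "recv" bump on a valid cookie, one "failed" bump on an invalid one) is an assumption modelled on the description above. What it shows is that the real path and the stub path share one signature, so every call site compiles either way.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Kconfig symbol; set to 0 to build the stub. */
#ifndef DEMO_CONFIG_BPF_SYSCALL
#define DEMO_CONFIG_BPF_SYSCALL 1
#endif

/* Hypothetical per-namespace counters, standing in for the SNMP MIB. */
struct demo_net {
	unsigned long syncookies_recv;
	unsigned long syncookies_failed;
};

/* Hypothetical request object carrying the validation result. */
struct demo_req {
	bool cookie_valid;
};

#if DEMO_CONFIG_BPF_SYSCALL
/* Real path: account every validation attempt in the counters. */
static inline const struct demo_req *
demo_cookie_check(struct demo_net *net, const struct demo_req *req)
{
	if (!req->cookie_valid) {
		net->syncookies_failed++;
		return NULL;
	}
	net->syncookies_recv++;
	return req;
}
#else
/* Stub path: same parameter list as the real function, no side effects. */
static inline const struct demo_req *
demo_cookie_check(struct demo_net *net, const struct demo_req *req)
{
	(void)net;
	(void)req;
	return NULL;
}
#endif
```

Keeping the two prototypes textually identical is the design point: a mismatch only shows up when the less common configuration is built.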


* Re: [PATCH net v2 1/1] af_unix: Reject SIOCATMARK on non-stream sockets
From: Kuniyuki Iwashima @ 2026-04-14  5:33 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, davem, edumazet, kuba, pabeni, horms, rao.shoaib,
	yifanwucs, tomapufckgml, yuantan098, bird, enjou1224z,
	wangjiexun2025
In-Reply-To: <20260413122916.1479959-1-n05ec@lzu.edu.cn>

On Mon, Apr 13, 2026 at 5:29 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Jiexun Wang <wangjiexun2025@gmail.com>
>
> SIOCATMARK reports whether the receive queue is at the urgent mark for
> MSG_OOB.
>
> In AF_UNIX, MSG_OOB is supported only for SOCK_STREAM sockets.
> SOCK_DGRAM and SOCK_SEQPACKET reject MSG_OOB in sendmsg() and recvmsg(),
> so they should not support SIOCATMARK either.
>
> Return -EOPNOTSUPP for non-stream sockets before checking the receive
> queue.
>
> Fixes: 314001f0bf92 ("af_unix: Add OOB support")
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
> Suggested-by: Xin Liu <bird@lzu.edu.cn>

Please read this guideline again.
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by

Co-developed-by is not where you mention someone who
developed a tool to find a bug, and Suggested-by is not where
you mention someone who funds your research.
https://lore.kernel.org/netdev/7c26a74d-90c5-4520-a10a-22f06e098b86@gmail.com/

When you just copy my fix and modify the commit message,
the two tags are inappropriate.


> Tested-by: Ren Wei <enjou1224z@gmail.com>
> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
> Changes in v2:
> - Rework the fix based on maintainer feedback.
> - Drop the receive-queue locking approach and reject SIOCATMARK on
>   non-stream sockets instead, since it is only meaningful for MSG_OOB.
> - V1 link: https://lore.kernel.org/netdev/f6cbbc8da90e95584847b5ceb60aae830d1631c2.1775731983.git.wangjiexun2025@gmail.com/
>
>  net/unix/af_unix.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index b23c33df8b46..09d43b4813b1 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -3300,6 +3300,9 @@ static int unix_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
>                         struct sk_buff *skb;
>                         int answ = 0;
>
> +                       if (sk->sk_type != SOCK_STREAM)
> +                               return -EOPNOTSUPP;
> +
>                         mutex_lock(&u->iolock);
>
>                         skb = skb_peek(&sk->sk_receive_queue);
> --
> 2.34.1
>


* [PATCH v4] nfc: hci: fix out-of-bounds read in HCP header parsing
From: Ashutosh Desai @ 2026-04-14  5:24 UTC (permalink / raw)
  To: netdev; +Cc: kuba, edumazet, davem, pabeni, horms, linux-kernel

nfc_hci_recv_from_llc() and nci_hci_data_received_cb() cast skb->data
to struct hcp_packet and read the message header byte without checking
that enough data is present in the linear sk_buff area. A malicious NFC
peer can send a 1-byte HCP frame that passes through the SHDLC layer
and reaches these functions, causing an out-of-bounds heap read.

Fix this by adding pskb_may_pull() before each cast to ensure the full
2-byte HCP header is pulled into the linear area before it is accessed.

Fixes: 8b8d2e08bf0d ("NFC: HCI support")
Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
---
V3 -> V4: add Fixes tags
V2 -> V3: drop redundant checks from nfc_hci_msg_rx_work/nci_hci_msg_rx_work,
          remove incorrect Suggested-by tag
V1 -> V2: switch skb->len check to pskb_may_pull

v3: https://lore.kernel.org/netdev/20260413024329.3293075-1-ashutoshdesai993@gmail.com/
v2: https://lore.kernel.org/netdev/20260409150825.2217133-1-ashutoshdesai993@gmail.com/
v1: https://lore.kernel.org/netdev/20260408223113.2009304-1-ashutoshdesai993@gmail.com/

 net/nfc/hci/core.c | 5 +++++
 net/nfc/nci/hci.c  | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/net/nfc/hci/core.c b/net/nfc/hci/core.c
index 0d33c81a15fe..cd9cf6c94a50 100644
--- a/net/nfc/hci/core.c
+++ b/net/nfc/hci/core.c
@@ -904,6 +904,11 @@ static void nfc_hci_recv_from_llc(struct nfc_hci_dev *hdev, struct sk_buff *skb)
 	 * unblock waiting cmd context. Otherwise, enqueue to dispatch
 	 * in separate context where handler can also execute command.
 	 */
+	if (!pskb_may_pull(hcp_skb, NFC_HCI_HCP_HEADER_LEN)) {
+		kfree_skb(hcp_skb);
+		return;
+	}
+
 	packet = (struct hcp_packet *)hcp_skb->data;
 	type = HCP_MSG_GET_TYPE(packet->message.header);
 	if (type == NFC_HCI_HCP_RESPONSE) {
diff --git a/net/nfc/nci/hci.c b/net/nfc/nci/hci.c
index 40ae8e5a7ec7..6e633da257d1 100644
--- a/net/nfc/nci/hci.c
+++ b/net/nfc/nci/hci.c
@@ -482,6 +482,11 @@ void nci_hci_data_received_cb(void *context,
 	 * unblock waiting cmd context. Otherwise, enqueue to dispatch
 	 * in separate context where handler can also execute command.
 	 */
+	if (!pskb_may_pull(hcp_skb, NCI_HCI_HCP_HEADER_LEN)) {
+		kfree_skb(hcp_skb);
+		return;
+	}
+
 	packet = (struct nci_hcp_packet *)hcp_skb->data;
 	type = NCI_HCP_MSG_GET_TYPE(packet->message.header);
 	if (type == NCI_HCI_HCP_RESPONSE) {
-- 
2.34.1
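
The length check described in the commit message above can be sketched in user space as follows. This is only an illustration under stated assumptions: demo_hcp_msg_type(), DEMO_HCP_HEADER_LEN and DEMO_HCP_MSG_GET_TYPE() are hypothetical names, the two-byte layout (packet header byte followed by message header byte) is taken from the description, and the "type in the top two bits" encoding is an assumption. Unlike pskb_may_pull(), the sketch works on a flat buffer and does not model pulling paged data into the linear area.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: byte 0 is the packet header, byte 1 the message header. */
#define DEMO_HCP_HEADER_LEN 2

/* Assumed encoding: the message type occupies the top two bits. */
#define DEMO_HCP_MSG_GET_TYPE(header) (((header) & 0xc0) >> 6)

/*
 * Store the message type in *type and return true only if the buffer
 * holds a complete header. A short frame is rejected before buf[1] is
 * touched, which is the out-of-bounds read the patch guards against.
 */
static bool demo_hcp_msg_type(const uint8_t *buf, size_t len, uint8_t *type)
{
	if (len < DEMO_HCP_HEADER_LEN)
		return false;

	*type = DEMO_HCP_MSG_GET_TYPE(buf[1]);
	return true;
}
```

A 1-byte frame is dropped by the first check, while a 2-byte frame whose second byte is 0x80 yields type 2.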


* Re: [PATCH net] net: usb: cdc_ncm: reject negative chained NDP offsets
From: Greg Kroah-Hartman @ 2026-04-14  4:23 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Oliver Neukum, linux-usb, netdev, linux-kernel, Oliver Neukum,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, stable
In-Reply-To: <87wlyavnl3.fsf@miraculix.mork.no>

On Mon, Apr 13, 2026 at 06:20:40PM +0200, Bjørn Mork wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> > On Mon, Apr 13, 2026 at 02:11:50PM +0200, Oliver Neukum wrote:
> >> On 13.04.26 12:43, Greg Kroah-Hartman wrote:
> >> > On Mon, Apr 13, 2026 at 10:36:19AM +0200, Oliver Neukum wrote:
> >> > > 
> >> > > 
> >> > > On 11.04.26 12:53, Greg Kroah-Hartman wrote:
> >> > > > cdc_ncm_rx_fixup() reads dwNextNdpIndex from each NDP32 to chain to the
> >> > > > next one.  The 32-bit value from the device is stored into the signed
> >> > > > int ndpoffset so that means values with the high bit set become
> >> > > 
> >> > > Well, then isn't the problem rather that you should not store an
> >> > > unsigned value in a signed variable?
> >> > 
> >> > No.  well, yes.  but no.
> >> > 
> >> > cdc_ncm_rx_verify_nth16() returns an int, and is negative if something
> >> > went wrong, so we need it that way, and then we need to check it, like
> >> > we properly do at the top of the loop, it's just that at the bottom of
> >> > the loop we also need to do the same exact thing.
> >> 
> >> Doesn't that suggest that cdc_ncm_rx_verify_nth16() is the problem?
> >> To be precise, the way it indicates errors?
> >> As this is an offset into a buffer and the header must be at the start
> >> of the buffer, isn't 0 the natural indication of an error?
> >
> > Maybe?  I really don't know, sorry, parsing the cdc_ncm buffer is not
> > something I looked too deeply into :)
> 
> Oliver is correct AFAICS. These functions could use 0 to indicate
> errors.  This would make the code simpler and cleaner.
> 
> The negative error return is just a sloppy choice I made at a time we
> only supported the 16bit versions.  Didn't anticipate 32bit support
> since it is optional and pointless.  But as usual, hardware vendors do
> surprising things.
> 
> Note that cdc_mbim.c must be updated if cdc_ncm_rx_verify_nth16() is
> changed.

Ok thanks for the background, I'll rework this after the merge window is
over.

greg k-h
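
For readers following the signed/unsigned discussion above, here is a minimal user-space sketch of the two points being made. The names demo_store_in_int() and demo_checked_next() are hypothetical and this is not the driver code. The first function shows why a 32-bit device value with the high bit set turns negative once stored in a signed int. The second shows the alternative raised in the thread: validate in the unsigned domain and use 0 as the "stop or error" value.

```c
#include <stdint.h>

/*
 * Storing a device-supplied 32-bit value in a signed variable.
 * Out-of-range conversion is implementation-defined before C23;
 * gcc and clang wrap, so values with the high bit set go negative.
 */
static int32_t demo_store_in_int(uint32_t next_index)
{
	return (int32_t)next_index;
}

/*
 * Validate the next offset without leaving the unsigned domain.
 * Returns the offset if an ndp_len-byte header fits inside the
 * buffer at that position, or 0 to signal "stop or error".
 */
static uint32_t demo_checked_next(uint32_t next_index, uint32_t buf_len,
				  uint32_t ndp_len)
{
	if (next_index > buf_len || buf_len - next_index < ndp_len)
		return 0;

	return next_index;
}
```

With the unsigned check there is no negative range to forget about, which is the simplification suggested in the thread.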


* [PATCH 5.10.y] Revert "wifi: cfg80211: stop NAN and P2P in cfg80211_leave"
From: guocai.he.cn @ 2026-04-14  4:03 UTC (permalink / raw)
  To: gregkh
  Cc: stable, johannes.berg, netdev, regressions,
	miriam.rachel.korenblit, linux-kernel

From: Guocai He <guocai.he.cn@windriver.com>

This reverts commit d91240f24e831d3bd36954599ada6b456fb1bd0a which is commit
e1696c8bd0056bc1a5f7766f58ac333adc203e8a upstream.

The reverted patch introduced a deadlock. The locking situation in mainline is
totally different, so it is incorrect to directly backport the commit from mainline.

Signed-off-by: Guocai He <guocai.he.cn@windriver.com>
---
 net/wireless/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/wireless/core.c b/net/wireless/core.c
index cc2093f75468..3b25b78896a2 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -1207,10 +1207,8 @@ void __cfg80211_leave(struct cfg80211_registered_device *rdev,
 		/* must be handled by mac80211/driver, has no APIs */
 		break;
 	case NL80211_IFTYPE_P2P_DEVICE:
-		cfg80211_stop_p2p_device(rdev, wdev);
-		break;
 	case NL80211_IFTYPE_NAN:
-		cfg80211_stop_nan(rdev, wdev);
+		/* cannot happen, has no netdev */
 		break;
 	case NL80211_IFTYPE_AP_VLAN:
 	case NL80211_IFTYPE_MONITOR:
-- 
2.34.1



* [PATCH net v2] net: reduce RFS/ARFS flow updates by checking LLC affinity
From: Chuang Wang @ 2026-04-14  3:59 UTC (permalink / raw)
  Cc: Chuang Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, Krishna Kumar, Neal Cardwell,
	Willem de Bruijn, netdev, linux-kernel

The current implementation of rps_record_sock_flow() updates the flow
table every time a socket is processed on a different CPU. In high-load
scenarios, especially with Accelerated RFS (ARFS), this triggers
frequent flow steering updates via ndo_rx_flow_steer.

For drivers like mlx5 that implement hardware flow steering, these
constant updates lead to significant contention on internal driver locks
(e.g., arfs_lock). This contention often becomes a performance
bottleneck that outweighs the steering benefits.

This patch introduces a cache-aware update strategy: the flow record is
only updated if the flow migrates across Last Level Cache (LLC)
boundaries. This minimizes expensive hardware reconfigurations while
preserving cache locality for the application. A new sysctl,
net.core.rps_feat_llc_affinity, is added to toggle this feature.

Performance Test Results:
The patch was tested in a K8s environment (AMD CPU 128*2, 16-core Pod
with CPU pinning, mlx5 NIC) using brpc[1] echo_server and rpc_press.

rpc_press Commands:

  for i in {1..8}; do
    ./rpc_press -proto=./echo.proto -method=example.EchoService.Echo
    -server=<IP>:8000 -input='{"message":"hello"}'
    -qps=0 -thread_num=512 -connection_type=pooled &
  done

Monitor mlx5e_rx_flow_steer frequency:

  /usr/share/bcc/tools/funccount -i 1 mlx5e_rx_flow_steer

Frequency of mlx5e_rx_flow_steer (via funccount[2]):

  Before: ~335,000 counts/sec
  After:   ~23,000 counts/sec (reduced by ~93%)

System Metrics (after enabling rps_feat_llc_affinity):

  CPU Utilization: 38% -> 32%
  CPU PSI (Pressure Stall Information): 20% -> 10%

These results demonstrate that filtering updates by LLC affinity
significantly reduces driver lock contention and improves overall
CPU efficiency under heavy network load.

[1] https://github.com/apache/brpc/
[2] https://github.com/iovisor/bcc/blob/master/tools/funccount.py

Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
---
v1 -> v2: add rps_feat_llc_affinity; add brpc tests

 include/net/rps.h          | 18 ++--------
 net/core/dev.c             | 72 ++++++++++++++++++++++++++++++++++++++
 net/core/sysctl_net_core.c | 34 ++++++++++++++++++
 3 files changed, 108 insertions(+), 16 deletions(-)

diff --git a/include/net/rps.h b/include/net/rps.h
index e33c6a2fa8bb..37bbb7009c36 100644
--- a/include/net/rps.h
+++ b/include/net/rps.h
@@ -12,6 +12,7 @@
 
 extern struct static_key_false rps_needed;
 extern struct static_key_false rfs_needed;
+extern struct static_key_false rps_feat_llc_affinity;
 
 /*
  * This structure holds an RPS map which can be of variable length.  The
@@ -55,22 +56,7 @@ struct rps_sock_flow_table {
 
 #define RPS_NO_CPU 0xffff
 
-static inline void rps_record_sock_flow(rps_tag_ptr tag_ptr, u32 hash)
-{
-	unsigned int index = hash & rps_tag_to_mask(tag_ptr);
-	u32 val = hash & ~net_hotdata.rps_cpu_mask;
-	struct rps_sock_flow_table *table;
-
-	/* We only give a hint, preemption can change CPU under us */
-	val |= raw_smp_processor_id();
-
-	table = rps_tag_to_table(tag_ptr);
-	/* The following WRITE_ONCE() is paired with the READ_ONCE()
-	 * here, and another one in get_rps_cpu().
-	 */
-	if (READ_ONCE(table[index].ent) != val)
-		WRITE_ONCE(table[index].ent, val);
-}
+void rps_record_sock_flow(rps_tag_ptr tag_ptr, u32 hash);
 
 static inline void _sock_rps_record_flow_hash(__u32 hash)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 203dc36aaed5..630a7f21d8de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4964,6 +4964,8 @@ struct static_key_false rps_needed __read_mostly;
 EXPORT_SYMBOL(rps_needed);
 struct static_key_false rfs_needed __read_mostly;
 EXPORT_SYMBOL(rfs_needed);
+struct static_key_false rps_feat_llc_affinity __read_mostly;
+EXPORT_SYMBOL(rps_feat_llc_affinity);
 
 static u32 rfs_slot(u32 hash, rps_tag_ptr tag_ptr)
 {
@@ -5175,6 +5177,76 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return cpu;
 }
 
+/**
+ * rps_record_cond - Determine if RPS flow table should be updated
+ * @old_val: Previous flow record value
+ * @new_val: Target flow record value
+ *
+ * Returns true if the record needs an update.
+ */
+static inline bool rps_record_cond(u32 old_val, u32 new_val)
+{
+	u32 old_cpu = old_val & ~net_hotdata.rps_cpu_mask;
+	u32 new_cpu = new_val & ~net_hotdata.rps_cpu_mask;
+
+	if (old_val == new_val)
+		return false;
+
+	/*
+	 * RPS LLC Affinity Feature:
+	 * Reduce RFS/ARFS flow updates by checking LLC affinity.
+	 *
+	 * Frequent flow table updates can trigger constant hardware steering
+	 * reconfigurations (e.g., ndo_rx_flow_steer), leading to significant
+	 * contention on driver internal locks (like mlx5's arfs_lock).
+	 *
+	 * This strategy only updates the flow record if it migrates across LLC
+	 * boundaries. This minimizes expensive hardware updates while preserving
+	 * cache locality for the application.
+	 */
+	if (static_branch_unlikely(&rps_feat_llc_affinity)) {
+		/* Force update if the recorded CPU is invalid or has gone offline */
+		if (old_cpu >= nr_cpu_ids || !cpu_active(old_cpu))
+			return true;
+
+		/*
+		 * Force an update if the current task is no longer permitted
+		 * to run on the old_cpu.
+		 */
+		if (!cpumask_test_cpu(old_cpu, current->cpus_ptr))
+			return true;
+
+		/*
+		 * If CPUs do not share a cache, allow the update to prevent
+		 * expensive remote memory accesses and cache misses.
+		 */
+		if (!cpus_share_cache(old_cpu, new_cpu))
+			return true;
+
+		return false;
+	}
+
+	return true;
+}
+
+void rps_record_sock_flow(rps_tag_ptr tag_ptr, u32 hash)
+{
+	unsigned int index = hash & rps_tag_to_mask(tag_ptr);
+	u32 val = hash & ~net_hotdata.rps_cpu_mask;
+	struct rps_sock_flow_table *table;
+
+	/* We only give a hint, preemption can change CPU under us */
+	val |= raw_smp_processor_id();
+
+	table = rps_tag_to_table(tag_ptr);
+	/* The following WRITE_ONCE() is paired with the READ_ONCE()
+	 * here, and another one in get_rps_cpu().
+	 */
+	if (rps_record_cond(READ_ONCE(table[index].ent), val))
+		WRITE_ONCE(table[index].ent, val);
+}
+EXPORT_SYMBOL(rps_record_sock_flow);
+
 #ifdef CONFIG_RFS_ACCEL
 
 /**
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 502705e04649..dbc99aea7bb0 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -210,6 +210,32 @@ static int rps_sock_flow_sysctl(const struct ctl_table *table, int write,
 	kvfree_rcu_mightsleep(tofree);
 	return ret;
 }
+
+static int rps_feat_llc_affinity_sysctl(const struct ctl_table *table, int write,
+					void *buffer, size_t *lenp, loff_t *ppos)
+{
+	u8 curr_state;
+	int ret;
+	const struct ctl_table tmp = {
+		.data = &curr_state,
+		.maxlen = sizeof(curr_state),
+		.mode = table->mode,
+		.extra1 = table->extra1,
+		.extra2 = table->extra2
+	};
+
+	curr_state = static_branch_unlikely(&rps_feat_llc_affinity) ? 1 : 0;
+
+	ret = proc_dou8vec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (write && ret == 0) {
+		if (curr_state && !static_branch_unlikely(&rps_feat_llc_affinity))
+			static_branch_enable(&rps_feat_llc_affinity);
+		else if (!curr_state && static_branch_unlikely(&rps_feat_llc_affinity))
+			static_branch_disable(&rps_feat_llc_affinity);
+	}
+
+	return ret;
+}
 #endif /* CONFIG_RPS */
 
 #ifdef CONFIG_NET_FLOW_LIMIT
@@ -531,6 +557,14 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= rps_sock_flow_sysctl
 	},
+	{
+		.procname	= "rps_feat_llc_affinity",
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler   = rps_feat_llc_affinity_sysctl,
+		.extra1     = SYSCTL_ZERO,
+		.extra2     = SYSCTL_ONE
+	},
 #endif
 #ifdef CONFIG_NET_FLOW_LIMIT
 	{
-- 
2.47.3
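
The update decision described in the commit message can be sketched in user space as below. Everything prefixed demo_ is hypothetical, including the fixed topology of 8 CPUs per LLC, and the cpu_active() and cpus_ptr checks from the patch are left out. Note that the sketch extracts the CPU id with `val & cpu_mask`, following how rps_record_sock_flow() composes the value (hash bits outside the mask, CPU id inside it); the posted hunk masks with `~net_hotdata.rps_cpu_mask` instead, which appears to select the hash bits and may deserve a second look.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical topology: every 8 consecutive CPUs share one LLC. */
#define DEMO_CPUS_PER_LLC 8u

static bool demo_cpus_share_cache(uint32_t a, uint32_t b)
{
	return a / DEMO_CPUS_PER_LLC == b / DEMO_CPUS_PER_LLC;
}

/*
 * Decide whether the flow table entry should be rewritten.
 * old_val and new_val hold hash bits outside cpu_mask and a CPU id
 * inside it. With llc_affinity off, any change triggers an update.
 * With it on, the update is skipped while the flow stays within the
 * same LLC domain.
 */
static bool demo_record_cond(uint32_t old_val, uint32_t new_val,
			     uint32_t cpu_mask, uint32_t nr_cpus,
			     bool llc_affinity)
{
	uint32_t old_cpu = old_val & cpu_mask;
	uint32_t new_cpu = new_val & cpu_mask;

	if (old_val == new_val)
		return false;

	if (!llc_affinity)
		return true;

	/* An out-of-range recorded CPU always forces an update. */
	if (old_cpu >= nr_cpus)
		return true;

	return !demo_cpus_share_cache(old_cpu, new_cpu);
}
```

With a mask of 0xff and 32 CPUs, a move from CPU 3 to CPU 5 is filtered out when the feature is on, while a move from CPU 3 to CPU 9 still updates the entry.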



* Re: [PATCH v11 net-next 5/7] octeontx2-af: npc: cn20k: add subbank search order control
From: Ratheesh Kannoth @ 2026-04-14  3:46 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, linux-kernel, linux-rdma, sgoutham, andrew+netdev, davem,
	edumazet, kuba, donald.hunter, horms, jiri, chuck.lever, matttbe,
	cjubran, saeedm, leon, tariqt, mbloch, dtatulea
In-Reply-To: <b9ffa72d-ebe2-4fd1-b668-93620f206179@redhat.com>

On 2026-04-13 at 18:26:00, Paolo Abeni (pabeni@redhat.com) wrote:
> > +	xa_for_each(&npc_priv.xa_sb_free, index, v) {
> > +		val = xa_to_value(v);
> > +		fslots[fcnt][0] = index;
> > +		fslots[fcnt][1] = val;
> > +		xa_erase(&npc_priv.xa_sb_free, index);
> > +		fcnt++;
> > +	}
> > +
> > +	/* xa_store() is done under lock. If xa_store fails
> > +	 * ,no rollback is planned as it might also fail.
>
> Why do you need to go through an erase-and-add loop? Why can't you directly
> xa_store() the new value? Note that xa_store() can fail due to memory
> pressure.
>
> Avoiding the previous erase will prevent deallocation and reallocation
> and will avoid any reasonable xa_store() failure.
ACK.

>
> AFAICS there are a few more items reported by sashiko, please have a look:
>
> https://sashiko.dev/#/patchset/20260409025055.1664053-1-rkannoth%40marvell.com
>
> /P
>

Patch 1: [PATCH v11 net-next 1/7] octeontx2-af: npc: cn20k: debugfs enhancements

>"+static u64 dstats[MAX_NUM_BANKS][MAX_SUBBANK_DEPTH * MAX_NUM_SUB_BANKS] = {};
>Is it safe to use a static global array here? If multiple RVU AF devices exist
>in the system, it looks like they might share this array and overwrite each
>other's hardware statistics."
There is only a single RVU AF device in the system, so this comment can be
ignored as a false positive.

>"Also, since there are no locks, could concurrent reads of the debugfs file by
>multiple threads cause a data race and corrupt the stored statistics?"
ACK. Added a mutex.

>"Since en_map is added to the global npc_priv structure, is it ever explicitly
>cleared during device initialization?"
npc_priv is defined as static, so all fields are zero-initialized by default.
Anyway, added a bitmap clear to get rid of this AI review warning.

>"Does updating dstats inline during the seq_file show callback break the
>statistics if the buffer overflows?"
ACK. Used the seq open with size() API to set the required buffer size at the start.
Added a check for buffer overflow by counting what was written into the
seq_file. dstats are "delta stats" and are used for debugging.

>"Does updating dstats inline during the seq_file show callback break the
>statistics if the buffer overflows?"
ACK.

>"When seq_read detects the overflow, it allocates a larger buffer and calls
>this show callback again from the beginning. During the retry,
>stats == dstats[bank][idx] will evaluate to true because the array was
>already updated, which skips all the previously overflowing entries."
ACK.

>"This isn't a bug, but using the ternary operator for its void side-effects
>is generally avoided in the kernel. Could this be written using an if/else
>block, or the assign_bit() macro instead?"
ACK.

>"Since en_map is added to the global npc_priv structure, is it ever explicitly
>cleared during device initialization?
>If the PCI device is unbound and rebound via sysfs, could en_map retain bits
>from the previous lifecycle and cause the new mismatch debugfs file to falsely
>report old enabled rules as mismatched?"
ACK. Reset bitmap to zero on deinit.

Patch 2: [PATCH v11 net-next 2/7] net/mlx5e: heap-allocate devlink param values

>Could we avoid the dynamic allocation entirely? Instead of storing all four
>values at once, could a single union devlink_param_value be declared inside
>the loop, populated, and immediately assigned to the respective config field?
>This would reduce the stack usage to a single union, avoiding both the frame
>size warning and the new memory allocation failure path."
ACK.

Patch 3: [PATCH v11 net-next 3/7] devlink: Change function syntax.

>"The commit message notes that exported APIs will be updated to pass the
>unions by pointer. Is this incomplete? Exported APIs like
>devl_param_driverinit_value_set and the validate callback in struct
>devlink_param still appear to take the union by value, which might continue
>to push the large structure onto the stack during validation."
ACK.

>"This isn't a bug, but it looks like an automated search and replace
>accidentally modified the word "values" in this comment."
ACK.

Patch 5: [PATCH v11 net-next 5/7] octeontx2-af: npc: cn20k: add subbank search order control

>Does this code overflow uslots[]?
>Looking at npc_priv_init(), num_subbanks is populated directly from the
>hardware register NPC_AF_CONST2. While it ensures the value is a power of 2,
>there doesn't seem to be a check that restricts it to MAX_NUM_SUB_BANKS.
>If the hardware reports a value larger than 32, ucnt could exceed the
>bounds of the stack-allocated uslots array.
ACK.

>Can this approach cause a permanent leak of the erased subbanks?
>If any xa_store() call fails, the function jumps to fail_used and returns
>without rolling back the state. Since the subbanks were already erased from
>the xarray into the local stack arrays, failing to re-insert them leaves the
>hardware allocator state partially empty and corrupts it for the lifetime of
>the system.
>Would it be safer to pre-allocate with xa_reserve(), or build new xarrays and
>swap them in on success?
Now using xa_store() without xa_erase(). If xa_store() fails due to memory pressure, the
error is unrecoverable, since it can happen partway through the for loop.

Patch 6: [PATCH v11 net-next 6/7] octeontx2-af: npc: cn20k: dynamically allocate and free default MCAM entries

>"The NIX_LF_DONT_FREE_DFT_IDXS flag is read directly from the mailbox message
>structure. Since mailbox messages can be initiated by Virtual Functions, what
>prevents an unprivileged VF from repeatedly allocating and freeing LFs with
>this flag set to exhaust the hardware MCAM entries?"
rvu_mbox_handler_nix_lf_alloc() calls npc_cn20k_dft_rules_alloc(), which allocates
default entries if and only if they are not already present. So if a VF calls alloc
and free LF without setting the flag, the default entries won't be allocated or freed.

>Are the values in ptr[] virtual or physical MCAM indices? It appears that
>npc_cn20k_dft_rules_idx_get() retrieves virtual indices, but they are used
>here as direct indices into mcam->entry2pfvf_map and passed to
>npc_mcam_clear_bit(). Since those structures are sized for physical indices,
>could this cause an out-of-bounds memory corruption or an integer underflow?
Default entries are always allocated with the "ref_entry" field set in struct
npc_mcam_alloc_entry_req, in which case rvu_mbox_handler_npc_mcam_alloc_entry() won't
return a virtual MCAM index.

>If xa_erase() fails above and returns NULL, ptr[i] is not cleared and the
>code falls through to the free_rules label. Will this result in
>unconditionally calling npc_cn20k_idx_free() on the stale index, potentially
>causing a double-free?
ACK.

>Furthermore, if a VF manually frees its default MCAM rules via the
>NPC_MCAM_FREE_ENTRY mailbox command before this NIX LF teardown occurs,
>npc_cn20k_idx_free() will be called during that manual free. Since the manual
>free does not remove the index from xa_pf2dfl_rmap, could this teardown path
>fetch the same index and attempt to free it again?
Default MCAM rules are allocated in rvu_mbox_handler_nix_lf_alloc(), not through
NPC_MCAM_FREE_ENTRY. If a VF frees them intentionally, that is a violation. We have a
dev_err() there, and it needs to be debugged on the user side.

>Does the caller of this function properly handle negative error codes?
>For example, in npc_enadis_default_mce_entry() and
>npc_enadis_default_entries(), the returned index is passed directly to
>npc_enable_mcam_entry() and nix_update_mce_list() without checking for a
>negative value. This could lead to a WARN(1) in npc_enable_mcam_entry() or an
>out-of-bounds write in nix_update_mce_list().
We made this change intentionally to find the flows that pass a wrong MCAM index,
so we want the splat from WARN(1).

>Here, index is a physical index from the bitmap iteration, but the values
>returned into dft_idxs[] by npc_cn20k_dft_rules_idx_get() are virtual
>indices. Will this comparison always fail, causing default rules to be
>erroneously physically freed?
No. Default indexes are not virtual; this is ensured at allocation time.

>Additionally, if the NIX LF is freed with NIX_LF_DONT_FREE_DFT_IDXS to
>preserve default rules, the ownership mapping is cleared here.
ACK.

>Upon
>re-allocation, npc_cn20k_dft_rules_alloc() will detect the rules in
>xa_pf2dfl_rmap but won't restore the ownership in entry2pfvf_map, meaning
>subsequent operations on these rules will fail verification.
ACK.

>Does this make the firmware layout dependent on the internal size of
>ikpu_action_entries?
Yes.
>If future kernel versions add new packet kinds and increase the size of
>this array, older firmware files will fail this bounds check and be rejected.
struct npc_kpu_profile_fwdata does not have a field to indicate the size of ikpu_action_entries.
We can't modify the structure, as that would break backward compatibility with old firmware.

>Will this trigger a compiler warning or build failure on strict builds?
>The min() macro performs strict type checking, and fw_kpu->entries appears
>to be a signed int, while rvu->hw->npc_kpu_entries is an unsigned u16.
ACK.

>Could a negative value in fw_kpu->entries cause an integer underflow here?
>If fw_kpu->entries is read from untrusted firmware as a negative value, the
>offset calculation can underflow the size_t offset variable.
>This would bypass the subsequent bounds check because the wrapped offset
>plus hdr_sz wraps again to a small positive value.
>On the next iteration, calculating fw_kpu = fw->data + offset could result
>in an out-of-bounds memory read.
Added a check to return on an invalid value.
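
The underflow concern quoted above can be illustrated with a short user-space sketch. demo_advance() is a hypothetical helper, not the driver function: it refuses a negative entry count before any arithmetic is done in size_t, and it bounds the count by division so the multiplication cannot wrap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Advance *offset by entries * entry_sz, but only if the count is
 * non-negative and the result stays within total bytes. On failure
 * *offset is left untouched and false is returned.
 */
static bool demo_advance(size_t *offset, int32_t entries, size_t entry_sz,
			 size_t total)
{
	/* A negative count would wrap to a huge value once cast to size_t. */
	if (entries < 0 || *offset > total)
		return false;

	/* Bound by division so the multiplication below cannot overflow. */
	if (entry_sz != 0 && (size_t)entries > (total - *offset) / entry_sz)
		return false;

	*offset += (size_t)entries * entry_sz;
	return true;
}
```

With a 100-byte blob, an offset of 16 and 8-byte entries, a count of 10 is accepted and moves the offset to 96, while counts of 11 or -1 are rejected.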

>Does modifying profile->kpu here corrupt the global default profile state?
>Earlier in the flow, profile->kpu is initialized to point to the global
>static array npc_kpu_profiles. Allocating device-managed memory into
>profile->kpu[kpu].cam2 overwrites this global state with device-specific
>pointers.
>When the device is unbound and the memory is freed, could this leave dangling
>pointers in the global array for other RVU devices in the system? The same
>applies to the legacy firmware parsing path where cam[entry] is overwritten.

We are not using profile->kpu after unbind, when the memory is freed. During reinit, these
fields are initialized again, so there is no issue with it.

>Could this printk formatter read past the end of the profile name?
>The name array in the firmware header is 32 bytes. If a user provides a
>firmware file with exactly 32 non-null characters, the string will lack a
>null terminator.
>Printing this with %s can leak adjacent heap memory contents into the kernel
>log. Using %.32s would ensure the read stays within bounds.
ACK.
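
The %.32s suggestion quoted above can be tried in user space with snprintf(), which follows the same precision rule for %s. demo_format_name() and DEMO_NAME_LEN are hypothetical names; the 32-byte field size comes from the review comment. With a precision given, the formatter reads at most that many bytes, so a name field that is completely filled and has no NUL terminator is still printed safely.

```c
#include <stddef.h>
#include <stdio.h>

/* Size of the fixed name field, as stated in the review comment. */
#define DEMO_NAME_LEN 32

/*
 * Format a fixed-size, possibly unterminated name field.
 * The precision caps the read at DEMO_NAME_LEN bytes, so no
 * terminator is required inside name[].
 */
static int demo_format_name(char *out, size_t out_len,
			    const char name[DEMO_NAME_LEN])
{
	return snprintf(out, out_len, "profile: %.*s", DEMO_NAME_LEN, name);
}
```

Shorter, properly terminated names print unchanged, since the precision is only an upper bound.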

>Do these fields require an endianness conversion before use?
>The 16-bit values like dp0, dp1, and dp2 are read directly from the firmware
>blob.
>If the firmware payload uses little-endian byte order, applying these
>directly to hardware registers could result in misprogramming on big-endian
>architectures. Would it be safer to use le16_to_cpu() here?
The software is validated only for little endian, as the HW is little endian. If big
endian is required, we will provide separate firmware for it.
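
For context on the endianness question, a byte-wise decode gives the same result on any host. The sketch below is a user-space illustration with a hypothetical helper name, demo_get_le16(); in the kernel, le16_to_cpu() (mentioned in the review comment) and get_unaligned_le16() serve this purpose. It assumes the firmware blob stores 16-bit fields in little-endian order, as the review comment supposes.

```c
#include <stdint.h>

/*
 * Decode a 16-bit little-endian field from a byte buffer.
 * Assembling the value from individual bytes makes the result
 * independent of the host's byte order and of alignment.
 */
static uint16_t demo_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```

On a little-endian host this compiles down to a plain load, so the portable form costs nothing there.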
