BPF List
* [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64
@ 2025-12-17 23:35 Puranjay Mohan
  2025-12-17 23:35 ` [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
	Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
Changes in V1->V2:
- Patch 2:
	- Wrap the RMW accesses in preempt_disable()/preempt_enable() to
	  mitigate race conditions: with CONFIG_PREEMPT_RCU and sleepable
	  bpf programs, preemption could otherwise lead to no prog being
	  executed.

BPF programs detect recursion using a per-CPU 'active' counter in struct
bpf_prog. The trampoline currently increments/decrements this counter
with atomic operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which adds unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64.
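
At a high level, patch 2 replaces the single per-CPU counter with four
per-context u8 counters that are updated without atomics. A rough
sketch of the detection fast path (condensed from patch 2; it omits the
preempt_disable()/preempt_enable() added in v2):

	u8 rctx = interrupt_context_level();		/* 0 = normal ... 3 = NMI */
	u8 *active = this_cpu_ptr(prog->active);	/* u8[BPF_NR_CONTEXTS] */

	active[rctx]++;					/* plain, non-atomic increment */
	if (get_unaligned_le32(active) != BIT(rctx * 8))
		return false;				/* recursion detected */
	return true;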

It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms. Catalin added a fix in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations") to solve this issue, but it seems to have
caused a regression in the fentry benchmark.

Using the fentry benchmark from the bpf selftests shows the following:

  ./tools/testing/selftests/bpf/bench trig-fentry

 +---------------------------------------------+------------------------+
 |               Configuration                 | Total Operations (M/s) |
 +---------------------------------------------+------------------------+
 | bpf-next/master with Catalin’s fix reverted |         51.862         |
 |---------------------------------------------|------------------------|
 | bpf-next/master                             |         43.067         |
 | bpf-next/master with this change            |         53.856         |
 +---------------------------------------------+------------------------+

All benchmarks were run on a KVM-based VM with Neoverse-V2 and 8 CPUs.

This patch yields a 25% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.

For completeness, this benchmark was also run with the change enabled on
x86-64, which resulted in a 30% regression in the fentry benchmark. So,
it is only enabled on arm64.

[1] https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/

Puranjay Mohan (2):
  bpf: move recursion detection logic to helpers
  bpf: arm64: Optimize recursion detection by not using atomics

 include/linux/bpf.h      | 39 ++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/core.c        |  3 ++-
 kernel/bpf/trampoline.c  |  8 ++++----
 kernel/trace/bpf_trace.c |  4 ++--
 4 files changed, 46 insertions(+), 8 deletions(-)


base-commit: ec439c38013550420aecc15988ae6acb670838c1
-- 
2.47.3



* [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers
  2025-12-17 23:35 [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
@ 2025-12-17 23:35 ` Puranjay Mohan
  2025-12-18 17:44   ` Yonghong Song
  2025-12-17 23:35 ` [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
  2025-12-18  2:52 ` [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
  2 siblings, 1 reply; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
	Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

BPF programs detect recursion by doing an atomic inc/dec on a per-cpu
active counter from the trampoline. Create two helpers for operations on
this active counter; this makes it easy to change the recursion
detection logic in the future.

No functional change intended.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h      | 10 ++++++++++
 kernel/bpf/trampoline.c  |  8 ++++----
 kernel/trace/bpf_trace.c |  4 ++--
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bb3847caeae1..2da986136d26 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2004,6 +2004,16 @@ struct bpf_struct_ops_common_value {
 	enum bpf_struct_ops_state state;
 };
 
+static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
+{
+	return this_cpu_inc_return(*(prog->active)) == 1;
+}
+
+static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
+{
+	this_cpu_dec(*(prog->active));
+}
+
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
 /* This macro helps developer to register a struct_ops type and generate
  * type information correctly. Developers should use this macro to register
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 976d89011b15..2a125d063e62 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -949,7 +949,7 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
 
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		if (prog->aux->recursion_detected)
 			prog->aux->recursion_detected(prog);
@@ -993,7 +993,7 @@ static void notrace __bpf_prog_exit_recur(struct bpf_prog *prog, u64 start,
 	bpf_reset_run_ctx(run_ctx->saved_run_ctx);
 
 	update_prog_stats(prog, start);
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 	rcu_read_unlock_migrate();
 }
 
@@ -1029,7 +1029,7 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
 
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		if (prog->aux->recursion_detected)
 			prog->aux->recursion_detected(prog);
@@ -1044,7 +1044,7 @@ void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
 	bpf_reset_run_ctx(run_ctx->saved_run_ctx);
 
 	update_prog_stats(prog, start);
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 	migrate_enable();
 	rcu_read_unlock_trace();
 }
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..6e076485bf70 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2063,7 +2063,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
 	struct bpf_trace_run_ctx run_ctx;
 
 	cant_sleep();
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		goto out;
 	}
@@ -2077,7 +2077,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
 
 	bpf_reset_run_ctx(old_run_ctx);
 out:
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 }
 
 #define UNPACK(...)			__VA_ARGS__
-- 
2.47.3



* [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
  2025-12-17 23:35 [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
  2025-12-17 23:35 ` [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
@ 2025-12-17 23:35 ` Puranjay Mohan
  2025-12-18 17:55   ` Yonghong Song
  2025-12-18  2:52 ` [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
  2 siblings, 1 reply; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
	Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

BPF programs detect recursion using a per-CPU 'active' counter in struct
bpf_prog. The trampoline currently increments/decrements this counter
with atomic operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which adds unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element.  After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.

For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
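
For reference, since interrupt_context_level() returns 0 in normal
context and up to 3 in NMI context, the check 'val != BIT(rctx * 8)'
in the code below compares against the following expected values after
a single increment:

	rctx 0 (normal)     0x00000001
	rctx 1 (soft-irq)   0x00000100
	rctx 2 (hard-irq)   0x00010000
	rctx 3 (NMI)        0x01000000

Any other observed value means another counter is non-zero
(inter-context recursion) or this context's counter exceeded 1
(same-context recursion).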

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h | 33 ++++++++++++++++++++++++++++++---
 kernel/bpf/core.c   |  3 ++-
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2da986136d26..5ca2a761d9a1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -31,6 +31,7 @@
 #include <linux/static_call.h>
 #include <linux/memcontrol.h>
 #include <linux/cfi.h>
+#include <linux/unaligned.h>
 #include <asm/rqspinlock.h>
 
 struct bpf_verifier_env;
@@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
 	struct bpf_map __rcu *st_ops_assoc;
 };
 
+#define BPF_NR_CONTEXTS        4       /* normal, softirq, hardirq, NMI */
+
 struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
@@ -1772,7 +1775,7 @@ struct bpf_prog {
 		u8 tag[BPF_TAG_SIZE];
 	};
 	struct bpf_prog_stats __percpu *stats;
-	int __percpu		*active;
+	u8 __percpu		*active;	/* u8[BPF_NR_CONTEXTS] for recursion protection */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
@@ -2006,12 +2009,36 @@ struct bpf_struct_ops_common_value {
 
 static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
 {
-	return this_cpu_inc_return(*(prog->active)) == 1;
+#ifdef CONFIG_ARM64
+	u8 rctx = interrupt_context_level();
+	u8 *active = this_cpu_ptr(prog->active);
+	u32 val;
+
+	preempt_disable();
+	active[rctx]++;
+	val = get_unaligned_le32(active);
+	preempt_enable();
+	if (val != BIT(rctx * 8))
+		return false;
+
+	return true;
+#else
+	return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
+#endif
 }
 
 static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
 {
-	this_cpu_dec(*(prog->active));
+#ifdef CONFIG_ARM64
+	u8 rctx = interrupt_context_level();
+	u8 *active = this_cpu_ptr(prog->active);
+
+	preempt_disable();
+	active[rctx]--;
+	preempt_enable();
+#else
+	this_cpu_dec(*(int __percpu *)(prog->active));
+#endif
 }
 
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c66316e32563..b5063acfcf92 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 		vfree(fp);
 		return NULL;
 	}
-	fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
+					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
 	if (!fp->active) {
 		vfree(fp);
 		kfree(aux);
-- 
2.47.3



* Re: [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64
  2025-12-17 23:35 [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
  2025-12-17 23:35 ` [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
  2025-12-17 23:35 ` [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
@ 2025-12-18  2:52 ` Puranjay Mohan
  2 siblings, 0 replies; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-18  2:52 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Wed, Dec 17, 2025 at 11:36 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
> Changes in V1->V2:
> - Patch 2:
>         - Put preempt_enable()/disable() around RMW accesses to mitigate
>           race conditions. Because on CONFIG_PREEMPT_RCU and sleepable
>           bpf programs, preemption can cause no prog to execute.
>
> BPF programs detect recursion using a per-CPU 'active' flag in struct
> bpf_prog. The trampoline currently sets/clears this flag with atomic
> operations.
>
> On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
> operations are relatively slow. Unlike x86_64 - where per-CPU updates
> can avoid cross-core atomicity, arm64 LSE atomics are always atomic
> across all cores, which is unnecessary overhead for strictly per-CPU
> state.
>
> This patch removes atomics from the recursion detection path on arm64.
>
> It was discovered in [1] that per-CPU atomics that don't return a value
> were extremely slow on some arm64 platforms, Catalin added a fix in
> commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
> per-CPU atomic operations") to solve this issue, but it seems to have
> caused a regression on the fentry benchmark.
>
> Using the fentry benchmark from the bpf selftests shows the following:
>
>   ./tools/testing/selftests/bpf/bench trig-fentry
>
>  +---------------------------------------------+------------------------+
>  |               Configuration                 | Total Operations (M/s) |
>  +---------------------------------------------+------------------------+
>  | bpf-next/master with Catalin’s fix reverted |         51.862         |
>  |---------------------------------------------|------------------------|
>  | bpf-next/master                             |         43.067         |
>  | bpf-next/master with this change            |         53.856         |
>  +---------------------------------------------+------------------------+
>
> All benchmarks were run on a KVM based vm with Neoverse-V2 and 8 cpus.
>


Here is some more data about other attach types:

+-----------------+-----------+-----------+----------+
|     Metric      |  Before   |   After   | % Diff   |
+-----------------+-----------+-----------+----------+
| fentry          |   43.149  |   53.948  | +25.03%  |
| fentry.s        |   41.831  |   50.937  | +21.76%  |
| rawtp           |   50.834  |   58.731  | +15.53%  |
| fexit           |   31.118  |   34.360  | +10.42%  |
| tp              |   39.536  |   41.632  |  +5.30%  |
| syscall-count   |    8.053  |    8.305  |  +3.13%  |
| fmodret         |   33.940  |   34.769  |  +2.44%  |
| kprobe          |    9.970  |    9.998  |  +0.28%  |
| usermode-count  |  224.886  |  224.839  |  -0.02%  |
| kernel-count    |  154.229  |  153.043  |  -0.77%  |
+-----------------+-----------+-----------+----------+


* Re: [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers
  2025-12-17 23:35 ` [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
@ 2025-12-18 17:44   ` Yonghong Song
  0 siblings, 0 replies; 8+ messages in thread
From: Yonghong Song @ 2025-12-18 17:44 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, kernel-team, Catalin Marinas,
	Will Deacon, Mark Rutland, linux-arm-kernel



On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> BPF programs detect recursion by doing atomic inc/dec on a per-cpu
> active counter from the trampoline. Create two helpers for operations on
> this active counter, this makes it easy to changes the recursion
> detection logic in future.
>
> This change makes no functional changes.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Yonghong Song <yonghong.song@linux.dev>



* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
  2025-12-17 23:35 ` [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
@ 2025-12-18 17:55   ` Yonghong Song
  2025-12-19 16:40     ` Puranjay Mohan
  2025-12-19 18:23     ` Puranjay Mohan
  0 siblings, 2 replies; 8+ messages in thread
From: Yonghong Song @ 2025-12-18 17:55 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, kernel-team, Catalin Marinas,
	Will Deacon, Mark Rutland, linux-arm-kernel



On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> BPF programs detect recursion using a per-CPU 'active' flag in struct
> bpf_prog. The trampoline currently sets/clears this flag with atomic
> operations.
>
> On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
> operations are relatively slow. Unlike x86_64 - where per-CPU updates
> can avoid cross-core atomicity, arm64 LSE atomics are always atomic
> across all cores, which is unnecessary overhead for strictly per-CPU
> state.
>
> This patch removes atomics from the recursion detection path on arm64 by
> changing 'active' to a per-CPU array of four u8 counters, one per
> context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
> non-atomic increment/decrement on its element.  After increment,
> recursion is detected by reading the array as a u32 and verifying that
> only the expected element changed; any change in another element
> indicates inter-context recursion, and a value > 1 in the same element
> indicates same-context recursion.
>
> For example, starting from {0,0,0,0}, a normal-context trigger changes
> the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
> the program, the array becomes {1,0,0,1}. When the NMI context checks
> the u32 against the expected mask for normal (0x00000001), it observes
> 0x01000001 and correctly reports recursion. Same-context recursion is
> detected analogously.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

LGTM with a few nits below.

Acked-by: Yonghong Song <yonghong.song@linux.dev>

> ---
>   include/linux/bpf.h | 33 ++++++++++++++++++++++++++++++---
>   kernel/bpf/core.c   |  3 ++-
>   2 files changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 2da986136d26..5ca2a761d9a1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -31,6 +31,7 @@
>   #include <linux/static_call.h>
>   #include <linux/memcontrol.h>
>   #include <linux/cfi.h>
> +#include <linux/unaligned.h>
>   #include <asm/rqspinlock.h>
>   
>   struct bpf_verifier_env;
> @@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
>   	struct bpf_map __rcu *st_ops_assoc;
>   };
>   
> +#define BPF_NR_CONTEXTS        4       /* normal, softirq, hardirq, NMI */
> +
>   struct bpf_prog {
>   	u16			pages;		/* Number of allocated pages */
>   	u16			jited:1,	/* Is our filter JIT'ed? */
> @@ -1772,7 +1775,7 @@ struct bpf_prog {
>   		u8 tag[BPF_TAG_SIZE];
>   	};
>   	struct bpf_prog_stats __percpu *stats;
> -	int __percpu		*active;
> +	u8 __percpu		*active;	/* u8[BPF_NR_CONTEXTS] for rerecursion protection */
>   	unsigned int		(*bpf_func)(const void *ctx,
>   					    const struct bpf_insn *insn);
>   	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
> @@ -2006,12 +2009,36 @@ struct bpf_struct_ops_common_value {
>   
>   static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>   {
> -	return this_cpu_inc_return(*(prog->active)) == 1;
> +#ifdef CONFIG_ARM64
> +	u8 rctx = interrupt_context_level();
> +	u8 *active = this_cpu_ptr(prog->active);
> +	u32 val;
> +
> +	preempt_disable();
> +	active[rctx]++;
> +	val = get_unaligned_le32(active);

The 'active' pointer is already aligned to 8 (or 4 with my suggestion
below). get_unaligned_le32() works, but maybe we could use le32_to_cpu()
instead. Maybe there is no performance difference between
get_unaligned_le32() and le32_to_cpu(), so you picked get_unaligned_le32()?
It would be good to clarify in the commit message if get_unaligned_le32()
is used.

> +	preempt_enable();
> +	if (val != BIT(rctx * 8))
> +		return false;
> +
> +	return true;
> +#else
> +	return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
> +#endif
>   }
>   
>   static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
>   {
> -	this_cpu_dec(*(prog->active));
> +#ifdef CONFIG_ARM64
> +	u8 rctx = interrupt_context_level();
> +	u8 *active = this_cpu_ptr(prog->active);
> +
> +	preempt_disable();
> +	active[rctx]--;
> +	preempt_enable();
> +#else
> +	this_cpu_dec(*(int __percpu *)(prog->active));
> +#endif
>   }
>   
>   #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index c66316e32563..b5063acfcf92 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
>   		vfree(fp);
>   		return NULL;
>   	}
> -	fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> +	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> +					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));

Here, the alignment is 8. Can it be 4, since the above reads a 32-bit value?

>   	if (!fp->active) {
>   		vfree(fp);
>   		kfree(aux);



* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
  2025-12-18 17:55   ` Yonghong Song
@ 2025-12-19 16:40     ` Puranjay Mohan
  2025-12-19 18:23     ` Puranjay Mohan
  1 sibling, 0 replies; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-19 16:40 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Thu, Dec 18, 2025 at 5:56 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> > BPF programs detect recursion using a per-CPU 'active' flag in struct
> > bpf_prog. The trampoline currently sets/clears this flag with atomic
> > operations.
> >
> > On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
> > operations are relatively slow. Unlike x86_64 - where per-CPU updates
> > can avoid cross-core atomicity, arm64 LSE atomics are always atomic
> > across all cores, which is unnecessary overhead for strictly per-CPU
> > state.
> >
> > This patch removes atomics from the recursion detection path on arm64 by
> > changing 'active' to a per-CPU array of four u8 counters, one per
> > context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
> > non-atomic increment/decrement on its element.  After increment,
> > recursion is detected by reading the array as a u32 and verifying that
> > only the expected element changed; any change in another element
> > indicates inter-context recursion, and a value > 1 in the same element
> > indicates same-context recursion.
> >
> > For example, starting from {0,0,0,0}, a normal-context trigger changes
> > the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
> > the program, the array becomes {1,0,0,1}. When the NMI context checks
> > the u32 against the expected mask for normal (0x00000001), it observes
> > 0x01000001 and correctly reports recursion. Same-context recursion is
> > detected analogously.
> >
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>
> LGTM with a few nits below.
>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>
> > ---
> >   include/linux/bpf.h | 33 ++++++++++++++++++++++++++++++---
> >   kernel/bpf/core.c   |  3 ++-
> >   2 files changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 2da986136d26..5ca2a761d9a1 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -31,6 +31,7 @@
> >   #include <linux/static_call.h>
> >   #include <linux/memcontrol.h>
> >   #include <linux/cfi.h>
> > +#include <linux/unaligned.h>
> >   #include <asm/rqspinlock.h>
> >
> >   struct bpf_verifier_env;
> > @@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
> >       struct bpf_map __rcu *st_ops_assoc;
> >   };
> >
> > +#define BPF_NR_CONTEXTS        4       /* normal, softirq, hardirq, NMI */
> > +
> >   struct bpf_prog {
> >       u16                     pages;          /* Number of allocated pages */
> >       u16                     jited:1,        /* Is our filter JIT'ed? */
> > @@ -1772,7 +1775,7 @@ struct bpf_prog {
> >               u8 tag[BPF_TAG_SIZE];
> >       };
> >       struct bpf_prog_stats __percpu *stats;
> > -     int __percpu            *active;
> > +     u8 __percpu             *active;        /* u8[BPF_NR_CONTEXTS] for rerecursion protection */
> >       unsigned int            (*bpf_func)(const void *ctx,
> >                                           const struct bpf_insn *insn);
> >       struct bpf_prog_aux     *aux;           /* Auxiliary fields */
> > @@ -2006,12 +2009,36 @@ struct bpf_struct_ops_common_value {
> >
> >   static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >   {
> > -     return this_cpu_inc_return(*(prog->active)) == 1;
> > +#ifdef CONFIG_ARM64
> > +     u8 rctx = interrupt_context_level();
> > +     u8 *active = this_cpu_ptr(prog->active);
> > +     u32 val;
> > +
> > +     preempt_disable();
> > +     active[rctx]++;
> > +     val = get_unaligned_le32(active);
>
> The 'active' already aligned with 8 (or 4 with my below suggestion).
> The get_unaligned_le32() works, but maybe we could use le32_to_cpu()
> instead. Maybe there is no performance difference between
> get_unaligned_le32() and le32_to_cpu() so you pick get_unaligned_le32()?
> It would be good to clarify in commit message if get_unaligned_le32()
> is used.
>
> > +     preempt_enable();
> > +     if (val != BIT(rctx * 8))
> > +             return false;
> > +
> > +     return true;
> > +#else
> > +     return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
> > +#endif
> >   }
> >
> >   static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> >   {
> > -     this_cpu_dec(*(prog->active));
> > +#ifdef CONFIG_ARM64
> > +     u8 rctx = interrupt_context_level();
> > +     u8 *active = this_cpu_ptr(prog->active);
> > +
> > +     preempt_disable();
> > +     active[rctx]--;
> > +     preempt_enable();
> > +#else
> > +     this_cpu_dec(*(int __percpu *)(prog->active));
> > +#endif
> >   }
> >
> >   #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index c66316e32563..b5063acfcf92 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> >               vfree(fp);
> >               return NULL;
> >       }
> > -     fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > +     fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > +                                     bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
>
> Here, the alignment is 8. Can it be 4 since the above reads a 32bit value?

Yes, it should be 4. Will change it in the next version and add your Acked-by.
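
i.e. something along the lines of (untested, just to illustrate the
intended change):

	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 4,
					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));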

Thanks,
Puranjay


* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
  2025-12-18 17:55   ` Yonghong Song
  2025-12-19 16:40     ` Puranjay Mohan
@ 2025-12-19 18:23     ` Puranjay Mohan
  1 sibling, 0 replies; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-19 18:23 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Thu, Dec 18, 2025 at 5:56 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> > BPF programs detect recursion using a per-CPU 'active' flag in struct
> > bpf_prog. The trampoline currently sets/clears this flag with atomic
> > operations.
> >
> > On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
> > operations are relatively slow. Unlike x86_64 - where per-CPU updates
> > can avoid cross-core atomicity, arm64 LSE atomics are always atomic
> > across all cores, which is unnecessary overhead for strictly per-CPU
> > state.
> >
> > This patch removes atomics from the recursion detection path on arm64 by
> > changing 'active' to a per-CPU array of four u8 counters, one per
> > context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
> > non-atomic increment/decrement on its element.  After increment,
> > recursion is detected by reading the array as a u32 and verifying that
> > only the expected element changed; any change in another element
> > indicates inter-context recursion, and a value > 1 in the same element
> > indicates same-context recursion.
> >
> > For example, starting from {0,0,0,0}, a normal-context trigger changes
> > the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
> > the program, the array becomes {1,0,0,1}. When the NMI context checks
> > the u32 against the expected mask for normal (0x00000001), it observes
> > 0x01000001 and correctly reports recursion. Same-context recursion is
> > detected analogously.
> >
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>
> LGTM with a few nits below.
>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>
> > ---
> >   include/linux/bpf.h | 33 ++++++++++++++++++++++++++++++---
> >   kernel/bpf/core.c   |  3 ++-
> >   2 files changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 2da986136d26..5ca2a761d9a1 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -31,6 +31,7 @@
> >   #include <linux/static_call.h>
> >   #include <linux/memcontrol.h>
> >   #include <linux/cfi.h>
> > +#include <linux/unaligned.h>
> >   #include <asm/rqspinlock.h>
> >
> >   struct bpf_verifier_env;
> > @@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
> >       struct bpf_map __rcu *st_ops_assoc;
> >   };
> >
> > +#define BPF_NR_CONTEXTS        4       /* normal, softirq, hardirq, NMI */
> > +
> >   struct bpf_prog {
> >       u16                     pages;          /* Number of allocated pages */
> >       u16                     jited:1,        /* Is our filter JIT'ed? */
> > @@ -1772,7 +1775,7 @@ struct bpf_prog {
> >               u8 tag[BPF_TAG_SIZE];
> >       };
> >       struct bpf_prog_stats __percpu *stats;
> > -     int __percpu            *active;
> > +     u8 __percpu             *active;        /* u8[BPF_NR_CONTEXTS] for rerecursion protection */
> >       unsigned int            (*bpf_func)(const void *ctx,
> >                                           const struct bpf_insn *insn);
> >       struct bpf_prog_aux     *aux;           /* Auxiliary fields */
> > @@ -2006,12 +2009,36 @@ struct bpf_struct_ops_common_value {
> >
> >   static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >   {
> > -     return this_cpu_inc_return(*(prog->active)) == 1;
> > +#ifdef CONFIG_ARM64
> > +     u8 rctx = interrupt_context_level();
> > +     u8 *active = this_cpu_ptr(prog->active);
> > +     u32 val;
> > +
> > +     preempt_disable();
> > +     active[rctx]++;
> > +     val = get_unaligned_le32(active);
>
> The 'active' already aligned with 8 (or 4 with my below suggestion).
> The get_unaligned_le32() works, but maybe we could use le32_to_cpu()
> instead. Maybe there is no performance difference between
> get_unaligned_le32() and le32_to_cpu() so you pick get_unaligned_le32()?
> It would be good to clarify in commit message if get_unaligned_le32()
> is used.

I will just use val = le32_to_cpu(*(__le32 *)active);

Thanks,
Puranjay

