* [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64
@ 2025-12-17 23:35 Puranjay Mohan
2025-12-17 23:35 ` [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
Changes in V1->V2:
- Patch 2:
  - Put preempt_disable()/preempt_enable() around the RMW accesses to
    mitigate race conditions: with CONFIG_PREEMPT_RCU and sleepable
    bpf programs, preemption in the middle of the update could cause
    no prog to execute.
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64.
It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms. Catalin added a fix in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations") to solve this issue, but it appears to have
caused a regression in the fentry benchmark.
Using the fentry benchmark from the bpf selftests shows the following:
./tools/testing/selftests/bpf/bench trig-fentry
+---------------------------------------------+------------------------+
| Configuration                               | Total Operations (M/s) |
+---------------------------------------------+------------------------+
| bpf-next/master with Catalin’s fix reverted | 51.862                 |
+---------------------------------------------+------------------------+
| bpf-next/master                             | 43.067                 |
| bpf-next/master with this change            | 53.856                 |
+---------------------------------------------+------------------------+
All benchmarks were run on a KVM-based VM with Neoverse-V2 and 8 CPUs.
This patch yields a 25% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.
For completeness, this benchmark was also run with the change enabled on
x86-64, which resulted in a 30% regression in the fentry benchmark. So,
it is only enabled on arm64.
[1] https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
Puranjay Mohan (2):
bpf: move recursion detection logic to helpers
bpf: arm64: Optimize recursion detection by not using atomics
include/linux/bpf.h | 39 ++++++++++++++++++++++++++++++++++++++-
kernel/bpf/core.c | 3 ++-
kernel/bpf/trampoline.c | 8 ++++----
kernel/trace/bpf_trace.c | 4 ++--
4 files changed, 46 insertions(+), 8 deletions(-)
base-commit: ec439c38013550420aecc15988ae6acb670838c1
--
2.47.3
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
    Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
    Kumar Kartikeya Dwivedi, kernel-team, Catalin Marinas, Will Deacon,
    Mark Rutland, linux-arm-kernel

BPF programs detect recursion by doing an atomic inc/dec on a per-CPU
active counter from the trampoline. Create two helpers for operations
on this active counter; this makes it easy to change the recursion
detection logic in the future.

No functional change intended.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h      | 10 ++++++++++
 kernel/bpf/trampoline.c  |  8 ++++----
 kernel/trace/bpf_trace.c |  4 ++--
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bb3847caeae1..2da986136d26 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2004,6 +2004,16 @@ struct bpf_struct_ops_common_value {
 	enum bpf_struct_ops_state state;
 };
 
+static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
+{
+	return this_cpu_inc_return(*(prog->active)) == 1;
+}
+
+static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
+{
+	this_cpu_dec(*(prog->active));
+}
+
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
 /* This macro helps developer to register a struct_ops type and generate
  * type information correctly. Developers should use this macro to register
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 976d89011b15..2a125d063e62 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -949,7 +949,7 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
 
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		if (prog->aux->recursion_detected)
 			prog->aux->recursion_detected(prog);
@@ -993,7 +993,7 @@ static void notrace __bpf_prog_exit_recur(struct bpf_prog *prog, u64 start,
 	bpf_reset_run_ctx(run_ctx->saved_run_ctx);
 
 	update_prog_stats(prog, start);
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 	rcu_read_unlock_migrate();
 }
 
@@ -1029,7 +1029,7 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
 
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		if (prog->aux->recursion_detected)
 			prog->aux->recursion_detected(prog);
@@ -1044,7 +1044,7 @@ void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
 	bpf_reset_run_ctx(run_ctx->saved_run_ctx);
 
 	update_prog_stats(prog, start);
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 	migrate_enable();
 	rcu_read_unlock_trace();
 }
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..6e076485bf70 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2063,7 +2063,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
 	struct bpf_trace_run_ctx run_ctx;
 
 	cant_sleep();
-	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+	if (unlikely(!bpf_prog_get_recursion_context(prog))) {
 		bpf_prog_inc_misses_counter(prog);
 		goto out;
 	}
@@ -2077,7 +2077,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
 
 	bpf_reset_run_ctx(old_run_ctx);
 out:
-	this_cpu_dec(*(prog->active));
+	bpf_prog_put_recursion_context(prog);
 }
 
 #define UNPACK(...) __VA_ARGS__
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: move recursion detection logic to helpers
From: Yonghong Song @ 2025-12-18 17:44 UTC (permalink / raw)
To: Puranjay Mohan, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
    kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
    linux-arm-kernel

On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> BPF programs detect recursion by doing atomic inc/dec on a per-cpu
> active counter from the trampoline. Create two helpers for operations on
> this active counter, this makes it easy to changes the recursion
> detection logic in future.
>
> This change makes no functional changes.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Yonghong Song <yonghong.song@linux.dev>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
From: Puranjay Mohan @ 2025-12-17 23:35 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
    Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
    Kumar Kartikeya Dwivedi, kernel-team, Catalin Marinas, Will Deacon,
    Mark Rutland, linux-arm-kernel

BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element. After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.

For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h | 33 ++++++++++++++++++++++++++++++---
 kernel/bpf/core.c   |  3 ++-
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2da986136d26..5ca2a761d9a1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -31,6 +31,7 @@
 #include <linux/static_call.h>
 #include <linux/memcontrol.h>
 #include <linux/cfi.h>
+#include <linux/unaligned.h>
 #include <asm/rqspinlock.h>
 
 struct bpf_verifier_env;
@@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
 	struct bpf_map __rcu *st_ops_assoc;
 };
 
+#define BPF_NR_CONTEXTS 4 /* normal, softirq, hardirq, NMI */
+
 struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
@@ -1772,7 +1775,7 @@ struct bpf_prog {
 		u8		tag[BPF_TAG_SIZE];
 	};
 	struct bpf_prog_stats __percpu *stats;
-	int __percpu		*active;
+	u8 __percpu		*active; /* u8[BPF_NR_CONTEXTS] for recursion protection */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
@@ -2006,12 +2009,36 @@ struct bpf_struct_ops_common_value {
 
 static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
 {
-	return this_cpu_inc_return(*(prog->active)) == 1;
+#ifdef CONFIG_ARM64
+	u8 rctx = interrupt_context_level();
+	u8 *active = this_cpu_ptr(prog->active);
+	u32 val;
+
+	preempt_disable();
+	active[rctx]++;
+	val = get_unaligned_le32(active);
+	preempt_enable();
+	if (val != BIT(rctx * 8))
+		return false;
+
+	return true;
+#else
+	return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
+#endif
 }
 
 static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
 {
-	this_cpu_dec(*(prog->active));
+#ifdef CONFIG_ARM64
+	u8 rctx = interrupt_context_level();
+	u8 *active = this_cpu_ptr(prog->active);
+
+	preempt_disable();
+	active[rctx]--;
+	preempt_enable();
+#else
+	this_cpu_dec(*(int __percpu *)(prog->active));
+#endif
 }
 
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c66316e32563..b5063acfcf92 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 		vfree(fp);
 		return NULL;
 	}
-	fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
+					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
 	if (!fp->active) {
 		vfree(fp);
 		kfree(aux);
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
From: Yonghong Song @ 2025-12-18 17:55 UTC (permalink / raw)
To: Puranjay Mohan, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
    kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
    linux-arm-kernel

On 12/17/25 3:35 PM, Puranjay Mohan wrote:
> BPF programs detect recursion using a per-CPU 'active' flag in struct
> bpf_prog. The trampoline currently sets/clears this flag with atomic
> operations.
[...]
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

LGTM with a few nits below.

Acked-by: Yonghong Song <yonghong.song@linux.dev>

[...]
> +	preempt_disable();
> +	active[rctx]++;
> +	val = get_unaligned_le32(active);

The 'active' already aligned with 8 (or 4 with my below suggestion).
The get_unaligned_le32() works, but maybe we could use le32_to_cpu()
instead. Maybe there is no performance difference between
get_unaligned_le32() and le32_to_cpu() so you pick get_unaligned_le32()?
It would be good to clarify in commit message if get_unaligned_le32()
is used.

[...]
> -	fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> +	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> +					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));

Here, the alignment is 8. Can it be 4 since the above reads a 32bit value?

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
From: Puranjay Mohan @ 2025-12-19 16:40 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
    kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
    linux-arm-kernel

On Thu, Dec 18, 2025 at 5:56 PM Yonghong Song <yonghong.song@linux.dev> wrote:
[...]
> > -	fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > +	fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > +					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
>
> Here, the alignment is 8. Can it be 4 since the above reads a 32bit value?

Yes, it should be 4. Will change in the next version and add your
Acked-by.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next v2 2/2] bpf: arm64: Optimize recursion detection by not using atomics
From: Puranjay Mohan @ 2025-12-19 18:23 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
    kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
    linux-arm-kernel

On Thu, Dec 18, 2025 at 5:56 PM Yonghong Song <yonghong.song@linux.dev> wrote:
[...]
> The get_unaligned_le32() works, but maybe we could use le32_to_cpu()
> instead. Maybe there is no performance difference between
> get_unaligned_le32() and le32_to_cpu() so you pick get_unaligned_le32()?
> It would be good to clarify in commit message if get_unaligned_le32()
> is used.

I will just use

	val = le32_to_cpu(*(__le32 *)active);

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next v2 0/2] bpf: Optimize recursion detection on arm64
From: Puranjay Mohan @ 2025-12-18  2:52 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
    kernel-team, Catalin Marinas, Will Deacon, Mark Rutland,
    linux-arm-kernel

On Wed, Dec 17, 2025 at 11:36 PM Puranjay Mohan <puranjay@kernel.org> wrote:
[...]
> All benchmarks were run on a KVM based vm with Neoverse-V2 and 8 cpus.

Here is some more data about other attach types:

+-----------------+-----------+-----------+----------+
| Metric          | Before    | After     | % Diff   |
+-----------------+-----------+-----------+----------+
| fentry          | 43.149    | 53.948    | +25.03%  |
| fentry.s        | 41.831    | 50.937    | +21.76%  |
| rawtp           | 50.834    | 58.731    | +15.53%  |
| fexit           | 31.118    | 34.360    | +10.42%  |
| tp              | 39.536    | 41.632    | +5.30%   |
| syscall-count   | 8.053     | 8.305     | +3.13%   |
| fmodret         | 33.940    | 34.769    | +2.44%   |
| kprobe          | 9.970     | 9.998     | +0.28%   |
| usermode-count  | 224.886   | 224.839   | -0.02%   |
| kernel-count    | 154.229   | 153.043   | -0.77%   |
+-----------------+-----------+-----------+----------+

^ permalink raw reply	[flat|nested] 8+ messages in thread