* [PATCH bpf-next v3 0/2] bpf: Optimize recursion detection on arm64
From: Puranjay Mohan @ 2025-12-19 18:44 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
V2: https://lore.kernel.org/all/20251217233608.2374187-1-puranjay@kernel.org/
Changes in v2->v3:
- Added Acked-by tags from Yonghong
- Patch 2:
  - Change the alignment of 'active' from 8 to 4
  - Use le32_to_cpu() in place of get_unaligned_le32()
V1: https://lore.kernel.org/all/20251217162830.2597286-1-puranjay@kernel.org/
Changes in v1->v2:
- Patch 2:
  - Put preempt_disable()/preempt_enable() around the RMW accesses to
    avoid races: with CONFIG_PREEMPT_RCU and sleepable bpf programs,
    preemption in the middle of the update could leave no bpf program
    executing at all when recursion is detected.
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
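As a simplified sketch of the current guard (function names here are
illustrative; the real enter/exit paths are shown in patch 1):

  /* Illustrative sketch of today's per-CPU recursion guard; the
   * this_cpu_inc_return() below compiles to an atomic RMW on arm64.
   */
  static bool notrace sketch_enter(struct bpf_prog *prog)
  {
          if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                  bpf_prog_inc_misses_counter(prog);
                  return false;   /* recursion: skip this program */
          }
          return true;
  }

  static void notrace sketch_exit(struct bpf_prog *prog)
  {
          this_cpu_dec(*(prog->active));
  }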
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This series removes atomics from the recursion detection path on arm64.
It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms. Catalin addressed this in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations"), but the fix appears to have caused a
regression in the fentry benchmark.
Using the fentry benchmark from the bpf selftests shows the following:
./tools/testing/selftests/bpf/bench trig-fentry
+---------------------------------------------+------------------------+
| Configuration                               | Total Operations (M/s) |
+---------------------------------------------+------------------------+
| bpf-next/master with Catalin's fix reverted | 51.770                 |
+---------------------------------------------+------------------------+
| bpf-next/master                             | 43.271                 |
| bpf-next/master with this change            | 43.271                 |
+---------------------------------------------+------------------------+
All benchmarks were run in a KVM-based VM with Neoverse-V2 and 8 CPUs.
This series yields a 25% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.
For completeness, this benchmark was also run with the change enabled on
x86-64, where it caused a 30% regression in the fentry benchmark, so the
change is only enabled on arm64.
P.S. - Here is more data for other program types:
+-----------------+--------------+--------------+----------+
| Metric          | Before (M/s) | After (M/s)  | % Diff   |
+-----------------+--------------+--------------+----------+
| fentry          | 43.149       | 53.948       | +25.03%  |
| fentry.s        | 41.831       | 50.937       | +21.76%  |
| rawtp           | 50.834       | 58.731       | +15.53%  |
| fexit           | 31.118       | 34.360       | +10.42%  |
| tp              | 39.536       | 41.632       | +5.30%   |
| syscall-count   | 8.053        | 8.305        | +3.13%   |
| fmodret         | 33.940       | 34.769       | +2.44%   |
| kprobe          | 9.970        | 9.998        | +0.28%   |
| usermode-count  | 224.886      | 224.839      | -0.02%   |
| kernel-count    | 154.229      | 153.043      | -0.77%   |
+-----------------+--------------+--------------+----------+
[1] https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
Puranjay Mohan (2):
bpf: move recursion detection logic to helpers
bpf: arm64: Optimize recursion detection by not using atomics
include/linux/bpf.h | 38 +++++++++++++++++++++++++++++++++++++-
kernel/bpf/core.c | 3 ++-
kernel/bpf/trampoline.c | 8 ++++----
kernel/trace/bpf_trace.c | 4 ++--
4 files changed, 45 insertions(+), 8 deletions(-)
base-commit: ec439c38013550420aecc15988ae6acb670838c1
--
2.47.3
* [PATCH bpf-next v3 1/2] bpf: move recursion detection logic to helpers
From: Puranjay Mohan @ 2025-12-19 18:44 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel,
Yonghong Song
BPF programs detect recursion by doing an atomic inc/dec on a per-CPU
'active' counter from the trampoline. Create two helpers for operations
on this counter; this makes it easy to change the recursion detection
logic in the future.
This commit makes no functional changes.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
include/linux/bpf.h | 10 ++++++++++
kernel/bpf/trampoline.c | 8 ++++----
kernel/trace/bpf_trace.c | 4 ++--
3 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bb3847caeae1..2da986136d26 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2004,6 +2004,16 @@ struct bpf_struct_ops_common_value {
enum bpf_struct_ops_state state;
};
+static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
+{
+ return this_cpu_inc_return(*(prog->active)) == 1;
+}
+
+static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
+{
+ this_cpu_dec(*(prog->active));
+}
+
#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
/* This macro helps developer to register a struct_ops type and generate
* type information correctly. Developers should use this macro to register
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 976d89011b15..2a125d063e62 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -949,7 +949,7 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
if (prog->aux->recursion_detected)
prog->aux->recursion_detected(prog);
@@ -993,7 +993,7 @@ static void notrace __bpf_prog_exit_recur(struct bpf_prog *prog, u64 start,
bpf_reset_run_ctx(run_ctx->saved_run_ctx);
update_prog_stats(prog, start);
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
rcu_read_unlock_migrate();
}
@@ -1029,7 +1029,7 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
if (prog->aux->recursion_detected)
prog->aux->recursion_detected(prog);
@@ -1044,7 +1044,7 @@ void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
bpf_reset_run_ctx(run_ctx->saved_run_ctx);
update_prog_stats(prog, start);
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
migrate_enable();
rcu_read_unlock_trace();
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..6e076485bf70 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2063,7 +2063,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
struct bpf_trace_run_ctx run_ctx;
cant_sleep();
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
goto out;
}
@@ -2077,7 +2077,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
bpf_reset_run_ctx(old_run_ctx);
out:
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
}
#define UNPACK(...) __VA_ARGS__
--
2.47.3
* [PATCH bpf-next v3 2/2] bpf: arm64: Optimize recursion detection by not using atomics
From: Puranjay Mohan @ 2025-12-19 18:44 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel,
Yonghong Song
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike on x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element. After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.
For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
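The check itself is a single 32-bit load and compare. As a standalone
userspace sketch of the logic (little-endian assumed; the kernel version
operates on per-CPU data, derives rctx from interrupt_context_level(),
and uses le32_to_cpu() for the load; the names below are made up):

  /* Illustrative model of the 4-counter recursion check. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  static uint8_t active[4]; /* index 0 = normal ... index 3 = NMI */

  static bool demo_enter(unsigned int rctx) /* rctx in 0..3 */
  {
          uint32_t val;

          active[rctx]++;
          memcpy(&val, active, sizeof(val)); /* read all 4 counters */
          /* Only byte 'rctx' may be non-zero, and only equal to 1. */
          return val == (UINT32_C(1) << (rctx * 8));
  }

  static void demo_exit(unsigned int rctx)
  {
          active[rctx]--;
  }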
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
include/linux/bpf.h | 32 +++++++++++++++++++++++++++++---
kernel/bpf/core.c | 3 ++-
2 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2da986136d26..da6a00dd313f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1746,6 +1746,8 @@ struct bpf_prog_aux {
struct bpf_map __rcu *st_ops_assoc;
};
+#define BPF_NR_CONTEXTS 4 /* normal, softirq, hardirq, NMI */
+
struct bpf_prog {
u16 pages; /* Number of allocated pages */
u16 jited:1, /* Is our filter JIT'ed? */
@@ -1772,7 +1774,7 @@ struct bpf_prog {
u8 tag[BPF_TAG_SIZE];
};
struct bpf_prog_stats __percpu *stats;
- int __percpu *active;
+ u8 __percpu *active; /* u8[BPF_NR_CONTEXTS] for recursion protection */
unsigned int (*bpf_func)(const void *ctx,
const struct bpf_insn *insn);
struct bpf_prog_aux *aux; /* Auxiliary fields */
@@ -2006,12 +2008,36 @@ struct bpf_struct_ops_common_value {
static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
- return this_cpu_inc_return(*(prog->active)) == 1;
+#ifdef CONFIG_ARM64
+ u8 rctx = interrupt_context_level();
+ u8 *active = this_cpu_ptr(prog->active);
+ u32 val;
+
+ preempt_disable();
+ active[rctx]++;
+ val = le32_to_cpu(*(__le32 *)active);
+ preempt_enable();
+ if (val != BIT(rctx * 8))
+ return false;
+
+ return true;
+#else
+ return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
+#endif
}
static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
- this_cpu_dec(*(prog->active));
+#ifdef CONFIG_ARM64
+ u8 rctx = interrupt_context_level();
+ u8 *active = this_cpu_ptr(prog->active);
+
+ preempt_disable();
+ active[rctx]--;
+ preempt_enable();
+#else
+ this_cpu_dec(*(int __percpu *)(prog->active));
+#endif
}
#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c66316e32563..e0b8a8a5aaa9 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
vfree(fp);
return NULL;
}
- fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+ fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 4,
+ bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
if (!fp->active) {
vfree(fp);
kfree(aux);
--
2.47.3
* Re: [PATCH bpf-next v3 0/2] bpf: Optimize recursion detection on arm64
From: Alexei Starovoitov @ 2025-12-22 0:54 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel
On Fri, Dec 19, 2025 at 8:45 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> [...]
It was applied to bpf-next.
pw-bot is asleep.