* [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64
@ 2025-12-17 16:28 Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
0 siblings, 2 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 16:28 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
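For reference, the hot path currently boils down to this per-CPU atomic
pair (simplified sketch of __bpf_prog_enter_recur()/__bpf_prog_exit_recur();
stats, run_ctx and RCU handling omitted):

	/* enter: this_cpu_inc_return() is a full atomic RMW with LSE */
	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		return 0;	/* recursion detected, skip the program */
	}
	/* ... run the program, then on exit: */
	this_cpu_dec(*(prog->active));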
This patch removes atomics from the recursion detection path on arm64.
It was discovered in [1] that per-CPU atomics that don't return a value
were extremely slow on some arm64 platforms. Catalin addressed this in
commit 535fdfc5a228 ("arm64: Use load LSE atomics for the non-return
per-CPU atomic operations"), but the fix seems to have caused a
regression on the fentry benchmark.
Using the fentry benchmark from the bpf selftests shows the following:
./tools/testing/selftests/bpf/bench trig-fentry
+---------------------------------------------+------------------------+
| Configuration | Total Operations (M/s) |
+---------------------------------------------+------------------------+
| bpf-next/master with Catalin’s fix reverted | 40.033 ± 0.090 |
|---------------------------------------------|------------------------|
| bpf-next/master | 34.617 ± 0.018 |
| bpf-next/master with this change | 46.122 ± 0.020 |
+---------------------------------------------+------------------------+
All benchmarks were run in a KVM-based VM with Neoverse-V2 and 8 CPUs.
This patch yields a 33% improvement in this benchmark compared to
bpf-next. Notably, reverting Catalin's fix also results in a performance
gain for this benchmark, which is interesting but expected.
For completeness, this benchmark was also run with the change enabled on
x86-64, which resulted in a 30% regression in the fentry benchmark. So,
it is only enabled on arm64.
[1] https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
Puranjay Mohan (2):
bpf: move recursion detection logic to helpers
bpf: arm64: Optimize recursion detection by not using atomics
include/linux/bpf.h | 35 ++++++++++++++++++++++++++++++++++-
kernel/bpf/core.c | 3 ++-
kernel/bpf/trampoline.c | 8 ++++----
kernel/trace/bpf_trace.c | 4 ++--
4 files changed, 42 insertions(+), 8 deletions(-)
base-commit: ec439c38013550420aecc15988ae6acb670838c1
--
2.47.3
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers
2025-12-17 16:28 [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
@ 2025-12-17 16:28 ` Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
1 sibling, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 16:28 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
BPF programs detect recursion by doing an atomic inc/dec on a per-CPU
'active' counter from the trampoline. Create two helpers that operate on
this counter; this makes it easier to change the recursion detection
logic in the future.
No functional change intended.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
include/linux/bpf.h | 10 ++++++++++
kernel/bpf/trampoline.c | 8 ++++----
kernel/trace/bpf_trace.c | 4 ++--
3 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bb3847caeae1..2da986136d26 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2004,6 +2004,16 @@ struct bpf_struct_ops_common_value {
enum bpf_struct_ops_state state;
};
+static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
+{
+ return this_cpu_inc_return(*(prog->active)) == 1;
+}
+
+static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
+{
+ this_cpu_dec(*(prog->active));
+}
+
#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
/* This macro helps developer to register a struct_ops type and generate
* type information correctly. Developers should use this macro to register
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 976d89011b15..2a125d063e62 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -949,7 +949,7 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
if (prog->aux->recursion_detected)
prog->aux->recursion_detected(prog);
@@ -993,7 +993,7 @@ static void notrace __bpf_prog_exit_recur(struct bpf_prog *prog, u64 start,
bpf_reset_run_ctx(run_ctx->saved_run_ctx);
update_prog_stats(prog, start);
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
rcu_read_unlock_migrate();
}
@@ -1029,7 +1029,7 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
if (prog->aux->recursion_detected)
prog->aux->recursion_detected(prog);
@@ -1044,7 +1044,7 @@ void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
bpf_reset_run_ctx(run_ctx->saved_run_ctx);
update_prog_stats(prog, start);
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
migrate_enable();
rcu_read_unlock_trace();
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe28d86f7c35..6e076485bf70 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2063,7 +2063,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
struct bpf_trace_run_ctx run_ctx;
cant_sleep();
- if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
+ if (unlikely(!bpf_prog_get_recursion_context(prog))) {
bpf_prog_inc_misses_counter(prog);
goto out;
}
@@ -2077,7 +2077,7 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
bpf_reset_run_ctx(old_run_ctx);
out:
- this_cpu_dec(*(prog->active));
+ bpf_prog_put_recursion_context(prog);
}
#define UNPACK(...) __VA_ARGS__
--
2.47.3
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 16:28 [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
@ 2025-12-17 16:28 ` Puranjay Mohan
2025-12-17 16:56 ` bot+bpf-ci
1 sibling, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 16:28 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team,
Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64, where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element. After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.
For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for NMI (0x01000000), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
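The check can be illustrated with a small stand-alone sketch (user space
only, little-endian assumed; the kernel code uses get_unaligned_le32()
and interrupt_context_level() instead of the open-coded equivalents
here):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* Bump element rctx of the u8[4], then read the whole array as a
	 * little-endian u32 and compare it with the value expected when
	 * only that element is 1. Any other non-zero byte means recursion.
	 */
	static int enter(uint8_t *active, unsigned int rctx)
	{
		uint32_t word;

		active[rctx]++;
		memcpy(&word, active, sizeof(word));
		return word == (1u << (rctx * 8));
	}

	int main(void)
	{
		uint8_t active[4] = {0};

		printf("normal: %d\n", enter(active, 0)); /* 1 -> allowed   */
		printf("NMI:    %d\n", enter(active, 3)); /* 0 -> recursion */
		return 0;
	}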
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
include/linux/bpf.h | 25 ++++++++++++++++++++++++-
kernel/bpf/core.c | 3 ++-
2 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2da986136d26..654fb94bf60c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -31,6 +31,7 @@
#include <linux/static_call.h>
#include <linux/memcontrol.h>
#include <linux/cfi.h>
+#include <linux/unaligned.h>
#include <asm/rqspinlock.h>
struct bpf_verifier_env;
@@ -1746,6 +1747,8 @@ struct bpf_prog_aux {
struct bpf_map __rcu *st_ops_assoc;
};
+#define BPF_NR_CONTEXTS 4 /* normal, softirq, hardirq, NMI */
+
struct bpf_prog {
u16 pages; /* Number of allocated pages */
u16 jited:1, /* Is our filter JIT'ed? */
@@ -1772,7 +1775,7 @@ struct bpf_prog {
u8 tag[BPF_TAG_SIZE];
};
struct bpf_prog_stats __percpu *stats;
- int __percpu *active;
+	u8 __percpu *active; /* u8[BPF_NR_CONTEXTS] for recursion protection */
unsigned int (*bpf_func)(const void *ctx,
const struct bpf_insn *insn);
struct bpf_prog_aux *aux; /* Auxiliary fields */
@@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
+#ifdef CONFIG_ARM64
+ u8 rctx = interrupt_context_level();
+ u8 *active = this_cpu_ptr(prog->active);
+
+ active[rctx]++;
+ barrier();
+ if (get_unaligned_le32(active) != BIT(rctx * 8))
+ return false;
+
+ return true;
+#else
return this_cpu_inc_return(*(prog->active)) == 1;
+#endif
}
static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
+#ifdef CONFIG_ARM64
+ u8 rctx = interrupt_context_level();
+ u8 *active = this_cpu_ptr(prog->active);
+
+ barrier();
+ active[rctx]--;
+#else
this_cpu_dec(*(prog->active));
+#endif
}
#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c66316e32563..b5063acfcf92 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
vfree(fp);
return NULL;
}
- fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+ fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
+ bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
if (!fp->active) {
vfree(fp);
kfree(aux);
--
2.47.3
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
@ 2025-12-17 16:56 ` bot+bpf-ci
2025-12-17 18:13 ` Puranjay Mohan
2025-12-17 18:23 ` Yonghong Song
0 siblings, 2 replies; 13+ messages in thread
From: bot+bpf-ci @ 2025-12-17 16:56 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, kernel-team, catalin.marinas, will, mark.rutland,
linux-arm-kernel, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 2da986136d26..654fb94bf60c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
[ ... ]
> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>
> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> {
> +#ifdef CONFIG_ARM64
> + u8 rctx = interrupt_context_level();
> + u8 *active = this_cpu_ptr(prog->active);
> +
> + active[rctx]++;
> + barrier();
> + if (get_unaligned_le32(active) != BIT(rctx * 8))
> + return false;
> +
> + return true;
> +#else
> return this_cpu_inc_return(*(prog->active)) == 1;
> +#endif
> }
Can preemption between the increment and check cause a counter leak on
CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
(documented at include/linux/rcupdate.h:856).
Consider this scenario on an ARM64 system with PREEMPT_RCU:
1. Thread A increments active[0] to 1
2. Preemption occurs before Thread A reaches the check
3. Thread B on same CPU increments active[0] to 2
4. Thread B checks: sees 2 != BIT(0), returns false
5. Thread A resumes, checks: sees 2 != BIT(0), returns false
6. Both threads return false, neither runs BPF
7. Neither calls bpf_prog_put_recursion_context() (see
__bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
8. Counter permanently stuck at 2, all future BPF on this CPU fails
The old atomic code handled this correctly because this_cpu_inc_return()
completes atomically, ensuring Thread A reads the value 1 before Thread B
can interfere. With non-atomic operations, Thread A increments but hasn't
read the value yet, allowing Thread B to interfere and cause both threads
to see the modified value.
Is there a guarantee that same-context preemption cannot occur in the BPF
execution path on ARM64, or does this need additional synchronization?
> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>
> static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> {
> +#ifdef CONFIG_ARM64
> + u8 rctx = interrupt_context_level();
> + u8 *active = this_cpu_ptr(prog->active);
> +
> + barrier();
> + active[rctx]--;
> +#else
> this_cpu_dec(*(prog->active));
> +#endif
> }
The barrier() is placed before the decrement. Does this provide the needed
ordering guarantee, or should there also be a barrier after the decrement
to ensure visibility?
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index c66316e32563..b5063acfcf92 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
[ ... ]
> @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> vfree(fp);
> return NULL;
> }
> - fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> + fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> + bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
The allocation changes to u8[4] for all architectures, but the non-ARM64
code path still uses *(prog->active) which now dereferences only the first
byte. This changes the counter from int (typically 4 bytes, range ~2^31)
to u8 (1 byte, range 256). While 256 recursion levels is sufficient in
practice, is this intentional for non-ARM64 platforms, or should they
continue using a full int-sized allocation?
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20310266487
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 16:56 ` bot+bpf-ci
@ 2025-12-17 18:13 ` Puranjay Mohan
2025-12-17 18:46 ` Alexei Starovoitov
2025-12-17 18:23 ` Yonghong Song
1 sibling, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 18:13 UTC (permalink / raw)
To: bot+bpf-ci
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, memxor,
kernel-team, catalin.marinas, will, mark.rutland,
linux-arm-kernel, yonghong.song, clm, ihor.solodrai
On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 2da986136d26..654fb94bf60c 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
>
> [ ... ]
>
> > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >
> > static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > {
> > +#ifdef CONFIG_ARM64
> > + u8 rctx = interrupt_context_level();
> > + u8 *active = this_cpu_ptr(prog->active);
> > +
> > + active[rctx]++;
> > + barrier();
> > + if (get_unaligned_le32(active) != BIT(rctx * 8))
> > + return false;
> > +
> > + return true;
> > +#else
> > return this_cpu_inc_return(*(prog->active)) == 1;
> > +#endif
> > }
>
> Can preemption between the increment and check cause a counter leak on
> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> (documented at include/linux/rcupdate.h:856).
>
> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>
> 1. Thread A increments active[0] to 1
> 2. Preemption occurs before Thread A reaches the check
> 3. Thread B on same CPU increments active[0] to 2
> 4. Thread B checks: sees 2 != BIT(0), returns false
> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> 6. Both threads return false, neither runs BPF
> 7. Neither calls bpf_prog_put_recursion_context() (see
> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
Step 7 is incorrect. Looking at the JIT-generated code, the exit
function is ALWAYS called, regardless of whether the enter function
returns 0 or a start time:
// x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
call bpf_trampoline_enter() // Line 2998
test rax, rax // Line 3006
je skip_exec // Conditional jump
... BPF program execution ... // Lines 3011-3023
skip_exec: // Line 3037 (jump lands here)
call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
The bpf_trampoline_exit() call is after the skip_exec label, so it
executes in both cases.
What Actually Happens:
Initial state: active[0] = 0
Thread A (normal context, rctx=0):
1. active[0]++ → active[0] = 1
2. Preempted before barrier()
Thread B (scheduled on same CPU, normal context, rctx=0):
3. active[0]++ → active[0] = 2
4. barrier()
5. get_unaligned_le32(active) → reads 0x00000002
6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
7. __bpf_prog_enter_recur returns 0
8. JIT checks return value, skips BPF execution
9. JIT ALWAYS calls __bpf_prog_exit_recur (see
arch/arm64/net/bpf_jit_comp.c:2362)
10. bpf_prog_put_recursion_context(prog) executes
11. barrier(), active[0]-- → active[0] = 1
Thread A resumes:
12. barrier()
13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
decremented!)
14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
15. __bpf_prog_enter_recur returns start_time
16. BPF program executes
17. __bpf_prog_exit_recur called
18. bpf_prog_put_recursion_context(prog) executes
19. barrier(), active[0]-- → active[0] = 0 ✓
Final State
- Counter returns to 0 ✓
- No leak ✓
- Thread B detected interference and aborted ✓
- Thread A executed successfully ✓
- Only ONE thread executed the BPF program ✓
Now that I think of it, there is another race condition that leads to
NEITHER program running:
Consider this scenario on an arm64 system with PREEMPT_RCU:
1. Thread A increments active[0] from 0 to 1
2. Thread A is preempted before reaching barrier()
3. Thread B (same CPU, same context) increments active[0] from 1 to 2
4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
returns false
6. Both threads return false to __bpf_prog_enter_recur()
7. Both skip BPF program execution
8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
9. Neither BPF program executes, but the counter correctly returns to 0
This means the patch is changing the behaviour in case of recursion
from "One program gets to run" to
"At most one program gets to run", but given the performance benefits,
I think we can accept this change.
I am not sure what would change in reality if we allow both BPF
programs to be skipped in case of high interleaving
like above.
> The old atomic code handled this correctly because this_cpu_inc_return()
> completes atomically, ensuring Thread A reads the value 1 before Thread B
> can interfere. With non-atomic operations, Thread A increments but hasn't
> read the value yet, allowing Thread B to interfere and cause both threads
> to see the modified value.
>
> Is there a guarantee that same-context preemption cannot occur in the BPF
> execution path on ARM64, or does this need additional synchronization?
>
> > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >
> > static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> > {
> > +#ifdef CONFIG_ARM64
> > + u8 rctx = interrupt_context_level();
> > + u8 *active = this_cpu_ptr(prog->active);
> > +
> > + barrier();
> > + active[rctx]--;
> > +#else
> > this_cpu_dec(*(prog->active));
> > +#endif
> > }
>
> The barrier() is placed before the decrement. Does this provide the needed
> ordering guarantee, or should there also be a barrier after the decrement
> to ensure visibility?
>
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index c66316e32563..b5063acfcf92 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
>
> [ ... ]
>
> > @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> > vfree(fp);
> > return NULL;
> > }
> > - fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > + fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > + bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
>
> The allocation changes to u8[4] for all architectures, but the non-ARM64
> code path still uses *(prog->active) which now dereferences only the first
> byte. This changes the counter from int (typically 4 bytes, range ~2^31)
> to u8 (1 byte, range 256). While 256 recursion levels is sufficient in
> practice, is this intentional for non-ARM64 platforms, or should they
> continue using a full int-sized allocation?
This was intentional so that we don't add another #ifdef here; 256
recursion levels are more than enough.
But to keep the previous number of recursion levels, I could change the
non-ARM64 code to do something like:
this_cpu_dec(*(int __percpu *)(prog->active));
I can do this in the next version.
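For reference, a rough sketch of what that fallback could look like
(illustration only, not the final patch; the arm64 branch is elided):

static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
#ifdef CONFIG_ARM64
	/* per-context u8 counters, as in this patch */
	...
#else
	/* treat the per-CPU u8[4] area as a single int counter */
	return this_cpu_inc_return(*(int __percpu *)(prog->active)) == 1;
#endif
}

static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
#ifdef CONFIG_ARM64
	...
#else
	this_cpu_dec(*(int __percpu *)(prog->active));
#endif
}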
Thanks,
Puranjay
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 16:56 ` bot+bpf-ci
2025-12-17 18:13 ` Puranjay Mohan
@ 2025-12-17 18:23 ` Yonghong Song
2025-12-17 18:44 ` Puranjay Mohan
1 sibling, 1 reply; 13+ messages in thread
From: Yonghong Song @ 2025-12-17 18:23 UTC (permalink / raw)
To: bot+bpf-ci, puranjay, bpf
Cc: puranjay12, ast, andrii, daniel, martin.lau, eddyz87, memxor,
kernel-team, catalin.marinas, will, mark.rutland,
linux-arm-kernel, clm, ihor.solodrai
On 12/17/25 8:56 AM, bot+bpf-ci@kernel.org wrote:
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 2da986136d26..654fb94bf60c 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
> [ ... ]
>
>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>
>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>> {
>> +#ifdef CONFIG_ARM64
>> + u8 rctx = interrupt_context_level();
>> + u8 *active = this_cpu_ptr(prog->active);
>> +
>> + active[rctx]++;
>> + barrier();
>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
>> + return false;
>> +
>> + return true;
>> +#else
>> return this_cpu_inc_return(*(prog->active)) == 1;
>> +#endif
>> }
> Can preemption between the increment and check cause a counter leak on
> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> (documented at include/linux/rcupdate.h:856).
>
> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>
> 1. Thread A increments active[0] to 1
> 2. Preemption occurs before Thread A reaches the check
> 3. Thread B on same CPU increments active[0] to 2
> 4. Thread B checks: sees 2 != BIT(0), returns false
> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> 6. Both threads return false, neither runs BPF
> 7. Neither calls bpf_prog_put_recursion_context() (see
> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>
> The old atomic code handled this correctly because this_cpu_inc_return()
> completes atomically, ensuring Thread A reads the value 1 before Thread B
> can interfere. With non-atomic operations, Thread A increments but hasn't
> read the value yet, allowing Thread B to interfere and cause both threads
> to see the modified value.
>
> Is there a guarantee that same-context preemption cannot occur in the BPF
> execution path on ARM64, or does this need additional synchronization?
AI is correct here. See below for another thread:
https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev/T/#m906fd4502fbbedd4609c586122a393363003312a
where preempt_disable is necessary to prevent the above scenario.
But adding preempt_disable may impact the overall system-level performance.
Does this patch improve performance for *all* ARM64 CPU versions?
Do you have numbers showing how much the performance improves?
>
>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>
>> static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
>> {
>> +#ifdef CONFIG_ARM64
>> + u8 rctx = interrupt_context_level();
>> + u8 *active = this_cpu_ptr(prog->active);
>> +
>> + barrier();
>> + active[rctx]--;
>> +#else
>> this_cpu_dec(*(prog->active));
>> +#endif
>> }
[...]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:23 ` Yonghong Song
@ 2025-12-17 18:44 ` Puranjay Mohan
2025-12-17 18:47 ` Alexei Starovoitov
2025-12-17 19:45 ` Yonghong Song
0 siblings, 2 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 18:44 UTC (permalink / raw)
To: Yonghong Song
Cc: bot+bpf-ci, bpf, ast, andrii, daniel, martin.lau, eddyz87, memxor,
kernel-team, catalin.marinas, will, mark.rutland,
linux-arm-kernel, clm, ihor.solodrai
On Wed, Dec 17, 2025 at 6:24 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 12/17/25 8:56 AM, bot+bpf-ci@kernel.org wrote:
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 2da986136d26..654fb94bf60c 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> > [ ... ]
> >
> >> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >>
> >> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >> {
> >> +#ifdef CONFIG_ARM64
> >> + u8 rctx = interrupt_context_level();
> >> + u8 *active = this_cpu_ptr(prog->active);
> >> +
> >> + active[rctx]++;
> >> + barrier();
> >> + if (get_unaligned_le32(active) != BIT(rctx * 8))
> >> + return false;
> >> +
> >> + return true;
> >> +#else
> >> return this_cpu_inc_return(*(prog->active)) == 1;
> >> +#endif
> >> }
> > Can preemption between the increment and check cause a counter leak on
> > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > (documented at include/linux/rcupdate.h:856).
> >
> > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> >
> > 1. Thread A increments active[0] to 1
> > 2. Preemption occurs before Thread A reaches the check
> > 3. Thread B on same CPU increments active[0] to 2
> > 4. Thread B checks: sees 2 != BIT(0), returns false
> > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > 6. Both threads return false, neither runs BPF
> > 7. Neither calls bpf_prog_put_recursion_context() (see
> > __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> >
> > The old atomic code handled this correctly because this_cpu_inc_return()
> > completes atomically, ensuring Thread A reads the value 1 before Thread B
> > can interfere. With non-atomic operations, Thread A increments but hasn't
> > read the value yet, allowing Thread B to interfere and cause both threads
> > to see the modified value.
> >
> > Is there a guarantee that same-context preemption cannot occur in the BPF
> > execution path on ARM64, or does this need additional synchronization?
>
> AI is correct here. See below for another thread:
> https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev/T/#m906fd4502fbbedd4609c586122a393363003312a
> where preempt_disable is necessary to prevent the above scenario.
See my other reply; the above scenario presented by AI is wrong
because step 7 is wrong.
> But adding preempt_disable may impact the overall system-level performance.
>
> Does this patch improve performance for *all* ARM64 CPU versions?
> Do you have numbers showing how much the performance improves?
This should improve performance on all arm64 CPUs because atomics are
expensive there: they are atomic across all CPUs.
I see a 33% improvement in the fentry trigger benchmark, but I can do
more benchmarking.
> >
> >> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >>
> >> static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> >> {
> >> +#ifdef CONFIG_ARM64
> >> + u8 rctx = interrupt_context_level();
> >> + u8 *active = this_cpu_ptr(prog->active);
> >> +
> >> + barrier();
> >> + active[rctx]--;
> >> +#else
> >> this_cpu_dec(*(prog->active));
> >> +#endif
> >> }
> [...]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:13 ` Puranjay Mohan
@ 2025-12-17 18:46 ` Alexei Starovoitov
2025-12-17 18:56 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-17 18:46 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel, Yonghong Song,
Chris Mason, Ihor Solodrai
On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
> >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 2da986136d26..654fb94bf60c 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> >
> > [ ... ]
> >
> > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > >
> > > static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > > {
> > > +#ifdef CONFIG_ARM64
> > > + u8 rctx = interrupt_context_level();
> > > + u8 *active = this_cpu_ptr(prog->active);
> > > +
> > > + active[rctx]++;
> > > + barrier();
> > > + if (get_unaligned_le32(active) != BIT(rctx * 8))
> > > + return false;
> > > +
> > > + return true;
> > > +#else
> > > return this_cpu_inc_return(*(prog->active)) == 1;
> > > +#endif
> > > }
> >
> > Can preemption between the increment and check cause a counter leak on
> > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > (documented at include/linux/rcupdate.h:856).
> >
> > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> >
> > 1. Thread A increments active[0] to 1
> > 2. Preemption occurs before Thread A reaches the check
> > 3. Thread B on same CPU increments active[0] to 2
> > 4. Thread B checks: sees 2 != BIT(0), returns false
> > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > 6. Both threads return false, neither runs BPF
> > 7. Neither calls bpf_prog_put_recursion_context() (see
> > __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>
> Step 7 is incorrect. Looking at the JIT-generated code, the exit
> function is ALWAYS called, regardless of whether the enter function
> returns 0 or a start time:
>
> // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
> call bpf_trampoline_enter() // Line 2998
> test rax, rax // Line 3006
> je skip_exec // Conditional jump
> ... BPF program execution ... // Lines 3011-3023
> skip_exec: // Line 3037 (jump lands here)
> call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
>
> The bpf_trampoline_exit() call is after the skip_exec label, so it
> executes in both cases.
>
> What Actually Happens:
>
> Initial state: active[0] = 0
>
> Thread A (normal context, rctx=0):
> 1. active[0]++ → active[0] = 1
> 2. Preempted before barrier()
>
> Thread B (scheduled on same CPU, normal context, rctx=0):
> 3. active[0]++ → active[0] = 2
> 4. barrier()
> 5. get_unaligned_le32(active) → reads 0x00000002
> 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
> 7. __bpf_prog_enter_recur returns 0
> 8. JIT checks return value, skips BPF execution
> 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> arch/arm64/net/bpf_jit_comp.c:2362)
> 10. bpf_prog_put_recursion_context(prog) executes
> 11. barrier(), active[0]-- → active[0] = 1
>
> Thread A resumes:
> 12. barrier()
> 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> decremented!)
> 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
> 15. __bpf_prog_enter_recur returns start_time
> 16. BPF program executes
> 17. __bpf_prog_exit_recur called
> 18. bpf_prog_put_recursion_context(prog) executes
> 19. barrier(), active[0]-- → active[0] = 0 ✓
>
> Final State
>
> - Counter returns to 0 ✓
> - No leak ✓
> - Thread B detected interference and aborted ✓
> - Thread A executed successfully ✓
> - Only ONE thread executed the BPF program ✓
>
>
> Now that I think of it, there is another race condition that leads to
> NEITHER program running:
>
> Consider this scenario on an arm64 system with PREEMPT_RCU:
>
> 1. Thread A increments active[0] from 0 to 1
> 2. Thread A is preempted before reaching barrier()
> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> returns false
> 6. Both threads return false to __bpf_prog_enter_recur()
> 7. Both skip BPF program execution
> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> 9. Neither BPF program executes, but the counter correctly returns to 0
>
> This means the patch is changing the behaviour in case of recursion
> from "One program gets to run" to
> "At most one program gets to run", but given the performance benefits,
> I think we can accept this change.
Agree. It's fine, but we can mitigate it by doing this rctx trick
only when RCU is not preemptable. Which pretty much would mean
that PREEMPT_RT will use atomic and !RT will use rctx
and this 'no prog executes' will not happen.
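Roughly (illustration only, the exact config gate is open):

#if defined(CONFIG_ARM64) && !defined(CONFIG_PREEMPT_RCU)
	/* non-atomic per-context counters (the rctx trick) */
	...
#else
	return this_cpu_inc_return(*(prog->active)) == 1;
#endif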
> I am not sure what would change in reality if we allow both BPF
> programs to be skipped in case of high interleaving
> like above.
>
> > The old atomic code handled this correctly because this_cpu_inc_return()
> > completes atomically, ensuring Thread A reads the value 1 before Thread B
> > can interfere. With non-atomic operations, Thread A increments but hasn't
> > read the value yet, allowing Thread B to interfere and cause both threads
> > to see the modified value.
> >
> > Is there a guarantee that same-context preemption cannot occur in the BPF
> > execution path on ARM64, or does this need additional synchronization?
> >
> > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > >
> > > static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
> > > {
> > > +#ifdef CONFIG_ARM64
> > > + u8 rctx = interrupt_context_level();
> > > + u8 *active = this_cpu_ptr(prog->active);
> > > +
> > > + barrier();
> > > + active[rctx]--;
> > > +#else
> > > this_cpu_dec(*(prog->active));
> > > +#endif
> > > }
> >
> > The barrier() is placed before the decrement. Does this provide the needed
> > ordering guarantee, or should there also be a barrier after the decrement
> > to ensure visibility?
> >
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index c66316e32563..b5063acfcf92 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> >
> > [ ... ]
> >
> > > @@ -112,7 +112,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> > > vfree(fp);
> > > return NULL;
> > > }
> > > - fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> > > + fp->active = __alloc_percpu_gfp(sizeof(u8[BPF_NR_CONTEXTS]), 8,
> > > + bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
> >
> > The allocation changes to u8[4] for all architectures, but the non-ARM64
> > code path still uses *(prog->active) which now dereferences only the first
> > byte. This changes the counter from int (typically 4 bytes, range ~2^31)
> > to u8 (1 byte, range 256). While 256 recursion levels is sufficient in
> > practice, is this intentional for non-ARM64 platforms, or should they
> > continue using a full int-sized allocation?
>
> This was intentional so that we don't add another #ifdef here, 256
> recursion levels are more than enough.
256 will be fine for !RT.
So the suggestion above will address this concern as well when 256+ tasks
are racing concurrently on this cpu.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:44 ` Puranjay Mohan
@ 2025-12-17 18:47 ` Alexei Starovoitov
2025-12-17 19:45 ` Yonghong Song
1 sibling, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-17 18:47 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Yonghong Song, bot+bpf-ci, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel, Chris Mason,
Ihor Solodrai
On Wed, Dec 17, 2025 at 10:44 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> On Wed, Dec 17, 2025 at 6:24 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> >
> >
> > On 12/17/25 8:56 AM, bot+bpf-ci@kernel.org wrote:
> > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > >> index 2da986136d26..654fb94bf60c 100644
> > >> --- a/include/linux/bpf.h
> > >> +++ b/include/linux/bpf.h
> > > [ ... ]
> > >
> > >> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > >>
> > >> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > >> {
> > >> +#ifdef CONFIG_ARM64
> > >> + u8 rctx = interrupt_context_level();
> > >> + u8 *active = this_cpu_ptr(prog->active);
> > >> +
> > >> + active[rctx]++;
> > >> + barrier();
> > >> + if (get_unaligned_le32(active) != BIT(rctx * 8))
> > >> + return false;
> > >> +
> > >> + return true;
> > >> +#else
> > >> return this_cpu_inc_return(*(prog->active)) == 1;
> > >> +#endif
> > >> }
> > > Can preemption between the increment and check cause a counter leak on
> > > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > > (documented at include/linux/rcupdate.h:856).
> > >
> > > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> > >
> > > 1. Thread A increments active[0] to 1
> > > 2. Preemption occurs before Thread A reaches the check
> > > 3. Thread B on same CPU increments active[0] to 2
> > > 4. Thread B checks: sees 2 != BIT(0), returns false
> > > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > > 6. Both threads return false, neither runs BPF
> > > 7. Neither calls bpf_prog_put_recursion_context() (see
> > > __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> > >
> > > The old atomic code handled this correctly because this_cpu_inc_return()
> > > completes atomically, ensuring Thread A reads the value 1 before Thread B
> > > can interfere. With non-atomic operations, Thread A increments but hasn't
> > > read the value yet, allowing Thread B to interfere and cause both threads
> > > to see the modified value.
> > >
> > > Is there a guarantee that same-context preemption cannot occur in the BPF
> > > execution path on ARM64, or does this need additional synchronization?
> >
> > AI is correct here. See below for another thread:
> > https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev/T/#m906fd4502fbbedd4609c586122a393363003312a
> > where preempt_disable is necessary to prevent the above scenario.
>
> See my other reply, the above scenario presented by AI is wrong
> because step 7 is wrong.
yep. preempt_disable is not necessary here. perf buffers being
reused is a different issue.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:46 ` Alexei Starovoitov
@ 2025-12-17 18:56 ` Puranjay Mohan
2025-12-17 21:31 ` Yonghong Song
0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 18:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel, Yonghong Song,
Chris Mason, Ihor Solodrai
On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> > On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
> > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 2da986136d26..654fb94bf60c 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > >
> > > [ ... ]
> > >
> > > > @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> > > >
> > > > static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> > > > {
> > > > +#ifdef CONFIG_ARM64
> > > > + u8 rctx = interrupt_context_level();
> > > > + u8 *active = this_cpu_ptr(prog->active);
> > > > +
> > > > + active[rctx]++;
> > > > + barrier();
> > > > + if (get_unaligned_le32(active) != BIT(rctx * 8))
> > > > + return false;
> > > > +
> > > > + return true;
> > > > +#else
> > > > return this_cpu_inc_return(*(prog->active)) == 1;
> > > > +#endif
> > > > }
> > >
> > > Can preemption between the increment and check cause a counter leak on
> > > CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> > > rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> > > (documented at include/linux/rcupdate.h:856).
> > >
> > > Consider this scenario on an ARM64 system with PREEMPT_RCU:
> > >
> > > 1. Thread A increments active[0] to 1
> > > 2. Preemption occurs before Thread A reaches the check
> > > 3. Thread B on same CPU increments active[0] to 2
> > > 4. Thread B checks: sees 2 != BIT(0), returns false
> > > 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> > > 6. Both threads return false, neither runs BPF
> > > 7. Neither calls bpf_prog_put_recursion_context() (see
> > > __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> > > 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> >
> > Step 7 is incorrect. Looking at the JIT-generated code, the exit
> > function is ALWAYS called, regardless of whether the enter function
> > returns 0 or a start time:
> >
> > // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
> > call bpf_trampoline_enter() // Line 2998
> > test rax, rax // Line 3006
> > je skip_exec // Conditional jump
> > ... BPF program execution ... // Lines 3011-3023
> > skip_exec: // Line 3037 (jump lands here)
> > call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
> >
> > The bpf_trampoline_exit() call is after the skip_exec label, so it
> > executes in both cases.
> >
> > What Actually Happens:
> >
> > Initial state: active[0] = 0
> >
> > Thread A (normal context, rctx=0):
> > 1. active[0]++ → active[0] = 1
> > 2. Preempted before barrier()
> >
> > Thread B (scheduled on same CPU, normal context, rctx=0):
> > 3. active[0]++ → active[0] = 2
> > 4. barrier()
> > 5. get_unaligned_le32(active) → reads 0x00000002
> > 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
> > 7. __bpf_prog_enter_recur returns 0
> > 8. JIT checks return value, skips BPF execution
> > 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> > arch/arm64/net/bpf_jit_comp.c:2362)
> > 10. bpf_prog_put_recursion_context(prog) executes
> > 11. barrier(), active[0]-- → active[0] = 1
> >
> > Thread A resumes:
> > 12. barrier()
> > 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> > decremented!)
> > 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
> > 15. __bpf_prog_enter_recur returns start_time
> > 16. BPF program executes
> > 17. __bpf_prog_exit_recur called
> > 18. bpf_prog_put_recursion_context(prog) executes
> > 19. barrier(), active[0]-- → active[0] = 0 ✓
> >
> > Final State
> >
> > - Counter returns to 0 ✓
> > - No leak ✓
> > - Thread B detected interference and aborted ✓
> > - Thread A executed successfully ✓
> > - Only ONE thread executed the BPF program ✓
> >
> >
> > Now that I think of it, there is another race condition that leads to
> > NEITHER program running:
> >
> > Consider this scenario on an arm64 system with PREEMPT_RCU:
> >
> > 1. Thread A increments active[0] from 0 to 1
> > 2. Thread A is preempted before reaching barrier()
> > 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> > 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> > 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> > returns false
> > 6. Both threads return false to __bpf_prog_enter_recur()
> > 7. Both skip BPF program execution
> > 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> > 9. Neither BPF program executes, but the counter correctly returns to 0
> >
> > This means the patch is changing the behaviour in case of recursion
> > from "One program gets to run" to
> > "At most one program gets to run", but given the performance benefits,
> > I think we can accept this change.
>
> Agree. It's fine, but we can mitigate it, but doing this rctx trick
> only when RCU is not preemptable. Which pretty much would mean
> that PREEMPT_RT will use atomic and !RT will use rctx
> and this 'no prog executes' will not happen.
The issue also exists for sleepable programs: they use
rcu_read_lock_trace() and can end up in the
'no prog executes' scenario.
What do you think is the best approach for them?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:44 ` Puranjay Mohan
2025-12-17 18:47 ` Alexei Starovoitov
@ 2025-12-17 19:45 ` Yonghong Song
1 sibling, 0 replies; 13+ messages in thread
From: Yonghong Song @ 2025-12-17 19:45 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, ast, andrii, daniel, martin.lau, eddyz87, memxor,
kernel-team, catalin.marinas, will, mark.rutland,
linux-arm-kernel, clm, ihor.solodrai
On 12/17/25 10:44 AM, Puranjay Mohan wrote:
> On Wed, Dec 17, 2025 at 6:24 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>> On 12/17/25 8:56 AM, bot+bpf-ci@kernel.org wrote:
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index 2da986136d26..654fb94bf60c 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>> [ ... ]
>>>
>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>
>>>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>>>> {
>>>> +#ifdef CONFIG_ARM64
>>>> + u8 rctx = interrupt_context_level();
>>>> + u8 *active = this_cpu_ptr(prog->active);
>>>> +
>>>> + active[rctx]++;
>>>> + barrier();
>>>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
>>>> + return false;
>>>> +
>>>> + return true;
>>>> +#else
>>>> return this_cpu_inc_return(*(prog->active)) == 1;
>>>> +#endif
>>>> }
>>> Can preemption between the increment and check cause a counter leak on
>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
>>> (documented at include/linux/rcupdate.h:856).
>>>
>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>>>
>>> 1. Thread A increments active[0] to 1
>>> 2. Preemption occurs before Thread A reaches the check
>>> 3. Thread B on same CPU increments active[0] to 2
>>> 4. Thread B checks: sees 2 != BIT(0), returns false
>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
>>> 6. Both threads return false, neither runs BPF
>>> 7. Neither calls bpf_prog_put_recursion_context() (see
>>> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>>>
>>> The old atomic code handled this correctly because this_cpu_inc_return()
>>> completes atomically, ensuring Thread A reads the value 1 before Thread B
>>> can interfere. With non-atomic operations, Thread A increments but hasn't
>>> read the value yet, allowing Thread B to interfere and cause both threads
>>> to see the modified value.
>>>
>>> Is there a guarantee that same-context preemption cannot occur in the BPF
>>> execution path on ARM64, or does this need additional synchronization?
>> AI is correct here. See below for another thread:
>> https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev/T/#m906fd4502fbbedd4609c586122a393363003312a
>> where preempt_disable is necessary to prevent the above scenario.
> See my other reply, the above scenario presented by AI is wrong
> because step 7 is wrong.
Thanks for the explanation. I missed that bpf_trampoline_exit()
is always executed. In that case, the active[rctx] counter will
always be corrected.
>
>> But adding preempt_disable may impact the overall system-level performance.
>>
>> Does this patch improve performance for *all* ARM64 CPU versions?
>> Do you have numbers showing how much the performance improves?
> This should improve performance on all arm64 CPUs because atomics are
> expensive because they are atomic across all cpus.
Good to know. Thanks!
>
> I see a 33% improvement in the fentry trigger benchmark, but I can do
> more benchmarking.
>
>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>
>>>> static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
>>>> {
>>>> +#ifdef CONFIG_ARM64
>>>> + u8 rctx = interrupt_context_level();
>>>> + u8 *active = this_cpu_ptr(prog->active);
>>>> +
>>>> + barrier();
>>>> + active[rctx]--;
>>>> +#else
>>>> this_cpu_dec(*(prog->active));
>>>> +#endif
>>>> }
>> [...]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 18:56 ` Puranjay Mohan
@ 2025-12-17 21:31 ` Yonghong Song
2025-12-17 22:29 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Yonghong Song @ 2025-12-17 21:31 UTC (permalink / raw)
To: Puranjay Mohan, Alexei Starovoitov
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel, Chris Mason,
Ihor Solodrai
On 12/17/25 10:56 AM, Puranjay Mohan wrote:
> On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 2da986136d26..654fb94bf60c 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>> [ ... ]
>>>>
>>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>>
>>>>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>>>>> {
>>>>> +#ifdef CONFIG_ARM64
>>>>> + u8 rctx = interrupt_context_level();
>>>>> + u8 *active = this_cpu_ptr(prog->active);
>>>>> +
>>>>> + active[rctx]++;
>>>>> + barrier();
>>>>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
>>>>> + return false;
>>>>> +
>>>>> + return true;
>>>>> +#else
>>>>> return this_cpu_inc_return(*(prog->active)) == 1;
>>>>> +#endif
>>>>> }
>>>> Can preemption between the increment and check cause a counter leak on
>>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
>>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
>>>> (documented at include/linux/rcupdate.h:856).
>>>>
>>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>>>>
>>>> 1. Thread A increments active[0] to 1
>>>> 2. Preemption occurs before Thread A reaches the check
>>>> 3. Thread B on same CPU increments active[0] to 2
>>>> 4. Thread B checks: sees 2 != BIT(0), returns false
>>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
>>>> 6. Both threads return false, neither runs BPF
>>>> 7. Neither calls bpf_prog_put_recursion_context() (see
>>>> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
>>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>>> Step 7 is incorrect. Looking at the JIT-generated code, the exit
>>> function is ALWAYS called, regardless of whether the enter function
>>> returns 0 or a start time:
>>>
>>> // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
>>> call bpf_trampoline_enter() // Line 2998
>>> test rax, rax // Line 3006
>>> je skip_exec // Conditional jump
>>> ... BPF program execution ... // Lines 3011-3023
>>> skip_exec: // Line 3037 (jump lands here)
>>> call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
>>>
>>> The bpf_trampoline_exit() call is after the skip_exec label, so it
>>> executes in both cases.
>>>
>>> What Actually Happens:
>>>
>>> Initial state: active[0] = 0
>>>
>>> Thread A (normal context, rctx=0):
>>> 1. active[0]++ → active[0] = 1
>>> 2. Preempted before barrier()
>>>
>>> Thread B (scheduled on same CPU, normal context, rctx=0):
>>> 3. active[0]++ → active[0] = 2
>>> 4. barrier()
>>> 5. get_unaligned_le32(active) → reads 0x00000002
>>> 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
>>> 7. __bpf_prog_enter_recur returns 0
>>> 8. JIT checks return value, skips BPF execution
>>> 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
>>> arch/arm64/net/bpf_jit_comp.c:2362)
>>> 10. bpf_prog_put_recursion_context(prog) executes
>>> 11. barrier(), active[0]-- → active[0] = 1
>>>
>>> Thread A resumes:
>>> 12. barrier()
>>> 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
>>> decremented!)
>>> 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
>>> 15. __bpf_prog_enter_recur returns start_time
>>> 16. BPF program executes
>>> 17. __bpf_prog_exit_recur called
>>> 18. bpf_prog_put_recursion_context(prog) executes
>>> 19. barrier(), active[0]-- → active[0] = 0 ✓
>>>
>>> Final State
>>>
>>> - Counter returns to 0 ✓
>>> - No leak ✓
>>> - Thread B detected interference and aborted ✓
>>> - Thread A executed successfully ✓
>>> - Only ONE thread executed the BPF program ✓
>>>
>>>
>>> Now that I think of it, there is another race condition that leads to
>>> NEITHER program running:
>>>
>>> Consider this scenario on an arm64 system with PREEMPT_RCU:
>>>
>>> 1. Thread A increments active[0] from 0 to 1
>>> 2. Thread A is preempted before reaching barrier()
>>> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
>>> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
>>> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
>>> returns false
>>> 6. Both threads return false to __bpf_prog_enter_recur()
>>> 7. Both skip BPF program execution
>>> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
>>> 9. Neither BPF program executes, but the counter correctly returns to 0
>>>
>>> This means that, in the case of recursion, the patch changes the
>>> behaviour from "One program gets to run" to "At most one program
>>> gets to run", but given the performance benefits, I think we can
>>> accept this change.
>> Agree. It's fine, but we can mitigate it by doing this rctx trick
>> only when RCU is not preemptible. That would pretty much mean
>> that PREEMPT_RT will use the atomic and !RT will use rctx,
>> and this 'no prog executes' case will not happen.
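(Roughly, I read this suggestion as something like the sketch below --
hypothetical, not code from the series. With non-preemptible RCU the enter
path cannot be preempted between the increment and the check, so the
'no prog executes' window described above goes away:)

/* Hypothetical sketch of gating the rctx trick on !PREEMPT_RCU. */
static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
#if defined(CONFIG_ARM64) && !defined(CONFIG_PREEMPT_RCU)
	/* non-preemptible RCU: the enter path cannot be preempted mid-way */
	u8 rctx = interrupt_context_level();
	u8 *active = this_cpu_ptr(prog->active);

	active[rctx]++;
	barrier();
	return get_unaligned_le32(active) == BIT(rctx * 8);
#else
	/* preemptible RCU (e.g. PREEMPT_RT): keep the per-CPU atomic */
	return this_cpu_inc_return(*(prog->active)) == 1;
#endif
}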
>
> The issue also exists with sleepable programs: they use
> rcu_read_lock_trace() and can end up in the
> 'no prog executes' scenario.
>
> What do you think is the best approach for them?
For sleepable programs, maybe we can use the original approach like
return this_cpu_inc_return(*(prog->active)) == 1;
?
This should solve the 'no prog execution' issue.
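Something along these lines, perhaps (a hypothetical sketch; keying off
prog->sleepable is an assumption on my side, not code from the series):

/* Hypothetical sketch: sleepable progs keep the atomic counter. */
static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
#ifdef CONFIG_ARM64
	if (!prog->sleepable) {
		u8 rctx = interrupt_context_level();
		u8 *active = this_cpu_ptr(prog->active);

		active[rctx]++;
		barrier();
		return get_unaligned_le32(active) == BIT(rctx * 8);
	}
	/* sleepable: the rcu_read_lock_trace() section is preemptible, stay atomic */
#endif
	return this_cpu_inc_return(*(prog->active)) == 1;
}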
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
2025-12-17 21:31 ` Yonghong Song
@ 2025-12-17 22:29 ` Puranjay Mohan
0 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-17 22:29 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexei Starovoitov, bot+bpf-ci, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Kernel Team, Catalin Marinas,
Will Deacon, Mark Rutland, linux-arm-kernel, Chris Mason,
Ihor Solodrai
On Wed, Dec 17, 2025 at 9:32 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 12/17/25 10:56 AM, Puranjay Mohan wrote:
> > On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
> >>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >>>>> index 2da986136d26..654fb94bf60c 100644
> >>>>> --- a/include/linux/bpf.h
> >>>>> +++ b/include/linux/bpf.h
> >>>> [ ... ]
> >>>>
> >>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
> >>>>>
> >>>>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
> >>>>> {
> >>>>> +#ifdef CONFIG_ARM64
> >>>>> + u8 rctx = interrupt_context_level();
> >>>>> + u8 *active = this_cpu_ptr(prog->active);
> >>>>> +
> >>>>> + active[rctx]++;
> >>>>> + barrier();
> >>>>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
> >>>>> + return false;
> >>>>> +
> >>>>> + return true;
> >>>>> +#else
> >>>>> return this_cpu_inc_return(*(prog->active)) == 1;
> >>>>> +#endif
> >>>>> }
> >>>> Can preemption between the increment and check cause a counter leak on
> >>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
> >>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
> >>>> (documented at include/linux/rcupdate.h:856).
> >>>>
> >>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
> >>>>
> >>>> 1. Thread A increments active[0] to 1
> >>>> 2. Preemption occurs before Thread A reaches the check
> >>>> 3. Thread B on same CPU increments active[0] to 2
> >>>> 4. Thread B checks: sees 2 != BIT(0), returns false
> >>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
> >>>> 6. Both threads return false, neither runs BPF
> >>>> 7. Neither calls bpf_prog_put_recursion_context() (see
> >>>> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
> >>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
> >>> Step 7 is incorrect. Looking at the JIT-generated code, the exit
> >>> function is ALWAYS called, regardless of whether the enter function
> >>> returns 0 or a start time:
> >>>
> >>> // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
> >>> call bpf_trampoline_enter() // Line 2998
> >>> test rax, rax // Line 3006
> >>> je skip_exec // Conditional jump
> >>> ... BPF program execution ... // Lines 3011-3023
> >>> skip_exec: // Line 3037 (jump lands here)
> >>> call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
> >>>
> >>> The bpf_trampoline_exit() call is after the skip_exec label, so it
> >>> executes in both cases.
> >>>
> >>> What Actually Happens:
> >>>
> >>> Initial state: active[0] = 0
> >>>
> >>> Thread A (normal context, rctx=0):
> >>> 1. active[0]++ → active[0] = 1
> >>> 2. Preempted before barrier()
> >>>
> >>> Thread B (scheduled on same CPU, normal context, rctx=0):
> >>> 3. active[0]++ → active[0] = 2
> >>> 4. barrier()
> >>> 5. get_unaligned_le32(active) → reads 0x00000002
> >>> 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
> >>> 7. __bpf_prog_enter_recur returns 0
> >>> 8. JIT checks return value, skips BPF execution
> >>> 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
> >>> arch/arm64/net/bpf_jit_comp.c:2362)
> >>> 10. bpf_prog_put_recursion_context(prog) executes
> >>> 11. barrier(), active[0]-- → active[0] = 1
> >>>
> >>> Thread A resumes:
> >>> 12. barrier()
> >>> 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
> >>> decremented!)
> >>> 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
> >>> 15. __bpf_prog_enter_recur returns start_time
> >>> 16. BPF program executes
> >>> 17. __bpf_prog_exit_recur called
> >>> 18. bpf_prog_put_recursion_context(prog) executes
> >>> 19. barrier(), active[0]-- → active[0] = 0 ✓
> >>>
> >>> Final State
> >>>
> >>> - Counter returns to 0 ✓
> >>> - No leak ✓
> >>> - Thread B detected interference and aborted ✓
> >>> - Thread A executed successfully ✓
> >>> - Only ONE thread executed the BPF program ✓
> >>>
> >>>
> >>> Now that I think of it, there is another race condition that leads to
> >>> NEITHER program running:
> >>>
> >>> Consider this scenario on an arm64 system with PREEMPT_RCU:
> >>>
> >>> 1. Thread A increments active[0] from 0 to 1
> >>> 2. Thread A is preempted before reaching barrier()
> >>> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
> >>> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
> >>> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
> >>> returns false
> >>> 6. Both threads return false to __bpf_prog_enter_recur()
> >>> 7. Both skip BPF program execution
> >>> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
> >>> 9. Neither BPF program executes, but the counter correctly returns to 0
> >>>
> >>> This means that, in the case of recursion, the patch changes the
> >>> behaviour from "One program gets to run" to "At most one program
> >>> gets to run", but given the performance benefits, I think we can
> >>> accept this change.
> >> Agree. It's fine, but we can mitigate it by doing this rctx trick
> >> only when RCU is not preemptible. That would pretty much mean
> >> that PREEMPT_RT will use the atomic and !RT will use rctx,
> >> and this 'no prog executes' case will not happen.
> >
> > The issue also exists with sleepable programs: they use
> > rcu_read_lock_trace() and can end up in the
> > 'no prog executes' scenario.
> >
> > What do you think is the best approach for them?
>
> For sleepable programs, maybe we can use the original approach like
> return this_cpu_inc_return(*(prog->active)) == 1;
> ?
> This should solve the 'no prog execution' issue.
I tried putting preempt_disable()/preempt_enable() around the inc+read in
the enter path and around the dec in the exit path.
New data below, from a slightly different config, so ignore the exact values:
This patch: 56.524M/s
This patch with preempt_disable(): 53.856M/s
bpf-next/master: 43.067M/s
bpf-next/master without Catalin fix: 51.862M/s
This is still very good and covers the sleepable case as well. So let's do this?
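For reference, this is roughly what I benchmarked (a sketch rather than the
exact diff; the exit helper is reconstructed from the earlier discussion):

/* Sketch of the preempt_disable() variant, not the exact code tested. */
static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
{
#ifdef CONFIG_ARM64
	u8 rctx = interrupt_context_level();
	u8 *active;
	bool ok;

	preempt_disable();	/* inc + read happen without another task in between */
	active = this_cpu_ptr(prog->active);
	active[rctx]++;
	barrier();
	ok = get_unaligned_le32(active) == BIT(rctx * 8);
	preempt_enable();

	return ok;
#else
	return this_cpu_inc_return(*(prog->active)) == 1;
#endif
}

static inline void bpf_prog_put_recursion_context(struct bpf_prog *prog)
{
#ifdef CONFIG_ARM64
	u8 rctx = interrupt_context_level();
	u8 *active;

	preempt_disable();	/* mirror the enter side */
	active = this_cpu_ptr(prog->active);
	barrier();
	active[rctx]--;
	preempt_enable();
#else
	this_cpu_dec(*(prog->active));
#endif
}

The preempt_disable()/preempt_enable() pair only needs to cover the inc+check
(and the dec on exit); the BPF program itself still runs with preemption
enabled.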
Thanks,
Puranjay
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-12-17 22:29 UTC | newest]
Thread overview: 13+ messages
2025-12-17 16:28 [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
2025-12-17 16:56 ` bot+bpf-ci
2025-12-17 18:13 ` Puranjay Mohan
2025-12-17 18:46 ` Alexei Starovoitov
2025-12-17 18:56 ` Puranjay Mohan
2025-12-17 21:31 ` Yonghong Song
2025-12-17 22:29 ` Puranjay Mohan
2025-12-17 18:23 ` Yonghong Song
2025-12-17 18:44 ` Puranjay Mohan
2025-12-17 18:47 ` Alexei Starovoitov
2025-12-17 19:45 ` Yonghong Song