From: Yonghong Song <yonghong.song@linux.dev>
To: Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bot+bpf-ci@kernel.org, bpf <bpf@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Kernel Team <kernel-team@meta.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Chris Mason <clm@meta.com>,
Ihor Solodrai <ihor.solodrai@linux.dev>
Subject: Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
Date: Wed, 17 Dec 2025 13:31:58 -0800 [thread overview]
Message-ID: <a0c04178-2159-4475-9be8-93320ffc2138@linux.dev> (raw)
In-Reply-To: <CANk7y0jNj0SDOBr=3n_0jhQbLzaj--yVUF4oDA-ManQG-=bkhw@mail.gmail.com>
On 12/17/25 10:56 AM, Puranjay Mohan wrote:
> On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 2da986136d26..654fb94bf60c 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>> [ ... ]
>>>>
>>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>>
>>>>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>>>>> {
>>>>> +#ifdef CONFIG_ARM64
>>>>> + u8 rctx = interrupt_context_level();
>>>>> + u8 *active = this_cpu_ptr(prog->active);
>>>>> +
>>>>> + active[rctx]++;
>>>>> + barrier();
>>>>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
>>>>> + return false;
>>>>> +
>>>>> + return true;
>>>>> +#else
>>>>> return this_cpu_inc_return(*(prog->active)) == 1;
>>>>> +#endif
>>>>> }
>>>> Can preemption between the increment and check cause a counter leak on
>>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
>>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
>>>> (documented at include/linux/rcupdate.h:856).
>>>>
>>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>>>>
>>>> 1. Thread A increments active[0] to 1
>>>> 2. Preemption occurs before Thread A reaches the check
>>>> 3. Thread B on same CPU increments active[0] to 2
>>>> 4. Thread B checks: sees 2 != BIT(0), returns false
>>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
>>>> 6. Both threads return false, neither runs BPF
>>>> 7. Neither calls bpf_prog_put_recursion_context() (see
>>>> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
>>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>>> Step 7 is incorrect. Looking at the JIT-generated code, the exit
>>> function is ALWAYS called, regardless of whether the enter function
>>> returns 0 or a start time:
>>>
>>> // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
>>> call bpf_trampoline_enter() // Line 2998
>>> test rax, rax // Line 3006
>>> je skip_exec // Conditional jump
>>> ... BPF program execution ... // Lines 3011-3023
>>> skip_exec: // Line 3037 (jump lands here)
>>> call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
>>>
>>> The bpf_trampoline_exit() call is after the skip_exec label, so it
>>> executes in both cases.
>>>
>>> What Actually Happens:
>>>
>>> Initial state: active[0] = 0
>>>
>>> Thread A (normal context, rctx=0):
>>> 1. active[0]++ → active[0] = 1
>>> 2. Preempted before barrier()
>>>
>>> Thread B (scheduled on same CPU, normal context, rctx=0):
>>> 3. active[0]++ → active[0] = 2
>>> 4. barrier()
>>> 5. get_unaligned_le32(active) → reads 0x00000002
>>> 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
>>> 7. __bpf_prog_enter_recur returns 0
>>> 8. JIT checks return value, skips BPF execution
>>> 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
>>> arch/arm64/net/bpf_jit_comp.c:2362)
>>> 10. bpf_prog_put_recursion_context(prog) executes
>>> 11. barrier(), active[0]-- → active[0] = 1
>>>
>>> Thread A resumes:
>>> 12. barrier()
>>> 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
>>> decremented!)
>>> 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
>>> 15. __bpf_prog_enter_recur returns start_time
>>> 16. BPF program executes
>>> 17. __bpf_prog_exit_recur called
>>> 18. bpf_prog_put_recursion_context(prog) executes
>>> 19. barrier(), active[0]-- → active[0] = 0 ✓
>>>
>>> Final State
>>>
>>> - Counter returns to 0 ✓
>>> - No leak ✓
>>> - Thread B detected interference and aborted ✓
>>> - Thread A executed successfully ✓
>>> - Only ONE thread executed the BPF program ✓
>>>
>>>
>>> Now that I think of it, there is another race condition that leads to
>>> NEITHER program running:
>>>
>>> Consider this scenario on an arm64 system with PREEMPT_RCU:
>>>
>>> 1. Thread A increments active[0] from 0 to 1
>>> 2. Thread A is preempted before reaching barrier()
>>> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
>>> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
>>> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
>>> returns false
>>> 6. Both threads return false to __bpf_prog_enter_recur()
>>> 7. Both skip BPF program execution
>>> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
>>> 9. Neither BPF program executes, but the counter correctly returns to 0
>>>
>>> This means the patch is changing the behaviour in case of recursion
>>> from "One program gets to run" to
>>> "At most one program gets to run", but given the performance benefits,
>>> I think we can accept this change.
>> Agree. It's fine, but we can mitigate it, but doing this rctx trick
>> only when RCU is not preemptable. Which pretty much would mean
>> that PREEMPT_RT will use atomic and !RT will use rctx
>> and this 'no prog executes' will not happen.
>
> The issue is also with sleepable programs, they use
> rcu_read_lock_trace() and can end up with
> 'no prog executes' scenario.
>
> What do you think is the best approach for them?
For sleepable programs, maybe we can use the original approach like
return this_cpu_inc_return(*(prog->active)) == 1;
?
This should solve the 'no prog execution' issue.
next prev parent reply other threads:[~2025-12-17 21:32 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-17 16:28 [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
2025-12-17 16:56 ` bot+bpf-ci
2025-12-17 18:13 ` Puranjay Mohan
2025-12-17 18:46 ` Alexei Starovoitov
2025-12-17 18:56 ` Puranjay Mohan
2025-12-17 21:31 ` Yonghong Song [this message]
2025-12-17 22:29 ` Puranjay Mohan
2025-12-17 18:23 ` Yonghong Song
2025-12-17 18:44 ` Puranjay Mohan
2025-12-17 18:47 ` Alexei Starovoitov
2025-12-17 19:45 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a0c04178-2159-4475-9be8-93320ffc2138@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bot+bpf-ci@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=catalin.marinas@arm.com \
--cc=clm@meta.com \
--cc=daniel@iogearbox.net \
--cc=eddyz87@gmail.com \
--cc=ihor.solodrai@linux.dev \
--cc=kernel-team@meta.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=mark.rutland@arm.com \
--cc=martin.lau@kernel.org \
--cc=memxor@gmail.com \
--cc=puranjay12@gmail.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.