From: Yonghong Song <yonghong.song@linux.dev>
To: Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bot+bpf-ci@kernel.org, bpf <bpf@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Kernel Team <kernel-team@meta.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Chris Mason <clm@meta.com>,
Ihor Solodrai <ihor.solodrai@linux.dev>
Subject: Re: [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics
Date: Wed, 17 Dec 2025 13:31:58 -0800 [thread overview]
Message-ID: <a0c04178-2159-4475-9be8-93320ffc2138@linux.dev> (raw)
In-Reply-To: <CANk7y0jNj0SDOBr=3n_0jhQbLzaj--yVUF4oDA-ManQG-=bkhw@mail.gmail.com>
On 12/17/25 10:56 AM, Puranjay Mohan wrote:
> On Wed, Dec 17, 2025 at 6:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Dec 17, 2025 at 10:13 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>> On Wed, Dec 17, 2025 at 4:56 PM <bot+bpf-ci@kernel.org> wrote:
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 2da986136d26..654fb94bf60c 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>> [ ... ]
>>>>
>>>>> @@ -2006,12 +2009,32 @@ struct bpf_struct_ops_common_value {
>>>>>
>>>>> static inline bool bpf_prog_get_recursion_context(struct bpf_prog *prog)
>>>>> {
>>>>> +#ifdef CONFIG_ARM64
>>>>> + u8 rctx = interrupt_context_level();
>>>>> + u8 *active = this_cpu_ptr(prog->active);
>>>>> +
>>>>> + active[rctx]++;
>>>>> + barrier();
>>>>> + if (get_unaligned_le32(active) != BIT(rctx * 8))
>>>>> + return false;
>>>>> +
>>>>> + return true;
>>>>> +#else
>>>>> return this_cpu_inc_return(*(prog->active)) == 1;
>>>>> +#endif
>>>>> }
>>>> Can preemption between the increment and check cause a counter leak on
>>>> CONFIG_PREEMPT_RCU systems? The trampoline calls this function under
>>>> rcu_read_lock_dont_migrate() which allows preemption on PREEMPT_RCU
>>>> (documented at include/linux/rcupdate.h:856).
>>>>
>>>> Consider this scenario on an ARM64 system with PREEMPT_RCU:
>>>>
>>>> 1. Thread A increments active[0] to 1
>>>> 2. Preemption occurs before Thread A reaches the check
>>>> 3. Thread B on same CPU increments active[0] to 2
>>>> 4. Thread B checks: sees 2 != BIT(0), returns false
>>>> 5. Thread A resumes, checks: sees 2 != BIT(0), returns false
>>>> 6. Both threads return false, neither runs BPF
>>>> 7. Neither calls bpf_prog_put_recursion_context() (see
>>>> __bpf_prog_enter_recur() at kernel/bpf/trampoline.c:952)
>>>> 8. Counter permanently stuck at 2, all future BPF on this CPU fails
>>> Step 7 is incorrect. Looking at the JIT-generated code, the exit
>>> function is ALWAYS called, regardless of whether the enter function
>>> returns 0 or a start time:
>>>
>>> // x86 JIT at arch/x86/net/bpf_jit_comp.c:2998-3050
>>> call bpf_trampoline_enter() // Line 2998
>>> test rax, rax // Line 3006
>>> je skip_exec // Conditional jump
>>> ... BPF program execution ... // Lines 3011-3023
>>> skip_exec: // Line 3037 (jump lands here)
>>> call bpf_trampoline_exit() // Line 3049 - ALWAYS executed
>>>
>>> The bpf_trampoline_exit() call is after the skip_exec label, so it
>>> executes in both cases.
>>>
>>> What Actually Happens:
>>>
>>> Initial state: active[0] = 0
>>>
>>> Thread A (normal context, rctx=0):
>>> 1. active[0]++ → active[0] = 1
>>> 2. Preempted before barrier()
>>>
>>> Thread B (scheduled on same CPU, normal context, rctx=0):
>>> 3. active[0]++ → active[0] = 2
>>> 4. barrier()
>>> 5. get_unaligned_le32(active) → reads 0x00000002
>>> 6. Check: 0x00000002 != BIT(0) = 0x00000001 → returns false
>>> 7. __bpf_prog_enter_recur returns 0
>>> 8. JIT checks return value, skips BPF execution
>>> 9. JIT ALWAYS calls __bpf_prog_exit_recur (see
>>> arch/arm64/net/bpf_jit_comp.c:2362)
>>> 10. bpf_prog_put_recursion_context(prog) executes
>>> 11. barrier(), active[0]-- → active[0] = 1
>>>
>>> Thread A resumes:
>>> 12. barrier()
>>> 13. get_unaligned_le32(active) → reads 0x00000001 (Thread B already
>>> decremented!)
>>> 14. Check: 0x00000001 == BIT(0) = 0x00000001 → returns true ✓
>>> 15. __bpf_prog_enter_recur returns start_time
>>> 16. BPF program executes
>>> 17. __bpf_prog_exit_recur called
>>> 18. bpf_prog_put_recursion_context(prog) executes
>>> 19. barrier(), active[0]-- → active[0] = 0 ✓
>>>
>>> Final State
>>>
>>> - Counter returns to 0 ✓
>>> - No leak ✓
>>> - Thread B detected interference and aborted ✓
>>> - Thread A executed successfully ✓
>>> - Only ONE thread executed the BPF program ✓
>>>
>>>
>>> Now that I think of it, there is another race condition that leads to
>>> NEITHER program running:
>>>
>>> Consider this scenario on an arm64 system with PREEMPT_RCU:
>>>
>>> 1. Thread A increments active[0] from 0 to 1
>>> 2. Thread A is preempted before reaching barrier()
>>> 3. Thread B (same CPU, same context) increments active[0] from 1 to 2
>>> 4. Thread B executes barrier() and checks: sees 2 != BIT(0), returns false
>>> 5. Thread A resumes, executes barrier() and checks: sees 2 != BIT(0),
>>> returns false
>>> 6. Both threads return false to __bpf_prog_enter_recur()
>>> 7. Both skip BPF program execution
>>> 8. Both call bpf_prog_put_recursion_context() and decrement: 2->1->0
>>> 9. Neither BPF program executes, but the counter correctly returns to 0
>>>
>>> This means the patch is changing the behaviour in case of recursion
>>> from "One program gets to run" to
>>> "At most one program gets to run", but given the performance benefits,
>>> I think we can accept this change.
>> Agree. It's fine, but we can mitigate it, but doing this rctx trick
>> only when RCU is not preemptable. Which pretty much would mean
>> that PREEMPT_RT will use atomic and !RT will use rctx
>> and this 'no prog executes' will not happen.
>
> The issue is also with sleepable programs, they use
> rcu_read_lock_trace() and can end up with
> 'no prog executes' scenario.
>
> What do you think is the best approach for them?
For sleepable programs, maybe we can use the original approach like
return this_cpu_inc_return(*(prog->active)) == 1;
?
This should solve the 'no prog execution' issue.
next prev parent reply other threads:[~2025-12-17 21:32 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-17 16:28 [PATCH bpf-next 0/2] bpf: Optimize recursion detection on arm64 Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 1/2] bpf: move recursion detection logic to helpers Puranjay Mohan
2025-12-17 16:28 ` [PATCH bpf-next 2/2] bpf: arm64: Optimize recursion detection by not using atomics Puranjay Mohan
2025-12-17 16:56 ` bot+bpf-ci
2025-12-17 18:13 ` Puranjay Mohan
2025-12-17 18:46 ` Alexei Starovoitov
2025-12-17 18:56 ` Puranjay Mohan
2025-12-17 21:31 ` Yonghong Song [this message]
2025-12-17 22:29 ` Puranjay Mohan
2025-12-17 18:23 ` Yonghong Song
2025-12-17 18:44 ` Puranjay Mohan
2025-12-17 18:47 ` Alexei Starovoitov
2025-12-17 19:45 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a0c04178-2159-4475-9be8-93320ffc2138@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bot+bpf-ci@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=catalin.marinas@arm.com \
--cc=clm@meta.com \
--cc=daniel@iogearbox.net \
--cc=eddyz87@gmail.com \
--cc=ihor.solodrai@linux.dev \
--cc=kernel-team@meta.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=mark.rutland@arm.com \
--cc=martin.lau@kernel.org \
--cc=memxor@gmail.com \
--cc=puranjay12@gmail.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).