Are BPF programs preemptible?

All of lore.kernel.org
 help / color / mirror / Atom feed

* Are BPF programs preemptible?
@ 2023-01-23  9:21 Yaniv Agman
  2023-01-23 10:46 ` Jakub Sitnicki
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-23  9:21 UTC (permalink / raw)
  To: bpf

Hello!

Several places state that eBPF programs cannot be preempted by the
kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however,
I did see a strange behavior where an eBPF percpu map gets overridden,
and I'm trying to figure out if it's due to a bug in my program or
some misunderstanding I have about eBPF. What caught my eye was a
sentence in a LWN article (https://lwn.net/Articles/812503/) that
says: "Alexei thankfully enlightened me recently over a beer that the
real intent here is to guarantee that the program runs to completion
on the same CPU where it started".

So my question is - are BPF programs guaranteed to run from start to
end without being interrupted at all or the only guarantee I get is
that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can
interrupt their run?

If the only guarantee is no migration, it means that a percpu map
cannot be safely used by two different BPF programs that can preempt
each other (e.g. some kprobe and a network cgroup program).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23  9:21 Are BPF programs preemptible? Yaniv Agman
@ 2023-01-23 10:46 ` Jakub Sitnicki
  2023-01-23 12:30   ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Sitnicki @ 2023-01-23 10:46 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: bpf

On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote:
> Hello!
>
> Several places state that eBPF programs cannot be preempted by the
> kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however,
> I did see a strange behavior where an eBPF percpu map gets overridden,
> and I'm trying to figure out if it's due to a bug in my program or
> some misunderstanding I have about eBPF. What caught my eye was a
> sentence in a LWN article (https://lwn.net/Articles/812503/) that
> says: "Alexei thankfully enlightened me recently over a beer that the
> real intent here is to guarantee that the program runs to completion
> on the same CPU where it started".
>
> So my question is - are BPF programs guaranteed to run from start to
> end without being interrupted at all or the only guarantee I get is
> that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can
> interrupt their run?
>
> If the only guarantee is no migration, it means that a percpu map
> cannot be safely used by two different BPF programs that can preempt
> each other (e.g. some kprobe and a network cgroup program).

Since v5.7 BPF program runners use migrate_disable() instead of
preempt_disable(). See commit 2a916f2f546c ("bpf: Use
migrate_disable/enable in array macros and cgroup/lirc code.") [1].

But at that time migrate_disable() was merely an alias for
preempt_disable() on !CONFIG_PREEMPT_RT kernels.

Since v5.11 migrate_disable() does no longer disable preemption on
!CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make
migrate_disable/enable() independent of RT") [2].

So, yes, you are right, but it depends on the kernel version.

PS. The migrate_disable vs per-CPU data problem is also covered in [3].

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0
[3]: https://lwn.net/Articles/836503/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 10:46 ` Jakub Sitnicki
@ 2023-01-23 12:30   ` Yaniv Agman
  2023-01-23 17:02     ` Yonghong Song
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-23 12:30 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: bpf

Ok, thanks Jakub for the answer and references.
I must say that I am very surprised though. First, most of the
documentation for BPF says that preemption is disabled, like the
reference I gave [1] and even the bpf-helpers man page [2] says "Note
that all programs run with preemption disabled..." for the
bpf_get_smp_processor_id() helper. I think this is something that
deserves more attention since many eBPF developers are still under the
assumption that eBPF programs are non-preemptible, and running their
programs on newer kernels might be broken.

I'm trying to figure out how I can solve this issue in our case - is
it correct to assume that no more than one preemption can happen
during a run of my bpf program? If so, I can try to write a percpu
buffer with 2 entries, and give the second entry to the program that
interrupted the first one. But even then, I will need to find a way to
know if my program currently interrupts the run of another program -
is there a way to do that? Maybe checking if the current context is of
an interrupt, can this be done? Any other suggestions to solve this
problem?

[1]: https://docs.cilium.io/en/latest/bpf/toolchain
[2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

Thanks,
Yaniv

‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת ‪Jakub Sitnicki‬‏
<‪jakub@cloudflare.com‬‏>:‬
>
> On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote:
> > Hello!
> >
> > Several places state that eBPF programs cannot be preempted by the
> > kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however,
> > I did see a strange behavior where an eBPF percpu map gets overridden,
> > and I'm trying to figure out if it's due to a bug in my program or
> > some misunderstanding I have about eBPF. What caught my eye was a
> > sentence in a LWN article (https://lwn.net/Articles/812503/) that
> > says: "Alexei thankfully enlightened me recently over a beer that the
> > real intent here is to guarantee that the program runs to completion
> > on the same CPU where it started".
> >
> > So my question is - are BPF programs guaranteed to run from start to
> > end without being interrupted at all or the only guarantee I get is
> > that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can
> > interrupt their run?
> >
> > If the only guarantee is no migration, it means that a percpu map
> > cannot be safely used by two different BPF programs that can preempt
> > each other (e.g. some kprobe and a network cgroup program).
>
> Since v5.7 BPF program runners use migrate_disable() instead of
> preempt_disable(). See commit 2a916f2f546c ("bpf: Use
> migrate_disable/enable in array macros and cgroup/lirc code.") [1].
>
> But at that time migrate_disable() was merely an alias for
> preempt_disable() on !CONFIG_PREEMPT_RT kernels.
>
> Since v5.11 migrate_disable() does no longer disable preemption on
> !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make
> migrate_disable/enable() independent of RT") [2].
>
> So, yes, you are right, but it depends on the kernel version.
>
> PS. The migrate_disable vs per-CPU data problem is also covered in [3].
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0
> [3]: https://lwn.net/Articles/836503/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 12:30   ` Yaniv Agman
@ 2023-01-23 17:02     ` Yonghong Song
  2023-01-23 17:32       ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Yonghong Song @ 2023-01-23 17:02 UTC (permalink / raw)
  To: Yaniv Agman, Jakub Sitnicki; +Cc: bpf



On 1/23/23 4:30 AM, Yaniv Agman wrote:
> Ok, thanks Jakub for the answer and references.
> I must say that I am very surprised though. First, most of the
> documentation for BPF says that preemption is disabled, like the
> reference I gave [1] and even the bpf-helpers man page [2] says "Note
> that all programs run with preemption disabled..." for the
> bpf_get_smp_processor_id() helper. I think this is something that
> deserves more attention since many eBPF developers are still under the
> assumption that eBPF programs are non-preemptible, and running their
> programs on newer kernels might be broken.

It would be great if you can send a patch to fix these
out-dated comments!

> 
> I'm trying to figure out how I can solve this issue in our case - is
> it correct to assume that no more than one preemption can happen
> during a run of my bpf program? If so, I can try to write a percpu

No. It is possible that you have more than one preemption during the
same prog run. There is no restriction on this.

> buffer with 2 entries, and give the second entry to the program that
> interrupted the first one. But even then, I will need to find a way to
> know if my program currently interrupts the run of another program -
> is there a way to do that? Maybe checking if the current context is of
> an interrupt, can this be done? Any other suggestions to solve this
> problem?
> 
> [1]: https://docs.cilium.io/en/latest/bpf/toolchain
> [2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
> 
> Thanks,
> Yaniv
> 
> ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת ‪Jakub Sitnicki‬‏
> <‪jakub@cloudflare.com‬‏>:‬
>>
>> On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote:
>>> Hello!
>>>
>>> Several places state that eBPF programs cannot be preempted by the
>>> kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain ), however,
>>> I did see a strange behavior where an eBPF percpu map gets overridden,
>>> and I'm trying to figure out if it's due to a bug in my program or
>>> some misunderstanding I have about eBPF. What caught my eye was a
>>> sentence in a LWN article (https://lwn.net/Articles/812503/ ) that
>>> says: "Alexei thankfully enlightened me recently over a beer that the
>>> real intent here is to guarantee that the program runs to completion
>>> on the same CPU where it started".
>>>
>>> So my question is - are BPF programs guaranteed to run from start to
>>> end without being interrupted at all or the only guarantee I get is
>>> that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can
>>> interrupt their run?
>>>
>>> If the only guarantee is no migration, it means that a percpu map
>>> cannot be safely used by two different BPF programs that can preempt
>>> each other (e.g. some kprobe and a network cgroup program).
>>
>> Since v5.7 BPF program runners use migrate_disable() instead of
>> preempt_disable(). See commit 2a916f2f546c ("bpf: Use
>> migrate_disable/enable in array macros and cgroup/lirc code.") [1].
>>
>> But at that time migrate_disable() was merely an alias for
>> preempt_disable() on !CONFIG_PREEMPT_RT kernels.
>>
>> Since v5.11 migrate_disable() does no longer disable preemption on
>> !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make
>> migrate_disable/enable() independent of RT") [2].
>>
>> So, yes, you are right, but it depends on the kernel version.
>>
>> PS. The migrate_disable vs per-CPU data problem is also covered in [3].
>>
>> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb
>> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0
>> [3]: https://lwn.net/Articles/836503/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 17:02     ` Yonghong Song
@ 2023-01-23 17:32       ` Yaniv Agman
  2023-01-23 20:06         ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-23 17:32 UTC (permalink / raw)
  To: Yonghong Song; +Cc: Jakub Sitnicki, bpf

‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-19:02 מאת ‪Yonghong Song‬‏ <‪yhs@meta.com‬‏>:‬
>
>
>
> On 1/23/23 4:30 AM, Yaniv Agman wrote:
> > Ok, thanks Jakub for the answer and references.
> > I must say that I am very surprised though. First, most of the
> > documentation for BPF says that preemption is disabled, like the
> > reference I gave [1] and even the bpf-helpers man page [2] says "Note
> > that all programs run with preemption disabled..." for the
> > bpf_get_smp_processor_id() helper. I think this is something that
> > deserves more attention since many eBPF developers are still under the
> > assumption that eBPF programs are non-preemptible, and running their
> > programs on newer kernels might be broken.
>
> It would be great if you can send a patch to fix these
> out-dated comments!
>
> >
> > I'm trying to figure out how I can solve this issue in our case - is
> > it correct to assume that no more than one preemption can happen
> > during a run of my bpf program? If so, I can try to write a percpu
>
> No. It is possible that you have more than one preemption during the
> same prog run. There is no restriction on this.
>

Any other suggestions for a solution to this problem?
For us, it is a regression caused by this change, but I didn't find
any proper alternative to use

> > buffer with 2 entries, and give the second entry to the program that
> > interrupted the first one. But even then, I will need to find a way to
> > know if my program currently interrupts the run of another program -
> > is there a way to do that? Maybe checking if the current context is of
> > an interrupt, can this be done? Any other suggestions to solve this
> > problem?
> >
> > [1]: https://docs.cilium.io/en/latest/bpf/toolchain
> > [2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
> >
> > Thanks,
> > Yaniv
> >
> > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת ‪Jakub Sitnicki‬‏
> > <‪jakub@cloudflare.com‬‏>:‬
> >>
> >> On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote:
> >>> Hello!
> >>>
> >>> Several places state that eBPF programs cannot be preempted by the
> >>> kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain ), however,
> >>> I did see a strange behavior where an eBPF percpu map gets overridden,
> >>> and I'm trying to figure out if it's due to a bug in my program or
> >>> some misunderstanding I have about eBPF. What caught my eye was a
> >>> sentence in a LWN article (https://lwn.net/Articles/812503/ ) that
> >>> says: "Alexei thankfully enlightened me recently over a beer that the
> >>> real intent here is to guarantee that the program runs to completion
> >>> on the same CPU where it started".
> >>>
> >>> So my question is - are BPF programs guaranteed to run from start to
> >>> end without being interrupted at all or the only guarantee I get is
> >>> that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can
> >>> interrupt their run?
> >>>
> >>> If the only guarantee is no migration, it means that a percpu map
> >>> cannot be safely used by two different BPF programs that can preempt
> >>> each other (e.g. some kprobe and a network cgroup program).
> >>
> >> Since v5.7 BPF program runners use migrate_disable() instead of
> >> preempt_disable(). See commit 2a916f2f546c ("bpf: Use
> >> migrate_disable/enable in array macros and cgroup/lirc code.") [1].
> >>
> >> But at that time migrate_disable() was merely an alias for
> >> preempt_disable() on !CONFIG_PREEMPT_RT kernels.
> >>
> >> Since v5.11 migrate_disable() does no longer disable preemption on
> >> !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make
> >> migrate_disable/enable() independent of RT") [2].
> >>
> >> So, yes, you are right, but it depends on the kernel version.
> >>
> >> PS. The migrate_disable vs per-CPU data problem is also covered in [3].
> >>
> >> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb
> >> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0
> >> [3]: https://lwn.net/Articles/836503/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 17:32       ` Yaniv Agman
@ 2023-01-23 20:06         ` Martin KaFai Lau
  2023-01-23 21:01           ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2023-01-23 20:06 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, bpf, Yonghong Song

On 1/23/23 9:32 AM, Yaniv Agman wrote:
>>> interrupted the first one. But even then, I will need to find a way to
>>> know if my program currently interrupts the run of another program -
>>> is there a way to do that?
May be a percpu atomic counter to see if the bpf prog has been re-entered on the 
same cpu.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 20:06         ` Martin KaFai Lau
@ 2023-01-23 21:01           ` Yaniv Agman
  2023-01-23 21:22             ` Jakub Sitnicki
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-23 21:01 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Jakub Sitnicki, bpf, Yonghong Song

‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
<‪martin.lau@linux.dev‬‏>:‬
>
> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> >>> interrupted the first one. But even then, I will need to find a way to
> >>> know if my program currently interrupts the run of another program -
> >>> is there a way to do that?
> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> same cpu.

Not sure I understand how this will help. If I want to save local
program data on a percpu map and I see that the counter is bigger then
zero, should I ignore the event?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 21:01           ` Yaniv Agman
@ 2023-01-23 21:22             ` Jakub Sitnicki
  2023-01-23 21:56               ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Sitnicki @ 2023-01-23 21:22 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Martin KaFai Lau, bpf, Yonghong Song

On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> <‪martin.lau@linux.dev‬‏>:‬
>>
>> On 1/23/23 9:32 AM, Yaniv Agman wrote:
>> >>> interrupted the first one. But even then, I will need to find a way to
>> >>> know if my program currently interrupts the run of another program -
>> >>> is there a way to do that?
>> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
>> same cpu.
>
> Not sure I understand how this will help. If I want to save local
> program data on a percpu map and I see that the counter is bigger then
> zero, should I ignore the event?

map_update w/ BPF_F_LOCK disables preemption, if you're after updating
an entry atomically. But it can't be used with PERCPU maps today.
Perhaps that's needed now too.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 21:22             ` Jakub Sitnicki
@ 2023-01-23 21:56               ` Yaniv Agman
  2023-01-24 12:30                 ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-23 21:56 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
<‪jakub@cloudflare.com‬‏>:‬
>
> On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > <‪martin.lau@linux.dev‬‏>:‬
> >>
> >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> >> >>> interrupted the first one. But even then, I will need to find a way to
> >> >>> know if my program currently interrupts the run of another program -
> >> >>> is there a way to do that?
> >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> >> same cpu.
> >
> > Not sure I understand how this will help. If I want to save local
> > program data on a percpu map and I see that the counter is bigger then
> > zero, should I ignore the event?
>
> map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> an entry atomically. But it can't be used with PERCPU maps today.
> Perhaps that's needed now too.

Yep. I think what is needed here is the ability to disable preemption
from the bpf program - maybe even adding a helper for that?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-23 21:56               ` Yaniv Agman
@ 2023-01-24 12:30                 ` Alexei Starovoitov
  2023-01-24 15:47                   ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-24 12:30 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> <‪jakub@cloudflare.com‬‏>:‬
> >
> > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > <‪martin.lau@linux.dev‬‏>:‬
> > >>
> > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > >> >>> interrupted the first one. But even then, I will need to find a way to
> > >> >>> know if my program currently interrupts the run of another program -
> > >> >>> is there a way to do that?
> > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > >> same cpu.
> > >
> > > Not sure I understand how this will help. If I want to save local
> > > program data on a percpu map and I see that the counter is bigger then
> > > zero, should I ignore the event?
> >
> > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > an entry atomically. But it can't be used with PERCPU maps today.
> > Perhaps that's needed now too.
>
> Yep. I think what is needed here is the ability to disable preemption
> from the bpf program - maybe even adding a helper for that?

I'm not sure what the issue is here.
Old preempt_disable() doesn't mean that one bpf program won't ever
be interrupted by another bpf prog.
Like networking bpf prog in old preempt_disable can call into something
where there is a kprobe and another tracing bpf prog will be called.
The same can happen after we switched to migrate_disable.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-24 12:30                 ` Alexei Starovoitov
@ 2023-01-24 15:47                   ` Yaniv Agman
  2023-01-24 17:24                     ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-24 15:47 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > <‪jakub@cloudflare.com‬‏>:‬
> > >
> > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > <‪martin.lau@linux.dev‬‏>:‬
> > > >>
> > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > >> >>> know if my program currently interrupts the run of another program -
> > > >> >>> is there a way to do that?
> > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > >> same cpu.
> > > >
> > > > Not sure I understand how this will help. If I want to save local
> > > > program data on a percpu map and I see that the counter is bigger then
> > > > zero, should I ignore the event?
> > >
> > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > an entry atomically. But it can't be used with PERCPU maps today.
> > > Perhaps that's needed now too.
> >
> > Yep. I think what is needed here is the ability to disable preemption
> > from the bpf program - maybe even adding a helper for that?
>
> I'm not sure what the issue is here.
> Old preempt_disable() doesn't mean that one bpf program won't ever
> be interrupted by another bpf prog.
> Like networking bpf prog in old preempt_disable can call into something
> where there is a kprobe and another tracing bpf prog will be called.
> The same can happen after we switched to migrate_disable.

One difference here is that in what you describe the programmer can
know in advance which functions might call others and avoid that or
use other percpu maps, but if preemption can happen between functions
which are not related to one another (don't have a relation of caller
and callee), then the programmer can't have control over it

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-24 15:47                   ` Yaniv Agman
@ 2023-01-24 17:24                     ` Alexei Starovoitov
  2023-01-24 17:38                       ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-24 17:24 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> <‪alexei.starovoitov@gmail.com‬‏>:‬
> >
> > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > >
> > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > <‪jakub@cloudflare.com‬‏>:‬
> > > >
> > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > >>
> > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > >> >>> know if my program currently interrupts the run of another program -
> > > > >> >>> is there a way to do that?
> > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > >> same cpu.
> > > > >
> > > > > Not sure I understand how this will help. If I want to save local
> > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > zero, should I ignore the event?
> > > >
> > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > Perhaps that's needed now too.
> > >
> > > Yep. I think what is needed here is the ability to disable preemption
> > > from the bpf program - maybe even adding a helper for that?
> >
> > I'm not sure what the issue is here.
> > Old preempt_disable() doesn't mean that one bpf program won't ever
> > be interrupted by another bpf prog.
> > Like networking bpf prog in old preempt_disable can call into something
> > where there is a kprobe and another tracing bpf prog will be called.
> > The same can happen after we switched to migrate_disable.
>
> One difference here is that in what you describe the programmer can
> know in advance which functions might call others and avoid that or
> use other percpu maps, but if preemption can happen between functions
> which are not related to one another (don't have a relation of caller
> and callee), then the programmer can't have control over it

Could you give a specific example?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-24 17:24                     ` Alexei Starovoitov
@ 2023-01-24 17:38                       ` Yaniv Agman
  2023-01-25  0:04                         ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-24 17:38 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > >
> > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > >
> > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > > <‪jakub@cloudflare.com‬‏>:‬
> > > > >
> > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > > >>
> > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > >> >>> is there a way to do that?
> > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > >> same cpu.
> > > > > >
> > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > > zero, should I ignore the event?
> > > > >
> > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > Perhaps that's needed now too.
> > > >
> > > > Yep. I think what is needed here is the ability to disable preemption
> > > > from the bpf program - maybe even adding a helper for that?
> > >
> > > I'm not sure what the issue is here.
> > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > be interrupted by another bpf prog.
> > > Like networking bpf prog in old preempt_disable can call into something
> > > where there is a kprobe and another tracing bpf prog will be called.
> > > The same can happen after we switched to migrate_disable.
> >
> > One difference here is that in what you describe the programmer can
> > know in advance which functions might call others and avoid that or
> > use other percpu maps, but if preemption can happen between functions
> > which are not related to one another (don't have a relation of caller
> > and callee), then the programmer can't have control over it
>
> Could you give a specific example?

Sure. I can give two examples from places where we saw such corruption:

1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
and a percpu scratch map to prepare the event for submit. When we also
added a TRACEPOINT to trace sched_process_free (where we also use this
scratch percpu map), the security_file_open events got corrupted and
we didn't know what was happening (was very hard to debug)
2. same was happening when kprobes were combined with cgroup_skb
programs to trace network events

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-24 17:38                       ` Yaniv Agman
@ 2023-01-25  0:04                         ` Alexei Starovoitov
  2023-01-25 16:39                           ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-25  0:04 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת ‪Alexei Starovoitov‬‏
> <‪alexei.starovoitov@gmail.com‬‏>:‬
> >
> > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > >
> > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > >
> > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > >
> > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > > > <‪jakub@cloudflare.com‬‏>:‬
> > > > > >
> > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > > > >>
> > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > >> >>> is there a way to do that?
> > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > >> same cpu.
> > > > > > >
> > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > > > zero, should I ignore the event?
> > > > > >
> > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > Perhaps that's needed now too.
> > > > >
> > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > from the bpf program - maybe even adding a helper for that?
> > > >
> > > > I'm not sure what the issue is here.
> > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > be interrupted by another bpf prog.
> > > > Like networking bpf prog in old preempt_disable can call into something
> > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > The same can happen after we switched to migrate_disable.
> > >
> > > One difference here is that in what you describe the programmer can
> > > know in advance which functions might call others and avoid that or
> > > use other percpu maps, but if preemption can happen between functions
> > > which are not related to one another (don't have a relation of caller
> > > and callee), then the programmer can't have control over it
> >
> > Could you give a specific example?
>
> Sure. I can give two examples from places where we saw such corruption:
>
> 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> and a percpu scratch map to prepare the event for submit. When we also
> added a TRACEPOINT to trace sched_process_free (where we also use this
> scratch percpu map), the security_file_open events got corrupted and
> we didn't know what was happening (was very hard to debug)

bpf_lsm_file_open is sleepable.
We never did preempt_disable there.
kprobe on security_file_open works, but it's a bit of a hack.
With preempt->migrate
the delayed_put_task_struct->trace_sched_process_free can happen
on the same cpu if prog gets preempted in preemtable kernel, but something
is fishy here.
Non-sleepable bpf progs always run in rcu cs while
delayed_put_task_struct is call_rcu(),
so if you've used a per-cpu scratch buffer for a duration
of bpf prog (either old preempt_disable or new migrate_disable)
the trace_sched_process_free won't be called until prog is finished.

If you're using per-cpu scratch buffer to remember the data in one
execution of kprobe-bpf and consuming in the next then all bets are off.
This is going to break sooner or later.

> 2. same was happening when kprobes were combined with cgroup_skb
> programs to trace network events

irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
is it can run while cgroup-bpf hasn't finished.

Then there are differences in kernel configs: no preempt, full preempt,
preempt_rcu and differences in rcu_tasks_trace implementation
over the years. You can only assume that the validity of the pointer
to bpf per-cpu array and bpf per-cpu hash, but no assumption
about concurrent access to the same map if you share the same map
across different programs.
See BPF_F_LOCK earlier suggestion. It might be what you need.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-25  0:04                         ` Alexei Starovoitov
@ 2023-01-25 16:39                           ` Yaniv Agman
  2023-01-25 18:52                             ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-25 16:39 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת ‪Alexei Starovoitov‬‏
> > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > >
> > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > >
> > > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> > > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > > >
> > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > > >
> > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > > > > <‪jakub@cloudflare.com‬‏>:‬
> > > > > > >
> > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > > > > >>
> > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > >> >>> is there a way to do that?
> > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > >> same cpu.
> > > > > > > >
> > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > > > > zero, should I ignore the event?
> > > > > > >
> > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > Perhaps that's needed now too.
> > > > > >
> > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > from the bpf program - maybe even adding a helper for that?
> > > > >
> > > > > I'm not sure what the issue is here.
> > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > be interrupted by another bpf prog.
> > > > > Like networking bpf prog in old preempt_disable can call into something
> > > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > > The same can happen after we switched to migrate_disable.
> > > >
> > > > One difference here is that in what you describe the programmer can
> > > > know in advance which functions might call others and avoid that or
> > > > use other percpu maps, but if preemption can happen between functions
> > > > which are not related to one another (don't have a relation of caller
> > > > and callee), then the programmer can't have control over it
> > >
> > > Could you give a specific example?
> >
> > Sure. I can give two examples from places where we saw such corruption:
> >
> > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > and a percpu scratch map to prepare the event for submit. When we also
> > added a TRACEPOINT to trace sched_process_free (where we also use this
> > scratch percpu map), the security_file_open events got corrupted and
> > we didn't know what was happening (was very hard to debug)
>
> bpf_lsm_file_open is sleepable.
> We never did preempt_disable there.
> kprobe on security_file_open works, but it's a bit of a hack.
> With preempt->migrate
> the delayed_put_task_struct->trace_sched_process_free can happen
> on the same cpu if prog gets preempted in preemtable kernel, but something
> is fishy here.
> Non-sleepable bpf progs always run in rcu cs while
> delayed_put_task_struct is call_rcu(),
> so if you've used a per-cpu scratch buffer for a duration
> of bpf prog (either old preempt_disable or new migrate_disable)
> the trace_sched_process_free won't be called until prog is finished.
>
> If you're using per-cpu scratch buffer to remember the data in one
> execution of kprobe-bpf and consuming in the next then all bets are off.
> This is going to break sooner or later.
>

Yes, I agree this is fishy, but we only use the per-cpu scratch map
for the duration of a bpf program, assuming tail-calls are considered
to be part of the same program.
We've even spotted the place where preemption happens to be right
after calling bpf_perf_event_submit() helper by placing bpf_printk in
the security_file_open kprobe. Before the call to the helper the
buffer was ok, but after it got corrupted (and there were more lines
in the program after the call to this helper).

> > 2. same was happening when kprobes were combined with cgroup_skb
> > programs to trace network events
>
> irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
> is it can run while cgroup-bpf hasn't finished.
>

Actually what we saw here is that the ingress skb program that runs
from ksoftirq/idle context corrupts a percpu map shared with a
raw_sys_exit tracepoint used to submit execve/openat events.

> Then there are differences in kernel configs: no preempt, full preempt,
> preempt_rcu and differences in rcu_tasks_trace implementation
> over the years. You can only assume that the validity of the pointer
> to bpf per-cpu array and bpf per-cpu hash, but no assumption
> about concurrent access to the same map if you share the same map
> across different programs.

I think that is the main point here. Thanks for clarifying this.

> See BPF_F_LOCK earlier suggestion. It might be what you need.

I am not sure how exactly this can help. The size of the percpu buffer
we use is 32kb and we don't use the map_update helper but the pointer
returned from the lookup to update the map in several places in the
program.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-25 16:39                           ` Yaniv Agman
@ 2023-01-25 18:52                             ` Alexei Starovoitov
  2023-01-25 19:59                               ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-25 18:52 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> ‫בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת ‪Alexei Starovoitov‬‏
> <‪alexei.starovoitov@gmail.com‬‏>:‬
> >
> > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > >
> > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת ‪Alexei Starovoitov‬‏
> > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > >
> > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > >
> > > > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> > > > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > > > >
> > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > > > >
> > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > > > > > <‪jakub@cloudflare.com‬‏>:‬
> > > > > > > >
> > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > > > > > >>
> > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > > >> >>> is there a way to do that?
> > > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > > >> same cpu.
> > > > > > > > >
> > > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > > > > > zero, should I ignore the event?
> > > > > > > >
> > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > > Perhaps that's needed now too.
> > > > > > >
> > > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > > from the bpf program - maybe even adding a helper for that?
> > > > > >
> > > > > > I'm not sure what the issue is here.
> > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > > be interrupted by another bpf prog.
> > > > > > Like networking bpf prog in old preempt_disable can call into something
> > > > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > > > The same can happen after we switched to migrate_disable.
> > > > >
> > > > > One difference here is that in what you describe the programmer can
> > > > > know in advance which functions might call others and avoid that or
> > > > > use other percpu maps, but if preemption can happen between functions
> > > > > which are not related to one another (don't have a relation of caller
> > > > > and callee), then the programmer can't have control over it
> > > >
> > > > Could you give a specific example?
> > >
> > > Sure. I can give two examples from places where we saw such corruption:
> > >
> > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > > and a percpu scratch map to prepare the event for submit. When we also
> > > added a TRACEPOINT to trace sched_process_free (where we also use this
> > > scratch percpu map), the security_file_open events got corrupted and
> > > we didn't know what was happening (was very hard to debug)
> >
> > bpf_lsm_file_open is sleepable.
> > We never did preempt_disable there.
> > kprobe on security_file_open works, but it's a bit of a hack.
> > With preempt->migrate
> > the delayed_put_task_struct->trace_sched_process_free can happen
> > on the same cpu if prog gets preempted in preemtable kernel, but something
> > is fishy here.
> > Non-sleepable bpf progs always run in rcu cs while
> > delayed_put_task_struct is call_rcu(),
> > so if you've used a per-cpu scratch buffer for a duration
> > of bpf prog (either old preempt_disable or new migrate_disable)
> > the trace_sched_process_free won't be called until prog is finished.
> >
> > If you're using per-cpu scratch buffer to remember the data in one
> > execution of kprobe-bpf and consuming in the next then all bets are off.
> > This is going to break sooner or later.
> >
>
> Yes, I agree this is fishy, but we only use the per-cpu scratch map
> for the duration of a bpf program, assuming tail-calls are considered
> to be part of the same program.
> We've even spotted the place where preemption happens to be right
> after calling bpf_perf_event_submit() helper by placing bpf_printk in
> the security_file_open kprobe. Before the call to the helper the
> buffer was ok, but after it got corrupted (and there were more lines
> in the program after the call to this helper).

I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise
which will send an ipi to the current cpu which may resched.

I'm still missing why kprobe in security_file_open has to share
per-cpu scratch buffer with kprobe in trace_sched_process_free.
Why not use a scratch buffer per program?

>
> > > 2. same was happening when kprobes were combined with cgroup_skb
> > > programs to trace network events
> >
> > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
> > is it can run while cgroup-bpf hasn't finished.
> >
>
> Actually what we saw here is that the ingress skb program that runs
> from ksoftirq/idle context corrupts a percpu map shared with a
> raw_sys_exit tracepoint used to submit execve/openat events.
>
> > Then there are differences in kernel configs: no preempt, full preempt,
> > preempt_rcu and differences in rcu_tasks_trace implementation
> > over the years. You can only assume that the validity of the pointer
> > to bpf per-cpu array and bpf per-cpu hash, but no assumption
> > about concurrent access to the same map if you share the same map
> > across different programs.
>
> I think that is the main point here. Thanks for clarifying this.
>
> > See BPF_F_LOCK earlier suggestion. It might be what you need.
>
> I am not sure how exactly this can help. The size of the percpu buffer
> we use is 32kb and we don't use the map_update helper but the pointer
> returned from the lookup to update the map in several places in the
> program.

It wasn't clear what you do with the scratch buffer.
BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without
a race with bpf prog that would use bpf_spin_lock() to access the element.
Looks like that's not the case you had in mind.
You just stream these buffers to user space via bpf_perf_event_output().

Anyway back to preempt_disable(). Think of it as a giant spin_lock
that covers the whole program. In preemptable kernels it hurts
tail latency and fairness, and is completely unacceptable in RT.
That's why we moved to migrate_disable.
Technically we can add bpf_preempt_disable() kfunc, but if we do that
we'll be back to square one. The issues with preemptions and RT
will reappear. So let's figure out a different solution.
Why not use a scratch buffer per program ?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-25 18:52                             ` Alexei Starovoitov
@ 2023-01-25 19:59                               ` Yaniv Agman
  2023-01-26  2:22                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-25 19:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ד׳, 25 בינו׳ 2023 ב-20:52 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > ‫בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת ‪Alexei Starovoitov‬‏
> > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > >
> > > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > >
> > > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת ‪Alexei Starovoitov‬‏
> > > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > > >
> > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > > >
> > > > > > ‫בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת ‪Alexei Starovoitov‬‏
> > > > > > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > > > > > >
> > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > > > > > >
> > > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת ‪Jakub Sitnicki‬‏
> > > > > > > > <‪jakub@cloudflare.com‬‏>:‬
> > > > > > > > >
> > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > > > ‫בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת ‪Martin KaFai Lau‬‏
> > > > > > > > > > <‪martin.lau@linux.dev‬‏>:‬
> > > > > > > > > >>
> > > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > > > >> >>> is there a way to do that?
> > > > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > > > >> same cpu.
> > > > > > > > > >
> > > > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > > > program data on a percpu map and I see that the counter is bigger then
> > > > > > > > > > zero, should I ignore the event?
> > > > > > > > >
> > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > > > Perhaps that's needed now too.
> > > > > > > >
> > > > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > > > from the bpf program - maybe even adding a helper for that?
> > > > > > >
> > > > > > > I'm not sure what the issue is here.
> > > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > > > be interrupted by another bpf prog.
> > > > > > > Like networking bpf prog in old preempt_disable can call into something
> > > > > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > > > > The same can happen after we switched to migrate_disable.
> > > > > >
> > > > > > One difference here is that in what you describe the programmer can
> > > > > > know in advance which functions might call others and avoid that or
> > > > > > use other percpu maps, but if preemption can happen between functions
> > > > > > which are not related to one another (don't have a relation of caller
> > > > > > and callee), then the programmer can't have control over it
> > > > >
> > > > > Could you give a specific example?
> > > >
> > > > Sure. I can give two examples from places where we saw such corruption:
> > > >
> > > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > > > and a percpu scratch map to prepare the event for submit. When we also
> > > > added a TRACEPOINT to trace sched_process_free (where we also use this
> > > > scratch percpu map), the security_file_open events got corrupted and
> > > > we didn't know what was happening (was very hard to debug)
> > >
> > > bpf_lsm_file_open is sleepable.
> > > We never did preempt_disable there.
> > > kprobe on security_file_open works, but it's a bit of a hack.
> > > With preempt->migrate
> > > the delayed_put_task_struct->trace_sched_process_free can happen
> > > on the same cpu if prog gets preempted in preemtable kernel, but something
> > > is fishy here.
> > > Non-sleepable bpf progs always run in rcu cs while
> > > delayed_put_task_struct is call_rcu(),
> > > so if you've used a per-cpu scratch buffer for a duration
> > > of bpf prog (either old preempt_disable or new migrate_disable)
> > > the trace_sched_process_free won't be called until prog is finished.
> > >
> > > If you're using per-cpu scratch buffer to remember the data in one
> > > execution of kprobe-bpf and consuming in the next then all bets are off.
> > > This is going to break sooner or later.
> > >
> >
> > Yes, I agree this is fishy, but we only use the per-cpu scratch map
> > for the duration of a bpf program, assuming tail-calls are considered
> > to be part of the same program.
> > We've even spotted the place where preemption happens to be right
> > after calling bpf_perf_event_submit() helper by placing bpf_printk in
> > the security_file_open kprobe. Before the call to the helper the
> > buffer was ok, but after it got corrupted (and there were more lines
> > in the program after the call to this helper).
>
> I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise
> which will send an ipi to the current cpu which may resched.
>
> I'm still missing why kprobe in security_file_open has to share
> per-cpu scratch buffer with kprobe in trace_sched_process_free.
> Why not use a scratch buffer per program?
>

Well, since all the documentation I read about eBPF says that BPF
programs are non-preemptible, I assumed it will be ok if we use the
same per-cpu scratch buffer for all programs as long as we use it only
in the scope of the program, but now I understand this assumption (and
documentation) was wrong.

> >
> > > > 2. same was happening when kprobes were combined with cgroup_skb
> > > > programs to trace network events
> > >
> > > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
> > > is it can run while cgroup-bpf hasn't finished.
> > >
> >
> > Actually what we saw here is that the ingress skb program that runs
> > from ksoftirq/idle context corrupts a percpu map shared with a
> > raw_sys_exit tracepoint used to submit execve/openat events.
> >
> > > Then there are differences in kernel configs: no preempt, full preempt,
> > > preempt_rcu and differences in rcu_tasks_trace implementation
> > > over the years. You can only assume that the validity of the pointer
> > > to bpf per-cpu array and bpf per-cpu hash, but no assumption
> > > about concurrent access to the same map if you share the same map
> > > across different programs.
> >
> > I think that is the main point here. Thanks for clarifying this.
> >
> > > See BPF_F_LOCK earlier suggestion. It might be what you need.
> >
> > I am not sure how exactly this can help. The size of the percpu buffer
> > we use is 32kb and we don't use the map_update helper but the pointer
> > returned from the lookup to update the map in several places in the
> > program.
>
> It wasn't clear what you do with the scratch buffer.
> BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without
> a race with bpf prog that would use bpf_spin_lock() to access the element.
> Looks like that's not the case you had in mind.
> You just stream these buffers to user space via bpf_perf_event_output().
>

exactly

> Anyway back to preempt_disable(). Think of it as a giant spin_lock
> that covers the whole program. In preemptable kernels it hurts
> tail latency and fairness, and is completely unacceptable in RT.
> That's why we moved to migrate_disable.
> Technically we can add bpf_preempt_disable() kfunc, but if we do that
> we'll be back to square one. The issues with preemptions and RT
> will reappear. So let's figure out a different solution.
> Why not use a scratch buffer per program ?

Totally understand the reason for avoiding preemption disable,
especially in RT kernels.
I believe the answer for why not to use a scratch buffer per program
will simply be memory space.
In our use-case, Tracee [1], we let the user choose whatever events to
trace for a specific workload.
This list of events is very big, and we have many BPF programs
attached to different places of the kernel.
Let's assume that we have 100 events, and for each event we have a
different BPF program.
Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
cpu, and ~100MB per 32 CPUs, which is more common for our case.
Since we always add new events to Tracee, this will also not be scalable.
Yet, if there is no other solution, I believe we will go in that direction

[1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-25 19:59                               ` Yaniv Agman
@ 2023-01-26  2:22                                 ` Alexei Starovoitov
  2023-01-26  6:59                                   ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-26  2:22 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > that covers the whole program. In preemptable kernels it hurts
> > tail latency and fairness, and is completely unacceptable in RT.
> > That's why we moved to migrate_disable.
> > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > we'll be back to square one. The issues with preemptions and RT
> > will reappear. So let's figure out a different solution.
> > Why not use a scratch buffer per program ?
>
> Totally understand the reason for avoiding preemption disable,
> especially in RT kernels.
> I believe the answer for why not to use a scratch buffer per program
> will simply be memory space.
> In our use-case, Tracee [1], we let the user choose whatever events to
> trace for a specific workload.
> This list of events is very big, and we have many BPF programs
> attached to different places of the kernel.
> Let's assume that we have 100 events, and for each event we have a
> different BPF program.
> Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> cpu, and ~100MB per 32 CPUs, which is more common for our case.

Well, 100 bpf progs consume at least a page each,
so you might want one program attached to all events.

> Since we always add new events to Tracee, this will also not be scalable.
> Yet, if there is no other solution, I believe we will go in that direction
>
> [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c

you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?

Insead of scratch_map per program, use atomic per-cpu counter
for recursion.
You'll have 3 levels in the worst case.
So it will be:
BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
On prog entry increment the recursion counter, on exit decrement.
And use that particular scratch_t in the prog.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-26  2:22                                 ` Alexei Starovoitov
@ 2023-01-26  6:59                                   ` Yaniv Agman
  2023-01-26 15:29                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Yaniv Agman @ 2023-01-26  6:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > that covers the whole program. In preemptable kernels it hurts
> > > tail latency and fairness, and is completely unacceptable in RT.
> > > That's why we moved to migrate_disable.
> > > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > > we'll be back to square one. The issues with preemptions and RT
> > > will reappear. So let's figure out a different solution.
> > > Why not use a scratch buffer per program ?
> >
> > Totally understand the reason for avoiding preemption disable,
> > especially in RT kernels.
> > I believe the answer for why not to use a scratch buffer per program
> > will simply be memory space.
> > In our use-case, Tracee [1], we let the user choose whatever events to
> > trace for a specific workload.
> > This list of events is very big, and we have many BPF programs
> > attached to different places of the kernel.
> > Let's assume that we have 100 events, and for each event we have a
> > different BPF program.
> > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> > cpu, and ~100MB per 32 CPUs, which is more common for our case.
>
> Well, 100 bpf progs consume at least a page each,
> so you might want one program attached to all events.
>

Unfortunately, I don't think that would be possible. We still need to
support kernels with 4096 instructions limit.
We may add some generic programs for events with simpler logic, but
even then, support for bpf cookies needed for such programs was only
added in more recent kernels.

> > Since we always add new events to Tracee, this will also not be scalable.
> > Yet, if there is no other solution, I believe we will go in that direction
> >
> > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
>
> you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?

We actually have 3 different percpu maps there shared between
different programs so we will have to take care of them all.

>
> Insead of scratch_map per program, use atomic per-cpu counter
> for recursion.
> You'll have 3 levels in the worst case.

Is it guaranteed that no more than 3 levels exist?
I suggested a similar solution with 2 levels at the beginning of this
thread, but Yonghong Song replied that there is no restriction on
this.

> So it will be:
> BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> On prog entry increment the recursion counter, on exit decrement.
> And use that particular scratch_t in the prog.

I'm just afraid of potential deadlocks. For sure we will need to
decrement the counter on each return path, but can it happen that a
bpf program crashes, leaving the counter non-zero? I know that's the
job of the verifier to make sure such things won't happen, but can we
be sure of that?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-26  6:59                                   ` Yaniv Agman
@ 2023-01-26 15:29                                     ` Alexei Starovoitov
  2023-01-26 17:51                                       ` Yaniv Agman
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-01-26 15:29 UTC (permalink / raw)
  To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@gmail.com> wrote:
>
> ‫בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת ‪Alexei Starovoitov‬‏
> <‪alexei.starovoitov@gmail.com‬‏>:‬
> >
> > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > >
> > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > that covers the whole program. In preemptable kernels it hurts
> > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > That's why we moved to migrate_disable.
> > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > > > we'll be back to square one. The issues with preemptions and RT
> > > > will reappear. So let's figure out a different solution.
> > > > Why not use a scratch buffer per program ?
> > >
> > > Totally understand the reason for avoiding preemption disable,
> > > especially in RT kernels.
> > > I believe the answer for why not to use a scratch buffer per program
> > > will simply be memory space.
> > > In our use-case, Tracee [1], we let the user choose whatever events to
> > > trace for a specific workload.
> > > This list of events is very big, and we have many BPF programs
> > > attached to different places of the kernel.
> > > Let's assume that we have 100 events, and for each event we have a
> > > different BPF program.
> > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> > > cpu, and ~100MB per 32 CPUs, which is more common for our case.
> >
> > Well, 100 bpf progs consume at least a page each,
> > so you might want one program attached to all events.
> >
>
> Unfortunately, I don't think that would be possible. We still need to
> support kernels with 4096 instructions limit.
> We may add some generic programs for events with simpler logic, but
> even then, support for bpf cookies needed for such programs was only
> added in more recent kernels.
>
> > > Since we always add new events to Tracee, this will also not be scalable.
> > > Yet, if there is no other solution, I believe we will go in that direction
> > >
> > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> >
> > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
>
> We actually have 3 different percpu maps there shared between
> different programs so we will have to take care of them all.
>
> >
> > Insead of scratch_map per program, use atomic per-cpu counter
> > for recursion.
> > You'll have 3 levels in the worst case.
>
> Is it guaranteed that no more than 3 levels exist?
> I suggested a similar solution with 2 levels at the beginning of this
> thread, but Yonghong Song replied that there is no restriction on
> this.

There are no restrictions, but I doubt you can construct a case
where you'll see more than 3 in practice.

> > So it will be:
> > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > On prog entry increment the recursion counter, on exit decrement.
> > And use that particular scratch_t in the prog.
>
> I'm just afraid of potential deadlocks. For sure we will need to
> decrement the counter on each return path, but can it happen that a
> bpf program crashes, leaving the counter non-zero? I know that's the
> job of the verifier to make sure such things won't happen, but can we
> be sure of that?

if you don't trust the verifier you shouldn't be using bpf in the first place.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Are BPF programs preemptible?
  2023-01-26 15:29                                     ` Alexei Starovoitov
@ 2023-01-26 17:51                                       ` Yaniv Agman
  0 siblings, 0 replies; 21+ messages in thread
From: Yaniv Agman @ 2023-01-26 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song

‫בתאריך יום ה׳, 26 בינו׳ 2023 ב-17:29 מאת ‪Alexei Starovoitov‬‏
<‪alexei.starovoitov@gmail.com‬‏>:‬
>
> On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@gmail.com> wrote:
> >
> > ‫בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת ‪Alexei Starovoitov‬‏
> > <‪alexei.starovoitov@gmail.com‬‏>:‬
> > >
> > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote:
> > > >
> > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > > that covers the whole program. In preemptable kernels it hurts
> > > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > > That's why we moved to migrate_disable.
> > > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > > > > we'll be back to square one. The issues with preemptions and RT
> > > > > will reappear. So let's figure out a different solution.
> > > > > Why not use a scratch buffer per program ?
> > > >
> > > > Totally understand the reason for avoiding preemption disable,
> > > > especially in RT kernels.
> > > > I believe the answer for why not to use a scratch buffer per program
> > > > will simply be memory space.
> > > > In our use-case, Tracee [1], we let the user choose whatever events to
> > > > trace for a specific workload.
> > > > This list of events is very big, and we have many BPF programs
> > > > attached to different places of the kernel.
> > > > Let's assume that we have 100 events, and for each event we have a
> > > > different BPF program.
> > > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> > > > cpu, and ~100MB per 32 CPUs, which is more common for our case.
> > >
> > > Well, 100 bpf progs consume at least a page each,
> > > so you might want one program attached to all events.
> > >
> >
> > Unfortunately, I don't think that would be possible. We still need to
> > support kernels with 4096 instructions limit.
> > We may add some generic programs for events with simpler logic, but
> > even then, support for bpf cookies needed for such programs was only
> > added in more recent kernels.
> >
> > > > Since we always add new events to Tracee, this will also not be scalable.
> > > > Yet, if there is no other solution, I believe we will go in that direction
> > > >
> > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> > >
> > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
> >
> > We actually have 3 different percpu maps there shared between
> > different programs so we will have to take care of them all.
> >
> > >
> > > Insead of scratch_map per program, use atomic per-cpu counter
> > > for recursion.
> > > You'll have 3 levels in the worst case.
> >
> > Is it guaranteed that no more than 3 levels exist?
> > I suggested a similar solution with 2 levels at the beginning of this
> > thread, but Yonghong Song replied that there is no restriction on
> > this.
>
> There are no restrictions, but I doubt you can construct a case
> where you'll see more than 3 in practice.
>

Ok, I see.

> > > So it will be:
> > > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > > On prog entry increment the recursion counter, on exit decrement.
> > > And use that particular scratch_t in the prog.
> >
> > I'm just afraid of potential deadlocks. For sure we will need to
> > decrement the counter on each return path, but can it happen that a
> > bpf program crashes, leaving the counter non-zero? I know that's the
> > job of the verifier to make sure such things won't happen, but can we
> > be sure of that?
>
> if you don't trust the verifier you shouldn't be using bpf in the first place.

Oh believe me I trust it, but I must be honest and say that I don't
know about all the intricacies of the kernel, especially when it comes
to scheduling and preemption.
So I am just trying to think if there is a path where this mechanism
can break, and express my concern about the potential for errors or
bugs that may lead to unexpected behaviors.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-01-26 17:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-01-23  9:21 Are BPF programs preemptible? Yaniv Agman
2023-01-23 10:46 ` Jakub Sitnicki
2023-01-23 12:30   ` Yaniv Agman
2023-01-23 17:02     ` Yonghong Song
2023-01-23 17:32       ` Yaniv Agman
2023-01-23 20:06         ` Martin KaFai Lau
2023-01-23 21:01           ` Yaniv Agman
2023-01-23 21:22             ` Jakub Sitnicki
2023-01-23 21:56               ` Yaniv Agman
2023-01-24 12:30                 ` Alexei Starovoitov
2023-01-24 15:47                   ` Yaniv Agman
2023-01-24 17:24                     ` Alexei Starovoitov
2023-01-24 17:38                       ` Yaniv Agman
2023-01-25  0:04                         ` Alexei Starovoitov
2023-01-25 16:39                           ` Yaniv Agman
2023-01-25 18:52                             ` Alexei Starovoitov
2023-01-25 19:59                               ` Yaniv Agman
2023-01-26  2:22                                 ` Alexei Starovoitov
2023-01-26  6:59                                   ` Yaniv Agman
2023-01-26 15:29                                     ` Alexei Starovoitov
2023-01-26 17:51                                       ` Yaniv Agman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.