* Are BPF programs preemptible? @ 2023-01-23 9:21 Yaniv Agman 2023-01-23 10:46 ` Jakub Sitnicki 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-23 9:21 UTC (permalink / raw) To: bpf Hello! Several places state that eBPF programs cannot be preempted by the kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however, I did see a strange behavior where an eBPF percpu map gets overridden, and I'm trying to figure out if it's due to a bug in my program or some misunderstanding I have about eBPF. What caught my eye was a sentence in a LWN article (https://lwn.net/Articles/812503/) that says: "Alexei thankfully enlightened me recently over a beer that the real intent here is to guarantee that the program runs to completion on the same CPU where it started". So my question is - are BPF programs guaranteed to run from start to end without being interrupted at all or the only guarantee I get is that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can interrupt their run? If the only guarantee is no migration, it means that a percpu map cannot be safely used by two different BPF programs that can preempt each other (e.g. some kprobe and a network cgroup program). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 9:21 Are BPF programs preemptible? Yaniv Agman @ 2023-01-23 10:46 ` Jakub Sitnicki 2023-01-23 12:30 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Jakub Sitnicki @ 2023-01-23 10:46 UTC (permalink / raw) To: Yaniv Agman; +Cc: bpf On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote: > Hello! > > Several places state that eBPF programs cannot be preempted by the > kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however, > I did see a strange behavior where an eBPF percpu map gets overridden, > and I'm trying to figure out if it's due to a bug in my program or > some misunderstanding I have about eBPF. What caught my eye was a > sentence in a LWN article (https://lwn.net/Articles/812503/) that > says: "Alexei thankfully enlightened me recently over a beer that the > real intent here is to guarantee that the program runs to completion > on the same CPU where it started". > > So my question is - are BPF programs guaranteed to run from start to > end without being interrupted at all or the only guarantee I get is > that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can > interrupt their run? > > If the only guarantee is no migration, it means that a percpu map > cannot be safely used by two different BPF programs that can preempt > each other (e.g. some kprobe and a network cgroup program). Since v5.7 BPF program runners use migrate_disable() instead of preempt_disable(). See commit 2a916f2f546c ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.") [1]. But at that time migrate_disable() was merely an alias for preempt_disable() on !CONFIG_PREEMPT_RT kernels. Since v5.11 migrate_disable() does no longer disable preemption on !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make migrate_disable/enable() independent of RT") [2]. So, yes, you are right, but it depends on the kernel version. PS. The migrate_disable vs per-CPU data problem is also covered in [3]. [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0 [3]: https://lwn.net/Articles/836503/ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 10:46 ` Jakub Sitnicki @ 2023-01-23 12:30 ` Yaniv Agman 2023-01-23 17:02 ` Yonghong Song 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-23 12:30 UTC (permalink / raw) To: Jakub Sitnicki; +Cc: bpf Ok, thanks Jakub for the answer and references. I must say that I am very surprised though. First, most of the documentation for BPF says that preemption is disabled, like the reference I gave [1] and even the bpf-helpers man page [2] says "Note that all programs run with preemption disabled..." for the bpf_get_smp_processor_id() helper. I think this is something that deserves more attention since many eBPF developers are still under the assumption that eBPF programs are non-preemptible, and running their programs on newer kernels might be broken. I'm trying to figure out how I can solve this issue in our case - is it correct to assume that no more than one preemption can happen during a run of my bpf program? If so, I can try to write a percpu buffer with 2 entries, and give the second entry to the program that interrupted the first one. But even then, I will need to find a way to know if my program currently interrupts the run of another program - is there a way to do that? Maybe checking if the current context is of an interrupt, can this be done? Any other suggestions to solve this problem? [1]: https://docs.cilium.io/en/latest/bpf/toolchain [2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html Thanks, Yaniv בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת Jakub Sitnicki <jakub@cloudflare.com>: > > On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote: > > Hello! > > > > Several places state that eBPF programs cannot be preempted by the > > kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain), however, > > I did see a strange behavior where an eBPF percpu map gets overridden, > > and I'm trying to figure out if it's due to a bug in my program or > > some misunderstanding I have about eBPF. What caught my eye was a > > sentence in a LWN article (https://lwn.net/Articles/812503/) that > > says: "Alexei thankfully enlightened me recently over a beer that the > > real intent here is to guarantee that the program runs to completion > > on the same CPU where it started". > > > > So my question is - are BPF programs guaranteed to run from start to > > end without being interrupted at all or the only guarantee I get is > > that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can > > interrupt their run? > > > > If the only guarantee is no migration, it means that a percpu map > > cannot be safely used by two different BPF programs that can preempt > > each other (e.g. some kprobe and a network cgroup program). > > Since v5.7 BPF program runners use migrate_disable() instead of > preempt_disable(). See commit 2a916f2f546c ("bpf: Use > migrate_disable/enable in array macros and cgroup/lirc code.") [1]. > > But at that time migrate_disable() was merely an alias for > preempt_disable() on !CONFIG_PREEMPT_RT kernels. > > Since v5.11 migrate_disable() does no longer disable preemption on > !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make > migrate_disable/enable() independent of RT") [2]. > > So, yes, you are right, but it depends on the kernel version. > > PS. The migrate_disable vs per-CPU data problem is also covered in [3]. > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0 > [3]: https://lwn.net/Articles/836503/ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 12:30 ` Yaniv Agman @ 2023-01-23 17:02 ` Yonghong Song 2023-01-23 17:32 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Yonghong Song @ 2023-01-23 17:02 UTC (permalink / raw) To: Yaniv Agman, Jakub Sitnicki; +Cc: bpf On 1/23/23 4:30 AM, Yaniv Agman wrote: > Ok, thanks Jakub for the answer and references. > I must say that I am very surprised though. First, most of the > documentation for BPF says that preemption is disabled, like the > reference I gave [1] and even the bpf-helpers man page [2] says "Note > that all programs run with preemption disabled..." for the > bpf_get_smp_processor_id() helper. I think this is something that > deserves more attention since many eBPF developers are still under the > assumption that eBPF programs are non-preemptible, and running their > programs on newer kernels might be broken. It would be great if you can send a patch to fix these out-dated comments! > > I'm trying to figure out how I can solve this issue in our case - is > it correct to assume that no more than one preemption can happen > during a run of my bpf program? If so, I can try to write a percpu No. It is possible that you have more than one preemption during the same prog run. There is no restriction on this. > buffer with 2 entries, and give the second entry to the program that > interrupted the first one. But even then, I will need to find a way to > know if my program currently interrupts the run of another program - > is there a way to do that? Maybe checking if the current context is of > an interrupt, can this be done? Any other suggestions to solve this > problem? > > [1]: https://docs.cilium.io/en/latest/bpf/toolchain > [2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html > > Thanks, > Yaniv > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת Jakub Sitnicki > <jakub@cloudflare.com>: >> >> On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote: >>> Hello! >>> >>> Several places state that eBPF programs cannot be preempted by the >>> kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain ), however, >>> I did see a strange behavior where an eBPF percpu map gets overridden, >>> and I'm trying to figure out if it's due to a bug in my program or >>> some misunderstanding I have about eBPF. What caught my eye was a >>> sentence in a LWN article (https://lwn.net/Articles/812503/ ) that >>> says: "Alexei thankfully enlightened me recently over a beer that the >>> real intent here is to guarantee that the program runs to completion >>> on the same CPU where it started". >>> >>> So my question is - are BPF programs guaranteed to run from start to >>> end without being interrupted at all or the only guarantee I get is >>> that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can >>> interrupt their run? >>> >>> If the only guarantee is no migration, it means that a percpu map >>> cannot be safely used by two different BPF programs that can preempt >>> each other (e.g. some kprobe and a network cgroup program). >> >> Since v5.7 BPF program runners use migrate_disable() instead of >> preempt_disable(). See commit 2a916f2f546c ("bpf: Use >> migrate_disable/enable in array macros and cgroup/lirc code.") [1]. >> >> But at that time migrate_disable() was merely an alias for >> preempt_disable() on !CONFIG_PREEMPT_RT kernels. >> >> Since v5.11 migrate_disable() does no longer disable preemption on >> !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make >> migrate_disable/enable() independent of RT") [2]. >> >> So, yes, you are right, but it depends on the kernel version. >> >> PS. The migrate_disable vs per-CPU data problem is also covered in [3]. >> >> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb >> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0 >> [3]: https://lwn.net/Articles/836503/ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 17:02 ` Yonghong Song @ 2023-01-23 17:32 ` Yaniv Agman 2023-01-23 20:06 ` Martin KaFai Lau 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-23 17:32 UTC (permalink / raw) To: Yonghong Song; +Cc: Jakub Sitnicki, bpf בתאריך יום ב׳, 23 בינו׳ 2023 ב-19:02 מאת Yonghong Song <yhs@meta.com>: > > > > On 1/23/23 4:30 AM, Yaniv Agman wrote: > > Ok, thanks Jakub for the answer and references. > > I must say that I am very surprised though. First, most of the > > documentation for BPF says that preemption is disabled, like the > > reference I gave [1] and even the bpf-helpers man page [2] says "Note > > that all programs run with preemption disabled..." for the > > bpf_get_smp_processor_id() helper. I think this is something that > > deserves more attention since many eBPF developers are still under the > > assumption that eBPF programs are non-preemptible, and running their > > programs on newer kernels might be broken. > > It would be great if you can send a patch to fix these > out-dated comments! > > > > > I'm trying to figure out how I can solve this issue in our case - is > > it correct to assume that no more than one preemption can happen > > during a run of my bpf program? If so, I can try to write a percpu > > No. It is possible that you have more than one preemption during the > same prog run. There is no restriction on this. > Any other suggestions for a solution to this problem? For us, it is a regression caused by this change, but I didn't find any proper alternative to use > > buffer with 2 entries, and give the second entry to the program that > > interrupted the first one. But even then, I will need to find a way to > > know if my program currently interrupts the run of another program - > > is there a way to do that? Maybe checking if the current context is of > > an interrupt, can this be done? Any other suggestions to solve this > > problem? > > > > [1]: https://docs.cilium.io/en/latest/bpf/toolchain > > [2]: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html > > > > Thanks, > > Yaniv > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-12:54 מאת Jakub Sitnicki > > <jakub@cloudflare.com>: > >> > >> On Mon, Jan 23, 2023 at 11:21 AM +02, Yaniv Agman wrote: > >>> Hello! > >>> > >>> Several places state that eBPF programs cannot be preempted by the > >>> kernel (e.g. https://docs.cilium.io/en/latest/bpf/toolchain ), however, > >>> I did see a strange behavior where an eBPF percpu map gets overridden, > >>> and I'm trying to figure out if it's due to a bug in my program or > >>> some misunderstanding I have about eBPF. What caught my eye was a > >>> sentence in a LWN article (https://lwn.net/Articles/812503/ ) that > >>> says: "Alexei thankfully enlightened me recently over a beer that the > >>> real intent here is to guarantee that the program runs to completion > >>> on the same CPU where it started". > >>> > >>> So my question is - are BPF programs guaranteed to run from start to > >>> end without being interrupted at all or the only guarantee I get is > >>> that they run on the same CPU but IRQs (NMIs, soft irqs, whatever) can > >>> interrupt their run? > >>> > >>> If the only guarantee is no migration, it means that a percpu map > >>> cannot be safely used by two different BPF programs that can preempt > >>> each other (e.g. some kprobe and a network cgroup program). > >> > >> Since v5.7 BPF program runners use migrate_disable() instead of > >> preempt_disable(). See commit 2a916f2f546c ("bpf: Use > >> migrate_disable/enable in array macros and cgroup/lirc code.") [1]. > >> > >> But at that time migrate_disable() was merely an alias for > >> preempt_disable() on !CONFIG_PREEMPT_RT kernels. > >> > >> Since v5.11 migrate_disable() does no longer disable preemption on > >> !CONFIG_PREEMPT_RT kernels. See commit 74d862b682f5 ("sched: Make > >> migrate_disable/enable() independent of RT") [2]. > >> > >> So, yes, you are right, but it depends on the kernel version. > >> > >> PS. The migrate_disable vs per-CPU data problem is also covered in [3]. > >> > >> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a916f2f546ca1c1e3323e2a4269307f6d9890eb > >> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d862b682f51e45d25b95b1ecf212428a4967b0 > >> [3]: https://lwn.net/Articles/836503/ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 17:32 ` Yaniv Agman @ 2023-01-23 20:06 ` Martin KaFai Lau 2023-01-23 21:01 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Martin KaFai Lau @ 2023-01-23 20:06 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, bpf, Yonghong Song On 1/23/23 9:32 AM, Yaniv Agman wrote: >>> interrupted the first one. But even then, I will need to find a way to >>> know if my program currently interrupts the run of another program - >>> is there a way to do that? May be a percpu atomic counter to see if the bpf prog has been re-entered on the same cpu. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 20:06 ` Martin KaFai Lau @ 2023-01-23 21:01 ` Yaniv Agman 2023-01-23 21:22 ` Jakub Sitnicki 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-23 21:01 UTC (permalink / raw) To: Martin KaFai Lau; +Cc: Jakub Sitnicki, bpf, Yonghong Song בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau <martin.lau@linux.dev>: > > On 1/23/23 9:32 AM, Yaniv Agman wrote: > >>> interrupted the first one. But even then, I will need to find a way to > >>> know if my program currently interrupts the run of another program - > >>> is there a way to do that? > May be a percpu atomic counter to see if the bpf prog has been re-entered on the > same cpu. Not sure I understand how this will help. If I want to save local program data on a percpu map and I see that the counter is bigger then zero, should I ignore the event? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 21:01 ` Yaniv Agman @ 2023-01-23 21:22 ` Jakub Sitnicki 2023-01-23 21:56 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Jakub Sitnicki @ 2023-01-23 21:22 UTC (permalink / raw) To: Yaniv Agman; +Cc: Martin KaFai Lau, bpf, Yonghong Song On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > <martin.lau@linux.dev>: >> >> On 1/23/23 9:32 AM, Yaniv Agman wrote: >> >>> interrupted the first one. But even then, I will need to find a way to >> >>> know if my program currently interrupts the run of another program - >> >>> is there a way to do that? >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the >> same cpu. > > Not sure I understand how this will help. If I want to save local > program data on a percpu map and I see that the counter is bigger then > zero, should I ignore the event? map_update w/ BPF_F_LOCK disables preemption, if you're after updating an entry atomically. But it can't be used with PERCPU maps today. Perhaps that's needed now too. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 21:22 ` Jakub Sitnicki @ 2023-01-23 21:56 ` Yaniv Agman 2023-01-24 12:30 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-23 21:56 UTC (permalink / raw) To: Jakub Sitnicki; +Cc: Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki <jakub@cloudflare.com>: > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > <martin.lau@linux.dev>: > >> > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > >> >>> interrupted the first one. But even then, I will need to find a way to > >> >>> know if my program currently interrupts the run of another program - > >> >>> is there a way to do that? > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > >> same cpu. > > > > Not sure I understand how this will help. If I want to save local > > program data on a percpu map and I see that the counter is bigger then > > zero, should I ignore the event? > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > an entry atomically. But it can't be used with PERCPU maps today. > Perhaps that's needed now too. Yep. I think what is needed here is the ability to disable preemption from the bpf program - maybe even adding a helper for that? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-23 21:56 ` Yaniv Agman @ 2023-01-24 12:30 ` Alexei Starovoitov 2023-01-24 15:47 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-24 12:30 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > <jakub@cloudflare.com>: > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > <martin.lau@linux.dev>: > > >> > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > >> >>> interrupted the first one. But even then, I will need to find a way to > > >> >>> know if my program currently interrupts the run of another program - > > >> >>> is there a way to do that? > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > >> same cpu. > > > > > > Not sure I understand how this will help. If I want to save local > > > program data on a percpu map and I see that the counter is bigger then > > > zero, should I ignore the event? > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > an entry atomically. But it can't be used with PERCPU maps today. > > Perhaps that's needed now too. > > Yep. I think what is needed here is the ability to disable preemption > from the bpf program - maybe even adding a helper for that? I'm not sure what the issue is here. Old preempt_disable() doesn't mean that one bpf program won't ever be interrupted by another bpf prog. Like networking bpf prog in old preempt_disable can call into something where there is a kprobe and another tracing bpf prog will be called. The same can happen after we switched to migrate_disable. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-24 12:30 ` Alexei Starovoitov @ 2023-01-24 15:47 ` Yaniv Agman 2023-01-24 17:24 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-24 15:47 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > <jakub@cloudflare.com>: > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > <martin.lau@linux.dev>: > > > >> > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > >> >>> know if my program currently interrupts the run of another program - > > > >> >>> is there a way to do that? > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > >> same cpu. > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > program data on a percpu map and I see that the counter is bigger then > > > > zero, should I ignore the event? > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > an entry atomically. But it can't be used with PERCPU maps today. > > > Perhaps that's needed now too. > > > > Yep. I think what is needed here is the ability to disable preemption > > from the bpf program - maybe even adding a helper for that? > > I'm not sure what the issue is here. > Old preempt_disable() doesn't mean that one bpf program won't ever > be interrupted by another bpf prog. > Like networking bpf prog in old preempt_disable can call into something > where there is a kprobe and another tracing bpf prog will be called. > The same can happen after we switched to migrate_disable. One difference here is that in what you describe the programmer can know in advance which functions might call others and avoid that or use other percpu maps, but if preemption can happen between functions which are not related to one another (don't have a relation of caller and callee), then the programmer can't have control over it ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-24 15:47 ` Yaniv Agman @ 2023-01-24 17:24 ` Alexei Starovoitov 2023-01-24 17:38 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-24 17:24 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > <alexei.starovoitov@gmail.com>: > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > <jakub@cloudflare.com>: > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > <martin.lau@linux.dev>: > > > > >> > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > >> >>> know if my program currently interrupts the run of another program - > > > > >> >>> is there a way to do that? > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > >> same cpu. > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > zero, should I ignore the event? > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > Perhaps that's needed now too. > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > from the bpf program - maybe even adding a helper for that? > > > > I'm not sure what the issue is here. > > Old preempt_disable() doesn't mean that one bpf program won't ever > > be interrupted by another bpf prog. > > Like networking bpf prog in old preempt_disable can call into something > > where there is a kprobe and another tracing bpf prog will be called. > > The same can happen after we switched to migrate_disable. > > One difference here is that in what you describe the programmer can > know in advance which functions might call others and avoid that or > use other percpu maps, but if preemption can happen between functions > which are not related to one another (don't have a relation of caller > and callee), then the programmer can't have control over it Could you give a specific example? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-24 17:24 ` Alexei Starovoitov @ 2023-01-24 17:38 ` Yaniv Agman 2023-01-25 0:04 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-24 17:38 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > > <alexei.starovoitov@gmail.com>: > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > > <jakub@cloudflare.com>: > > > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > > <martin.lau@linux.dev>: > > > > > >> > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > > >> >>> know if my program currently interrupts the run of another program - > > > > > >> >>> is there a way to do that? > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > > >> same cpu. > > > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > > zero, should I ignore the event? > > > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > > Perhaps that's needed now too. > > > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > > from the bpf program - maybe even adding a helper for that? > > > > > > I'm not sure what the issue is here. > > > Old preempt_disable() doesn't mean that one bpf program won't ever > > > be interrupted by another bpf prog. > > > Like networking bpf prog in old preempt_disable can call into something > > > where there is a kprobe and another tracing bpf prog will be called. > > > The same can happen after we switched to migrate_disable. > > > > One difference here is that in what you describe the programmer can > > know in advance which functions might call others and avoid that or > > use other percpu maps, but if preemption can happen between functions > > which are not related to one another (don't have a relation of caller > > and callee), then the programmer can't have control over it > > Could you give a specific example? Sure. I can give two examples from places where we saw such corruption: 1. We use kprobes to trace some LSM hooks, e.g. security_file_open, and a percpu scratch map to prepare the event for submit. When we also added a TRACEPOINT to trace sched_process_free (where we also use this scratch percpu map), the security_file_open events got corrupted and we didn't know what was happening (was very hard to debug) 2. same was happening when kprobes were combined with cgroup_skb programs to trace network events ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-24 17:38 ` Yaniv Agman @ 2023-01-25 0:04 ` Alexei Starovoitov 2023-01-25 16:39 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-25 0:04 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת Alexei Starovoitov > <alexei.starovoitov@gmail.com>: > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > > > <alexei.starovoitov@gmail.com>: > > > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > > > <jakub@cloudflare.com>: > > > > > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > > > <martin.lau@linux.dev>: > > > > > > >> > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > > > >> >>> know if my program currently interrupts the run of another program - > > > > > > >> >>> is there a way to do that? > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > > > >> same cpu. > > > > > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > > > zero, should I ignore the event? > > > > > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > > > Perhaps that's needed now too. > > > > > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > > > from the bpf program - maybe even adding a helper for that? > > > > > > > > I'm not sure what the issue is here. > > > > Old preempt_disable() doesn't mean that one bpf program won't ever > > > > be interrupted by another bpf prog. > > > > Like networking bpf prog in old preempt_disable can call into something > > > > where there is a kprobe and another tracing bpf prog will be called. > > > > The same can happen after we switched to migrate_disable. > > > > > > One difference here is that in what you describe the programmer can > > > know in advance which functions might call others and avoid that or > > > use other percpu maps, but if preemption can happen between functions > > > which are not related to one another (don't have a relation of caller > > > and callee), then the programmer can't have control over it > > > > Could you give a specific example? > > Sure. I can give two examples from places where we saw such corruption: > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open, > and a percpu scratch map to prepare the event for submit. When we also > added a TRACEPOINT to trace sched_process_free (where we also use this > scratch percpu map), the security_file_open events got corrupted and > we didn't know what was happening (was very hard to debug) bpf_lsm_file_open is sleepable. We never did preempt_disable there. kprobe on security_file_open works, but it's a bit of a hack. With preempt->migrate the delayed_put_task_struct->trace_sched_process_free can happen on the same cpu if prog gets preempted in preemtable kernel, but something is fishy here. Non-sleepable bpf progs always run in rcu cs while delayed_put_task_struct is call_rcu(), so if you've used a per-cpu scratch buffer for a duration of bpf prog (either old preempt_disable or new migrate_disable) the trace_sched_process_free won't be called until prog is finished. If you're using per-cpu scratch buffer to remember the data in one execution of kprobe-bpf and consuming in the next then all bets are off. This is going to break sooner or later. > 2. same was happening when kprobes were combined with cgroup_skb > programs to trace network events irqs can interrupt old preempt_disable, so depending on where kprobe-bpf is it can run while cgroup-bpf hasn't finished. Then there are differences in kernel configs: no preempt, full preempt, preempt_rcu and differences in rcu_tasks_trace implementation over the years. You can only assume that the validity of the pointer to bpf per-cpu array and bpf per-cpu hash, but no assumption about concurrent access to the same map if you share the same map across different programs. See BPF_F_LOCK earlier suggestion. It might be what you need. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-25 0:04 ` Alexei Starovoitov @ 2023-01-25 16:39 ` Yaniv Agman 2023-01-25 18:52 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-25 16:39 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת Alexei Starovoitov > > <alexei.starovoitov@gmail.com>: > > > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com>: > > > > > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > > > > <jakub@cloudflare.com>: > > > > > > > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > > > > <martin.lau@linux.dev>: > > > > > > > >> > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > > > > >> >>> know if my program currently interrupts the run of another program - > > > > > > > >> >>> is there a way to do that? > > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > > > > >> same cpu. > > > > > > > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > > > > zero, should I ignore the event? > > > > > > > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > > > > Perhaps that's needed now too. > > > > > > > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > > > > from the bpf program - maybe even adding a helper for that? > > > > > > > > > > I'm not sure what the issue is here. > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever > > > > > be interrupted by another bpf prog. > > > > > Like networking bpf prog in old preempt_disable can call into something > > > > > where there is a kprobe and another tracing bpf prog will be called. > > > > > The same can happen after we switched to migrate_disable. > > > > > > > > One difference here is that in what you describe the programmer can > > > > know in advance which functions might call others and avoid that or > > > > use other percpu maps, but if preemption can happen between functions > > > > which are not related to one another (don't have a relation of caller > > > > and callee), then the programmer can't have control over it > > > > > > Could you give a specific example? > > > > Sure. I can give two examples from places where we saw such corruption: > > > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open, > > and a percpu scratch map to prepare the event for submit. When we also > > added a TRACEPOINT to trace sched_process_free (where we also use this > > scratch percpu map), the security_file_open events got corrupted and > > we didn't know what was happening (was very hard to debug) > > bpf_lsm_file_open is sleepable. > We never did preempt_disable there. > kprobe on security_file_open works, but it's a bit of a hack. > With preempt->migrate > the delayed_put_task_struct->trace_sched_process_free can happen > on the same cpu if prog gets preempted in preemtable kernel, but something > is fishy here. > Non-sleepable bpf progs always run in rcu cs while > delayed_put_task_struct is call_rcu(), > so if you've used a per-cpu scratch buffer for a duration > of bpf prog (either old preempt_disable or new migrate_disable) > the trace_sched_process_free won't be called until prog is finished. > > If you're using per-cpu scratch buffer to remember the data in one > execution of kprobe-bpf and consuming in the next then all bets are off. > This is going to break sooner or later. > Yes, I agree this is fishy, but we only use the per-cpu scratch map for the duration of a bpf program, assuming tail-calls are considered to be part of the same program. We've even spotted the place where preemption happens to be right after calling bpf_perf_event_submit() helper by placing bpf_printk in the security_file_open kprobe. Before the call to the helper the buffer was ok, but after it got corrupted (and there were more lines in the program after the call to this helper). > > 2. same was happening when kprobes were combined with cgroup_skb > > programs to trace network events > > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf > is it can run while cgroup-bpf hasn't finished. > Actually what we saw here is that the ingress skb program that runs from ksoftirq/idle context corrupts a percpu map shared with a raw_sys_exit tracepoint used to submit execve/openat events. > Then there are differences in kernel configs: no preempt, full preempt, > preempt_rcu and differences in rcu_tasks_trace implementation > over the years. You can only assume that the validity of the pointer > to bpf per-cpu array and bpf per-cpu hash, but no assumption > about concurrent access to the same map if you share the same map > across different programs. I think that is the main point here. Thanks for clarifying this. > See BPF_F_LOCK earlier suggestion. It might be what you need. I am not sure how exactly this can help. The size of the percpu buffer we use is 32kb and we don't use the map_update helper but the pointer returned from the lookup to update the map in several places in the program. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-25 16:39 ` Yaniv Agman @ 2023-01-25 18:52 ` Alexei Starovoitov 2023-01-25 19:59 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-25 18:52 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת Alexei Starovoitov > <alexei.starovoitov@gmail.com>: > > > > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת Alexei Starovoitov > > > <alexei.starovoitov@gmail.com>: > > > > > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com>: > > > > > > > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > > > > > <jakub@cloudflare.com>: > > > > > > > > > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > > > > > <martin.lau@linux.dev>: > > > > > > > > >> > > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > > > > > >> >>> know if my program currently interrupts the run of another program - > > > > > > > > >> >>> is there a way to do that? > > > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > > > > > >> same cpu. > > > > > > > > > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > > > > > zero, should I ignore the event? > > > > > > > > > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > > > > > Perhaps that's needed now too. > > > > > > > > > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > > > > > from the bpf program - maybe even adding a helper for that? > > > > > > > > > > > > I'm not sure what the issue is here. > > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever > > > > > > be interrupted by another bpf prog. > > > > > > Like networking bpf prog in old preempt_disable can call into something > > > > > > where there is a kprobe and another tracing bpf prog will be called. > > > > > > The same can happen after we switched to migrate_disable. > > > > > > > > > > One difference here is that in what you describe the programmer can > > > > > know in advance which functions might call others and avoid that or > > > > > use other percpu maps, but if preemption can happen between functions > > > > > which are not related to one another (don't have a relation of caller > > > > > and callee), then the programmer can't have control over it > > > > > > > > Could you give a specific example? > > > > > > Sure. I can give two examples from places where we saw such corruption: > > > > > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open, > > > and a percpu scratch map to prepare the event for submit. When we also > > > added a TRACEPOINT to trace sched_process_free (where we also use this > > > scratch percpu map), the security_file_open events got corrupted and > > > we didn't know what was happening (was very hard to debug) > > > > bpf_lsm_file_open is sleepable. > > We never did preempt_disable there. > > kprobe on security_file_open works, but it's a bit of a hack. > > With preempt->migrate > > the delayed_put_task_struct->trace_sched_process_free can happen > > on the same cpu if prog gets preempted in preemtable kernel, but something > > is fishy here. > > Non-sleepable bpf progs always run in rcu cs while > > delayed_put_task_struct is call_rcu(), > > so if you've used a per-cpu scratch buffer for a duration > > of bpf prog (either old preempt_disable or new migrate_disable) > > the trace_sched_process_free won't be called until prog is finished. > > > > If you're using per-cpu scratch buffer to remember the data in one > > execution of kprobe-bpf and consuming in the next then all bets are off. > > This is going to break sooner or later. > > > > Yes, I agree this is fishy, but we only use the per-cpu scratch map > for the duration of a bpf program, assuming tail-calls are considered > to be part of the same program. > We've even spotted the place where preemption happens to be right > after calling bpf_perf_event_submit() helper by placing bpf_printk in > the security_file_open kprobe. Before the call to the helper the > buffer was ok, but after it got corrupted (and there were more lines > in the program after the call to this helper). I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise which will send an ipi to the current cpu which may resched. I'm still missing why kprobe in security_file_open has to share per-cpu scratch buffer with kprobe in trace_sched_process_free. Why not use a scratch buffer per program? > > > > 2. same was happening when kprobes were combined with cgroup_skb > > > programs to trace network events > > > > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf > > is it can run while cgroup-bpf hasn't finished. > > > > Actually what we saw here is that the ingress skb program that runs > from ksoftirq/idle context corrupts a percpu map shared with a > raw_sys_exit tracepoint used to submit execve/openat events. > > > Then there are differences in kernel configs: no preempt, full preempt, > > preempt_rcu and differences in rcu_tasks_trace implementation > > over the years. You can only assume that the validity of the pointer > > to bpf per-cpu array and bpf per-cpu hash, but no assumption > > about concurrent access to the same map if you share the same map > > across different programs. > > I think that is the main point here. Thanks for clarifying this. > > > See BPF_F_LOCK earlier suggestion. It might be what you need. > > I am not sure how exactly this can help. The size of the percpu buffer > we use is 32kb and we don't use the map_update helper but the pointer > returned from the lookup to update the map in several places in the > program. It wasn't clear what you do with the scratch buffer. BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without a race with bpf prog that would use bpf_spin_lock() to access the element. Looks like that's not the case you had in mind. You just stream these buffers to user space via bpf_perf_event_output(). Anyway back to preempt_disable(). Think of it as a giant spin_lock that covers the whole program. In preemptable kernels it hurts tail latency and fairness, and is completely unacceptable in RT. That's why we moved to migrate_disable. Technically we can add bpf_preempt_disable() kfunc, but if we do that we'll be back to square one. The issues with preemptions and RT will reappear. So let's figure out a different solution. Why not use a scratch buffer per program ? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-25 18:52 ` Alexei Starovoitov @ 2023-01-25 19:59 ` Yaniv Agman 2023-01-26 2:22 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-25 19:59 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ד׳, 25 בינו׳ 2023 ב-20:52 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > בתאריך יום ד׳, 25 בינו׳ 2023 ב-2:04 מאת Alexei Starovoitov > > <alexei.starovoitov@gmail.com>: > > > > > > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-19:24 מאת Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com>: > > > > > > > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > > > בתאריך יום ג׳, 24 בינו׳ 2023 ב-14:30 מאת Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com>: > > > > > > > > > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-23:25 מאת Jakub Sitnicki > > > > > > > > <jakub@cloudflare.com>: > > > > > > > > > > > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote: > > > > > > > > > > בתאריך יום ב׳, 23 בינו׳ 2023 ב-22:06 מאת Martin KaFai Lau > > > > > > > > > > <martin.lau@linux.dev>: > > > > > > > > > >> > > > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote: > > > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to > > > > > > > > > >> >>> know if my program currently interrupts the run of another program - > > > > > > > > > >> >>> is there a way to do that? > > > > > > > > > >> May be a percpu atomic counter to see if the bpf prog has been re-entered on the > > > > > > > > > >> same cpu. > > > > > > > > > > > > > > > > > > > > Not sure I understand how this will help. If I want to save local > > > > > > > > > > program data on a percpu map and I see that the counter is bigger then > > > > > > > > > > zero, should I ignore the event? > > > > > > > > > > > > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating > > > > > > > > > an entry atomically. But it can't be used with PERCPU maps today. > > > > > > > > > Perhaps that's needed now too. > > > > > > > > > > > > > > > > Yep. I think what is needed here is the ability to disable preemption > > > > > > > > from the bpf program - maybe even adding a helper for that? > > > > > > > > > > > > > > I'm not sure what the issue is here. > > > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever > > > > > > > be interrupted by another bpf prog. > > > > > > > Like networking bpf prog in old preempt_disable can call into something > > > > > > > where there is a kprobe and another tracing bpf prog will be called. > > > > > > > The same can happen after we switched to migrate_disable. > > > > > > > > > > > > One difference here is that in what you describe the programmer can > > > > > > know in advance which functions might call others and avoid that or > > > > > > use other percpu maps, but if preemption can happen between functions > > > > > > which are not related to one another (don't have a relation of caller > > > > > > and callee), then the programmer can't have control over it > > > > > > > > > > Could you give a specific example? > > > > > > > > Sure. I can give two examples from places where we saw such corruption: > > > > > > > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open, > > > > and a percpu scratch map to prepare the event for submit. When we also > > > > added a TRACEPOINT to trace sched_process_free (where we also use this > > > > scratch percpu map), the security_file_open events got corrupted and > > > > we didn't know what was happening (was very hard to debug) > > > > > > bpf_lsm_file_open is sleepable. > > > We never did preempt_disable there. > > > kprobe on security_file_open works, but it's a bit of a hack. > > > With preempt->migrate > > > the delayed_put_task_struct->trace_sched_process_free can happen > > > on the same cpu if prog gets preempted in preemtable kernel, but something > > > is fishy here. > > > Non-sleepable bpf progs always run in rcu cs while > > > delayed_put_task_struct is call_rcu(), > > > so if you've used a per-cpu scratch buffer for a duration > > > of bpf prog (either old preempt_disable or new migrate_disable) > > > the trace_sched_process_free won't be called until prog is finished. > > > > > > If you're using per-cpu scratch buffer to remember the data in one > > > execution of kprobe-bpf and consuming in the next then all bets are off. > > > This is going to break sooner or later. > > > > > > > Yes, I agree this is fishy, but we only use the per-cpu scratch map > > for the duration of a bpf program, assuming tail-calls are considered > > to be part of the same program. > > We've even spotted the place where preemption happens to be right > > after calling bpf_perf_event_submit() helper by placing bpf_printk in > > the security_file_open kprobe. Before the call to the helper the > > buffer was ok, but after it got corrupted (and there were more lines > > in the program after the call to this helper). > > I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise > which will send an ipi to the current cpu which may resched. > > I'm still missing why kprobe in security_file_open has to share > per-cpu scratch buffer with kprobe in trace_sched_process_free. > Why not use a scratch buffer per program? > Well, since all the documentation I read about eBPF says that BPF programs are non-preemptible, I assumed it will be ok if we use the same per-cpu scratch buffer for all programs as long as we use it only in the scope of the program, but now I understand this assumption (and documentation) was wrong. > > > > > > 2. same was happening when kprobes were combined with cgroup_skb > > > > programs to trace network events > > > > > > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf > > > is it can run while cgroup-bpf hasn't finished. > > > > > > > Actually what we saw here is that the ingress skb program that runs > > from ksoftirq/idle context corrupts a percpu map shared with a > > raw_sys_exit tracepoint used to submit execve/openat events. > > > > > Then there are differences in kernel configs: no preempt, full preempt, > > > preempt_rcu and differences in rcu_tasks_trace implementation > > > over the years. You can only assume that the validity of the pointer > > > to bpf per-cpu array and bpf per-cpu hash, but no assumption > > > about concurrent access to the same map if you share the same map > > > across different programs. > > > > I think that is the main point here. Thanks for clarifying this. > > > > > See BPF_F_LOCK earlier suggestion. It might be what you need. > > > > I am not sure how exactly this can help. The size of the percpu buffer > > we use is 32kb and we don't use the map_update helper but the pointer > > returned from the lookup to update the map in several places in the > > program. > > It wasn't clear what you do with the scratch buffer. > BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without > a race with bpf prog that would use bpf_spin_lock() to access the element. > Looks like that's not the case you had in mind. > You just stream these buffers to user space via bpf_perf_event_output(). > exactly > Anyway back to preempt_disable(). Think of it as a giant spin_lock > that covers the whole program. In preemptable kernels it hurts > tail latency and fairness, and is completely unacceptable in RT. > That's why we moved to migrate_disable. > Technically we can add bpf_preempt_disable() kfunc, but if we do that > we'll be back to square one. The issues with preemptions and RT > will reappear. So let's figure out a different solution. > Why not use a scratch buffer per program ? Totally understand the reason for avoiding preemption disable, especially in RT kernels. I believe the answer for why not to use a scratch buffer per program will simply be memory space. In our use-case, Tracee [1], we let the user choose whatever events to trace for a specific workload. This list of events is very big, and we have many BPF programs attached to different places of the kernel. Let's assume that we have 100 events, and for each event we have a different BPF program. Then having 32kb per-cpu scratch buffers translates to 3.2MB per one cpu, and ~100MB per 32 CPUs, which is more common for our case. Since we always add new events to Tracee, this will also not be scalable. Yet, if there is no other solution, I believe we will go in that direction [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-25 19:59 ` Yaniv Agman @ 2023-01-26 2:22 ` Alexei Starovoitov 2023-01-26 6:59 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-26 2:22 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock > > that covers the whole program. In preemptable kernels it hurts > > tail latency and fairness, and is completely unacceptable in RT. > > That's why we moved to migrate_disable. > > Technically we can add bpf_preempt_disable() kfunc, but if we do that > > we'll be back to square one. The issues with preemptions and RT > > will reappear. So let's figure out a different solution. > > Why not use a scratch buffer per program ? > > Totally understand the reason for avoiding preemption disable, > especially in RT kernels. > I believe the answer for why not to use a scratch buffer per program > will simply be memory space. > In our use-case, Tracee [1], we let the user choose whatever events to > trace for a specific workload. > This list of events is very big, and we have many BPF programs > attached to different places of the kernel. > Let's assume that we have 100 events, and for each event we have a > different BPF program. > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one > cpu, and ~100MB per 32 CPUs, which is more common for our case. Well, 100 bpf progs consume at least a page each, so you might want one program attached to all events. > Since we always add new events to Tracee, this will also not be scalable. > Yet, if there is no other solution, I believe we will go in that direction > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ? Insead of scratch_map per program, use atomic per-cpu counter for recursion. You'll have 3 levels in the worst case. So it will be: BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3); On prog entry increment the recursion counter, on exit decrement. And use that particular scratch_t in the prog. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-26 2:22 ` Alexei Starovoitov @ 2023-01-26 6:59 ` Yaniv Agman 2023-01-26 15:29 ` Alexei Starovoitov 0 siblings, 1 reply; 21+ messages in thread From: Yaniv Agman @ 2023-01-26 6:59 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock > > > that covers the whole program. In preemptable kernels it hurts > > > tail latency and fairness, and is completely unacceptable in RT. > > > That's why we moved to migrate_disable. > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that > > > we'll be back to square one. The issues with preemptions and RT > > > will reappear. So let's figure out a different solution. > > > Why not use a scratch buffer per program ? > > > > Totally understand the reason for avoiding preemption disable, > > especially in RT kernels. > > I believe the answer for why not to use a scratch buffer per program > > will simply be memory space. > > In our use-case, Tracee [1], we let the user choose whatever events to > > trace for a specific workload. > > This list of events is very big, and we have many BPF programs > > attached to different places of the kernel. > > Let's assume that we have 100 events, and for each event we have a > > different BPF program. > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one > > cpu, and ~100MB per 32 CPUs, which is more common for our case. > > Well, 100 bpf progs consume at least a page each, > so you might want one program attached to all events. > Unfortunately, I don't think that would be possible. We still need to support kernels with 4096 instructions limit. We may add some generic programs for events with simpler logic, but even then, support for bpf cookies needed for such programs was only added in more recent kernels. > > Since we always add new events to Tracee, this will also not be scalable. > > Yet, if there is no other solution, I believe we will go in that direction > > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ? We actually have 3 different percpu maps there shared between different programs so we will have to take care of them all. > > Insead of scratch_map per program, use atomic per-cpu counter > for recursion. > You'll have 3 levels in the worst case. Is it guaranteed that no more than 3 levels exist? I suggested a similar solution with 2 levels at the beginning of this thread, but Yonghong Song replied that there is no restriction on this. > So it will be: > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3); > On prog entry increment the recursion counter, on exit decrement. > And use that particular scratch_t in the prog. I'm just afraid of potential deadlocks. For sure we will need to decrement the counter on each return path, but can it happen that a bpf program crashes, leaving the counter non-zero? I know that's the job of the verifier to make sure such things won't happen, but can we be sure of that? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-26 6:59 ` Yaniv Agman @ 2023-01-26 15:29 ` Alexei Starovoitov 2023-01-26 17:51 ` Yaniv Agman 0 siblings, 1 reply; 21+ messages in thread From: Alexei Starovoitov @ 2023-01-26 15:29 UTC (permalink / raw) To: Yaniv Agman; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת Alexei Starovoitov > <alexei.starovoitov@gmail.com>: > > > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock > > > > that covers the whole program. In preemptable kernels it hurts > > > > tail latency and fairness, and is completely unacceptable in RT. > > > > That's why we moved to migrate_disable. > > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that > > > > we'll be back to square one. The issues with preemptions and RT > > > > will reappear. So let's figure out a different solution. > > > > Why not use a scratch buffer per program ? > > > > > > Totally understand the reason for avoiding preemption disable, > > > especially in RT kernels. > > > I believe the answer for why not to use a scratch buffer per program > > > will simply be memory space. > > > In our use-case, Tracee [1], we let the user choose whatever events to > > > trace for a specific workload. > > > This list of events is very big, and we have many BPF programs > > > attached to different places of the kernel. > > > Let's assume that we have 100 events, and for each event we have a > > > different BPF program. > > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one > > > cpu, and ~100MB per 32 CPUs, which is more common for our case. > > > > Well, 100 bpf progs consume at least a page each, > > so you might want one program attached to all events. > > > > Unfortunately, I don't think that would be possible. We still need to > support kernels with 4096 instructions limit. > We may add some generic programs for events with simpler logic, but > even then, support for bpf cookies needed for such programs was only > added in more recent kernels. > > > > Since we always add new events to Tracee, this will also not be scalable. > > > Yet, if there is no other solution, I believe we will go in that direction > > > > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c > > > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ? > > We actually have 3 different percpu maps there shared between > different programs so we will have to take care of them all. > > > > > Insead of scratch_map per program, use atomic per-cpu counter > > for recursion. > > You'll have 3 levels in the worst case. > > Is it guaranteed that no more than 3 levels exist? > I suggested a similar solution with 2 levels at the beginning of this > thread, but Yonghong Song replied that there is no restriction on > this. There are no restrictions, but I doubt you can construct a case where you'll see more than 3 in practice. > > So it will be: > > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3); > > On prog entry increment the recursion counter, on exit decrement. > > And use that particular scratch_t in the prog. > > I'm just afraid of potential deadlocks. For sure we will need to > decrement the counter on each return path, but can it happen that a > bpf program crashes, leaving the counter non-zero? I know that's the > job of the verifier to make sure such things won't happen, but can we > be sure of that? if you don't trust the verifier you shouldn't be using bpf in the first place. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Are BPF programs preemptible? 2023-01-26 15:29 ` Alexei Starovoitov @ 2023-01-26 17:51 ` Yaniv Agman 0 siblings, 0 replies; 21+ messages in thread From: Yaniv Agman @ 2023-01-26 17:51 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Jakub Sitnicki, Martin KaFai Lau, bpf, Yonghong Song בתאריך יום ה׳, 26 בינו׳ 2023 ב-17:29 מאת Alexei Starovoitov <alexei.starovoitov@gmail.com>: > > On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > בתאריך יום ה׳, 26 בינו׳ 2023 ב-4:22 מאת Alexei Starovoitov > > <alexei.starovoitov@gmail.com>: > > > > > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@gmail.com> wrote: > > > > > > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock > > > > > that covers the whole program. In preemptable kernels it hurts > > > > > tail latency and fairness, and is completely unacceptable in RT. > > > > > That's why we moved to migrate_disable. > > > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that > > > > > we'll be back to square one. The issues with preemptions and RT > > > > > will reappear. So let's figure out a different solution. > > > > > Why not use a scratch buffer per program ? > > > > > > > > Totally understand the reason for avoiding preemption disable, > > > > especially in RT kernels. > > > > I believe the answer for why not to use a scratch buffer per program > > > > will simply be memory space. > > > > In our use-case, Tracee [1], we let the user choose whatever events to > > > > trace for a specific workload. > > > > This list of events is very big, and we have many BPF programs > > > > attached to different places of the kernel. > > > > Let's assume that we have 100 events, and for each event we have a > > > > different BPF program. > > > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one > > > > cpu, and ~100MB per 32 CPUs, which is more common for our case. > > > > > > Well, 100 bpf progs consume at least a page each, > > > so you might want one program attached to all events. > > > > > > > Unfortunately, I don't think that would be possible. We still need to > > support kernels with 4096 instructions limit. > > We may add some generic programs for events with simpler logic, but > > even then, support for bpf cookies needed for such programs was only > > added in more recent kernels. > > > > > > Since we always add new events to Tracee, this will also not be scalable. > > > > Yet, if there is no other solution, I believe we will go in that direction > > > > > > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c > > > > > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ? > > > > We actually have 3 different percpu maps there shared between > > different programs so we will have to take care of them all. > > > > > > > > Insead of scratch_map per program, use atomic per-cpu counter > > > for recursion. > > > You'll have 3 levels in the worst case. > > > > Is it guaranteed that no more than 3 levels exist? > > I suggested a similar solution with 2 levels at the beginning of this > > thread, but Yonghong Song replied that there is no restriction on > > this. > > There are no restrictions, but I doubt you can construct a case > where you'll see more than 3 in practice. > Ok, I see. > > > So it will be: > > > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3); > > > On prog entry increment the recursion counter, on exit decrement. > > > And use that particular scratch_t in the prog. > > > > I'm just afraid of potential deadlocks. For sure we will need to > > decrement the counter on each return path, but can it happen that a > > bpf program crashes, leaving the counter non-zero? I know that's the > > job of the verifier to make sure such things won't happen, but can we > > be sure of that? > > if you don't trust the verifier you shouldn't be using bpf in the first place. Oh believe me I trust it, but I must be honest and say that I don't know about all the intricacies of the kernel, especially when it comes to scheduling and preemption. So I am just trying to think if there is a path where this mechanism can break, and express my concern about the potential for errors or bugs that may lead to unexpected behaviors. ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2023-01-26 17:52 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-01-23 9:21 Are BPF programs preemptible? Yaniv Agman 2023-01-23 10:46 ` Jakub Sitnicki 2023-01-23 12:30 ` Yaniv Agman 2023-01-23 17:02 ` Yonghong Song 2023-01-23 17:32 ` Yaniv Agman 2023-01-23 20:06 ` Martin KaFai Lau 2023-01-23 21:01 ` Yaniv Agman 2023-01-23 21:22 ` Jakub Sitnicki 2023-01-23 21:56 ` Yaniv Agman 2023-01-24 12:30 ` Alexei Starovoitov 2023-01-24 15:47 ` Yaniv Agman 2023-01-24 17:24 ` Alexei Starovoitov 2023-01-24 17:38 ` Yaniv Agman 2023-01-25 0:04 ` Alexei Starovoitov 2023-01-25 16:39 ` Yaniv Agman 2023-01-25 18:52 ` Alexei Starovoitov 2023-01-25 19:59 ` Yaniv Agman 2023-01-26 2:22 ` Alexei Starovoitov 2023-01-26 6:59 ` Yaniv Agman 2023-01-26 15:29 ` Alexei Starovoitov 2023-01-26 17:51 ` Yaniv Agman
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.