* [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
@ 2024-09-10 3:43 Yonghong Song
2024-09-10 5:34 ` Andrii Nakryiko
0 siblings, 1 reply; 8+ messages in thread
From: Yonghong Song @ 2024-09-10 3:43 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Salvatore Benedetto
Salvatore Benedetto reported an issue where, when doing syscall tracepoint
tracing, the printed kernel stack is empty. For example, with the following
command line
bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
the output will be
===
Kernel Stack
===
Further analysis shows that the pt_regs used for bpf syscall tracepoint
tracing is the one constructed during the user->kernel transition.
The call stack looks like
perf_syscall_enter+0x88/0x7c0
trace_sys_enter+0x41/0x80
syscall_trace_enter+0x100/0x160
do_syscall_64+0x38/0xf0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The ip address stored in pt_regs is a user-space address, hence no kernel
stack is printed.
To fix the issue, we need a kernel ip address in pt_regs.
There are already a few cases like this in the kernel tree. For example,
in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply an ip address or to use the ip address to
construct a call stack.
This patch follows the above example by using a fake pt_regs.
The pt_regs is stored on the local stack since syscall tracepoint
tracing runs in process context, so there is no possibility that
concurrent syscall tracepoint tracing could interfere with it.
This is similar to the perf_fetch_caller_regs() use case in
kernel/trace/trace_event_perf.c, where perf_ftrace_function_call()
uses a local pt_regs.
With this patch, for the above bpftrace script, I got the following output
===
Kernel Stack
syscall_trace_enter+407
syscall_trace_enter+407
do_syscall_64+74
entry_SYSCALL_64_after_hwframe+75
===
Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
kernel/trace/trace_syscalls.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..063f51952d49 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
int syscall_nr;
unsigned long args[SYSCALL_DEFINE_MAXARGS];
} __aligned(8) param;
+ struct pt_regs fake_regs;
int i;
BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
- *(struct pt_regs **)&param = regs;
+ memset(&fake_regs, 0, sizeof(fake_regs));
+ perf_fetch_caller_regs(&fake_regs);
+ *(struct pt_regs **)&param = &fake_regs;
param.syscall_nr = rec->nr;
for (i = 0; i < sys_data->nb_args; i++)
param.args[i] = rec->args[i];
--
2.43.5
^ permalink raw reply related [flat|nested] 8+ messages in thread

* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
2024-09-10 3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
@ 2024-09-10 5:34 ` Andrii Nakryiko
2024-09-10 5:42 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song
0 siblings, 2 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 5:34 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Salvatore Benedetto

On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> Salvatore Benedetto reported an issue that when doing syscall tracepoint
> tracing the kernel stack is empty. For example, using the following
> command line
> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
> the output will be
> ===
> Kernel Stack
> ===
>
> Further analysis shows that pt_regs used for bpf syscall tracepoint
> tracing is from the one constructed during user->kernel transition.
> The call stack looks like
> perf_syscall_enter+0x88/0x7c0
> trace_sys_enter+0x41/0x80
> syscall_trace_enter+0x100/0x160
> do_syscall_64+0x38/0xf0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The ip address stored in pt_regs is from user space hence no kernel
> stack is printed.
>
> To fix the issue, we need to use kernel address from pt_regs.
> In kernel repo, there are already a few cases like this. For example,
> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
> instances are used to supply ip address or use ip address to construct
> call stack.
>
> The patch follows the above example by using a fake pt_regs.
> The pt_regs is stored in local stack since the syscall tracepoint
> tracing is in process context and there are no possibility that
> different concurrent syscall tracepoint tracing could mess up with each
> other. This is similar to a perf_fetch_caller_regs() use case in
> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
> where a local pt_regs is used.
>
> With this patch, for the above bpftrace script, I got the following output
> ===
> Kernel Stack
>
> syscall_trace_enter+407
> syscall_trace_enter+407
> do_syscall_64+74
> entry_SYSCALL_64_after_hwframe+75
> ===
>
> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> kernel/trace/trace_syscalls.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>

Note, we need to solve the same for perf_call_bpf_exit().

pw-bot: cr

> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 9c581d6da843..063f51952d49 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re

let's also drop struct pt_regs * argument into
perf_call_bpf_{enter,exit}(), they are not actually used anymore

> int syscall_nr;
> unsigned long args[SYSCALL_DEFINE_MAXARGS];
> } __aligned(8) param;
> + struct pt_regs fake_regs;
> int i;
>
> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>
> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
> - *(struct pt_regs **)&param = regs;
> + memset(&fake_regs, 0, sizeof(fake_regs));

sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
336 bytes, so these memset(0) calls are not free for sure.

But we don't need to do this unnecessary work all the time.

I initially was going to suggest to use get_bpf_raw_tp_regs() from
kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
memset(0) and used to initialize these minimal "fake regs".

But, it turns out we don't need to do even that. Note
perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
and if you pass a valid pointer there, it will return "fake regs"
struct to be used. We already use that functionality in
perf_trace_##call in include/trace/perf.h (i.e., non-syscall
tracepoints), so this seems to be a perfect fit.

> + perf_fetch_caller_regs(&fake_regs);
> + *(struct pt_regs **)&param = &fake_regs;
> param.syscall_nr = rec->nr;
> for (i = 0; i < sys_data->nb_args; i++)
> param.args[i] = rec->args[i];
> --
> 2.43.5

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing 2024-09-10 5:34 ` Andrii Nakryiko @ 2024-09-10 5:42 ` Andrii Nakryiko 2024-09-10 15:25 ` Yonghong Song 2024-09-10 15:23 ` Yonghong Song 1 sibling, 1 reply; 8+ messages in thread From: Andrii Nakryiko @ 2024-09-10 5:42 UTC (permalink / raw) To: Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, Martin KaFai Lau, Salvatore Benedetto On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote: > > > > Salvatore Benedetto reported an issue that when doing syscall tracepoint > > tracing the kernel stack is empty. For example, using the following > > command line > > bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' > > the output will be > > === > > Kernel Stack > > === > > > > Further analysis shows that pt_regs used for bpf syscall tracepoint > > tracing is from the one constructed during user->kernel transition. > > The call stack looks like > > perf_syscall_enter+0x88/0x7c0 > > trace_sys_enter+0x41/0x80 > > syscall_trace_enter+0x100/0x160 > > do_syscall_64+0x38/0xf0 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > The ip address stored in pt_regs is from user space hence no kernel > > stack is printed. > > > > To fix the issue, we need to use kernel address from pt_regs. > > In kernel repo, there are already a few cases like this. For example, > > in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) > > instances are used to supply ip address or use ip address to construct > > call stack. > > > > The patch follows the above example by using a fake pt_regs. 
> > The pt_regs is stored in local stack since the syscall tracepoint > > tracing is in process context and there are no possibility that > > different concurrent syscall tracepoint tracing could mess up with each > > other. This is similar to a perf_fetch_caller_regs() use case in > > kernel/trace/trace_event_perf.c with function perf_ftrace_function_call() > > where a local pt_regs is used. > > > > With this patch, for the above bpftrace script, I got the following output > > === > > Kernel Stack > > > > syscall_trace_enter+407 > > syscall_trace_enter+407 > > do_syscall_64+74 > > entry_SYSCALL_64_after_hwframe+75 > > === > > > > Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> > > Suggested-by: Andrii Nakryiko <andrii@kernel.org> > > Signed-off-by: Yonghong Song <yonghong.song@linux.dev> > > --- > > kernel/trace/trace_syscalls.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > Note, we need to solve the same for perf_call_bpf_exit(). > > pw-bot: cr > BTW, we lived with this bug for years, so I suggest basing your fix on top of bpf-next/master, no bpf/master, which will give people a bit of time to validate that the fix works as expected and doesn't produce any undesirable side effects, before this makes it into the final Linux release. > > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c > > index 9c581d6da843..063f51952d49 100644 > > --- a/kernel/trace/trace_syscalls.c > > +++ b/kernel/trace/trace_syscalls.c > > @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re > > let's also drop struct pt_regs * argument into > perf_call_bpf_{enter,exit}(), they are not actually used anymore > > > int syscall_nr; > > unsigned long args[SYSCALL_DEFINE_MAXARGS]; > > } __aligned(8) param; > > + struct pt_regs fake_regs; > > int i; > > > > BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *)); > > > > /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. 
¶m) */ > > - *(struct pt_regs **)¶m = regs; > > + memset(&fake_regs, 0, sizeof(fake_regs)); > > sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping > 336 bytes, so these memset(0) calls are not free for sure. > > But we don't need to do this unnecessary work all the time. > > I initially was going to suggest to use get_bpf_raw_tp_regs() from > kernel/trace/bpf_trace.c to get a temporary pt_regs that was already > memset(0) and used to initialize these minimal "fake regs". > > But, it turns out we don't need to do even that. Note > perf_trace_buf_alloc(), it has `struct pt_regs **` second argument, > and if you pass a valid pointer there, it will return "fake regs" > struct to be used. We already use that functionality in > perf_trace_##call in include/trace/perf.h (i.e., non-syscall > tracepoints), so this seems to be a perfect fit. > > > + perf_fetch_caller_regs(&fake_regs); > > + *(struct pt_regs **)¶m = &fake_regs; > > param.syscall_nr = rec->nr; > > for (i = 0; i < sys_data->nb_args; i++) > > param.args[i] = rec->args[i]; > > -- > > 2.43.5 > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing 2024-09-10 5:42 ` Andrii Nakryiko @ 2024-09-10 15:25 ` Yonghong Song 2024-09-10 16:50 ` Andrii Nakryiko 0 siblings, 1 reply; 8+ messages in thread From: Yonghong Song @ 2024-09-10 15:25 UTC (permalink / raw) To: Andrii Nakryiko Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, Martin KaFai Lau, Salvatore Benedetto On 9/9/24 10:42 PM, Andrii Nakryiko wrote: > On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: >> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote: >>> Salvatore Benedetto reported an issue that when doing syscall tracepoint >>> tracing the kernel stack is empty. For example, using the following >>> command line >>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' >>> the output will be >>> === >>> Kernel Stack >>> === >>> >>> Further analysis shows that pt_regs used for bpf syscall tracepoint >>> tracing is from the one constructed during user->kernel transition. >>> The call stack looks like >>> perf_syscall_enter+0x88/0x7c0 >>> trace_sys_enter+0x41/0x80 >>> syscall_trace_enter+0x100/0x160 >>> do_syscall_64+0x38/0xf0 >>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>> >>> The ip address stored in pt_regs is from user space hence no kernel >>> stack is printed. >>> >>> To fix the issue, we need to use kernel address from pt_regs. >>> In kernel repo, there are already a few cases like this. For example, >>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) >>> instances are used to supply ip address or use ip address to construct >>> call stack. >>> >>> The patch follows the above example by using a fake pt_regs. 
>>> The pt_regs is stored in local stack since the syscall tracepoint >>> tracing is in process context and there are no possibility that >>> different concurrent syscall tracepoint tracing could mess up with each >>> other. This is similar to a perf_fetch_caller_regs() use case in >>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call() >>> where a local pt_regs is used. >>> >>> With this patch, for the above bpftrace script, I got the following output >>> === >>> Kernel Stack >>> >>> syscall_trace_enter+407 >>> syscall_trace_enter+407 >>> do_syscall_64+74 >>> entry_SYSCALL_64_after_hwframe+75 >>> === >>> >>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> >>> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> >>> --- >>> kernel/trace/trace_syscalls.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >> Note, we need to solve the same for perf_call_bpf_exit(). >> >> pw-bot: cr >> > BTW, we lived with this bug for years, so I suggest basing your fix on > top of bpf-next/master, no bpf/master, which will give people a bit of > time to validate that the fix works as expected and doesn't produce > any undesirable side effects, before this makes it into the final > Linux release. Yes, I did. See I indeed use 'bpf-next' in subject above. 
> >>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c >>> index 9c581d6da843..063f51952d49 100644 >>> --- a/kernel/trace/trace_syscalls.c >>> +++ b/kernel/trace/trace_syscalls.c >>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re >> let's also drop struct pt_regs * argument into >> perf_call_bpf_{enter,exit}(), they are not actually used anymore >> >>> int syscall_nr; >>> unsigned long args[SYSCALL_DEFINE_MAXARGS]; >>> } __aligned(8) param; >>> + struct pt_regs fake_regs; >>> int i; >>> >>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *)); >>> >>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */ >>> - *(struct pt_regs **)¶m = regs; >>> + memset(&fake_regs, 0, sizeof(fake_regs)); >> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping >> 336 bytes, so these memset(0) calls are not free for sure. >> >> But we don't need to do this unnecessary work all the time. >> >> I initially was going to suggest to use get_bpf_raw_tp_regs() from >> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already >> memset(0) and used to initialize these minimal "fake regs". >> >> But, it turns out we don't need to do even that. Note >> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument, >> and if you pass a valid pointer there, it will return "fake regs" >> struct to be used. We already use that functionality in >> perf_trace_##call in include/trace/perf.h (i.e., non-syscall >> tracepoints), so this seems to be a perfect fit. >> >>> + perf_fetch_caller_regs(&fake_regs); >>> + *(struct pt_regs **)¶m = &fake_regs; >>> param.syscall_nr = rec->nr; >>> for (i = 0; i < sys_data->nb_args; i++) >>> param.args[i] = rec->args[i]; >>> -- >>> 2.43.5 >>> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing 2024-09-10 15:25 ` Yonghong Song @ 2024-09-10 16:50 ` Andrii Nakryiko 2024-09-10 18:22 ` Yonghong Song 0 siblings, 1 reply; 8+ messages in thread From: Andrii Nakryiko @ 2024-09-10 16:50 UTC (permalink / raw) To: Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, Martin KaFai Lau, Salvatore Benedetto On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@linux.dev> wrote: > > > On 9/9/24 10:42 PM, Andrii Nakryiko wrote: > > On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > >> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote: > >>> Salvatore Benedetto reported an issue that when doing syscall tracepoint > >>> tracing the kernel stack is empty. For example, using the following > >>> command line > >>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' > >>> the output will be > >>> === > >>> Kernel Stack > >>> === > >>> > >>> Further analysis shows that pt_regs used for bpf syscall tracepoint > >>> tracing is from the one constructed during user->kernel transition. > >>> The call stack looks like > >>> perf_syscall_enter+0x88/0x7c0 > >>> trace_sys_enter+0x41/0x80 > >>> syscall_trace_enter+0x100/0x160 > >>> do_syscall_64+0x38/0xf0 > >>> entry_SYSCALL_64_after_hwframe+0x76/0x7e > >>> > >>> The ip address stored in pt_regs is from user space hence no kernel > >>> stack is printed. > >>> > >>> To fix the issue, we need to use kernel address from pt_regs. > >>> In kernel repo, there are already a few cases like this. For example, > >>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) > >>> instances are used to supply ip address or use ip address to construct > >>> call stack. > >>> > >>> The patch follows the above example by using a fake pt_regs. 
> >>> The pt_regs is stored in local stack since the syscall tracepoint > >>> tracing is in process context and there are no possibility that > >>> different concurrent syscall tracepoint tracing could mess up with each > >>> other. This is similar to a perf_fetch_caller_regs() use case in > >>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call() > >>> where a local pt_regs is used. > >>> > >>> With this patch, for the above bpftrace script, I got the following output > >>> === > >>> Kernel Stack > >>> > >>> syscall_trace_enter+407 > >>> syscall_trace_enter+407 > >>> do_syscall_64+74 > >>> entry_SYSCALL_64_after_hwframe+75 > >>> === > >>> > >>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> > >>> Suggested-by: Andrii Nakryiko <andrii@kernel.org> > >>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> > >>> --- > >>> kernel/trace/trace_syscalls.c | 5 ++++- > >>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>> > >> Note, we need to solve the same for perf_call_bpf_exit(). > >> > >> pw-bot: cr > >> > > BTW, we lived with this bug for years, so I suggest basing your fix on > > top of bpf-next/master, no bpf/master, which will give people a bit of > > time to validate that the fix works as expected and doesn't produce > > any undesirable side effects, before this makes it into the final > > Linux release. > > Yes, I did. See I indeed use 'bpf-next' in subject above. Huh, strange, I actually tried to apply your patch to bpf-next/master and it didn't apply cleanly. It did apply to bpf/master, though, which is why I assumed you based it off of bpf/master. 
> > > > >>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c > >>> index 9c581d6da843..063f51952d49 100644 > >>> --- a/kernel/trace/trace_syscalls.c > >>> +++ b/kernel/trace/trace_syscalls.c > >>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re > >> let's also drop struct pt_regs * argument into > >> perf_call_bpf_{enter,exit}(), they are not actually used anymore > >> > >>> int syscall_nr; > >>> unsigned long args[SYSCALL_DEFINE_MAXARGS]; > >>> } __aligned(8) param; > >>> + struct pt_regs fake_regs; > >>> int i; > >>> > >>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *)); > >>> > >>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */ > >>> - *(struct pt_regs **)¶m = regs; > >>> + memset(&fake_regs, 0, sizeof(fake_regs)); > >> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping > >> 336 bytes, so these memset(0) calls are not free for sure. > >> > >> But we don't need to do this unnecessary work all the time. > >> > >> I initially was going to suggest to use get_bpf_raw_tp_regs() from > >> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already > >> memset(0) and used to initialize these minimal "fake regs". > >> > >> But, it turns out we don't need to do even that. Note > >> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument, > >> and if you pass a valid pointer there, it will return "fake regs" > >> struct to be used. We already use that functionality in > >> perf_trace_##call in include/trace/perf.h (i.e., non-syscall > >> tracepoints), so this seems to be a perfect fit. > >> > >>> + perf_fetch_caller_regs(&fake_regs); > >>> + *(struct pt_regs **)¶m = &fake_regs; > >>> param.syscall_nr = rec->nr; > >>> for (i = 0; i < sys_data->nb_args; i++) > >>> param.args[i] = rec->args[i]; > >>> -- > >>> 2.43.5 > >>> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing 2024-09-10 16:50 ` Andrii Nakryiko @ 2024-09-10 18:22 ` Yonghong Song 2024-09-10 18:25 ` Andrii Nakryiko 0 siblings, 1 reply; 8+ messages in thread From: Yonghong Song @ 2024-09-10 18:22 UTC (permalink / raw) To: Andrii Nakryiko Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, Martin KaFai Lau, Salvatore Benedetto On 9/10/24 9:50 AM, Andrii Nakryiko wrote: > On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@linux.dev> wrote: >> >> On 9/9/24 10:42 PM, Andrii Nakryiko wrote: >>> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko >>> <andrii.nakryiko@gmail.com> wrote: >>>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote: >>>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint >>>>> tracing the kernel stack is empty. For example, using the following >>>>> command line >>>>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' >>>>> the output will be >>>>> === >>>>> Kernel Stack >>>>> === >>>>> >>>>> Further analysis shows that pt_regs used for bpf syscall tracepoint >>>>> tracing is from the one constructed during user->kernel transition. >>>>> The call stack looks like >>>>> perf_syscall_enter+0x88/0x7c0 >>>>> trace_sys_enter+0x41/0x80 >>>>> syscall_trace_enter+0x100/0x160 >>>>> do_syscall_64+0x38/0xf0 >>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>>> >>>>> The ip address stored in pt_regs is from user space hence no kernel >>>>> stack is printed. >>>>> >>>>> To fix the issue, we need to use kernel address from pt_regs. >>>>> In kernel repo, there are already a few cases like this. For example, >>>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) >>>>> instances are used to supply ip address or use ip address to construct >>>>> call stack. >>>>> >>>>> The patch follows the above example by using a fake pt_regs. 
>>>>> The pt_regs is stored in local stack since the syscall tracepoint >>>>> tracing is in process context and there are no possibility that >>>>> different concurrent syscall tracepoint tracing could mess up with each >>>>> other. This is similar to a perf_fetch_caller_regs() use case in >>>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call() >>>>> where a local pt_regs is used. >>>>> >>>>> With this patch, for the above bpftrace script, I got the following output >>>>> === >>>>> Kernel Stack >>>>> >>>>> syscall_trace_enter+407 >>>>> syscall_trace_enter+407 >>>>> do_syscall_64+74 >>>>> entry_SYSCALL_64_after_hwframe+75 >>>>> === >>>>> >>>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> >>>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> >>>>> --- >>>>> kernel/trace/trace_syscalls.c | 5 ++++- >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>> >>>> Note, we need to solve the same for perf_call_bpf_exit(). >>>> >>>> pw-bot: cr >>>> >>> BTW, we lived with this bug for years, so I suggest basing your fix on >>> top of bpf-next/master, no bpf/master, which will give people a bit of >>> time to validate that the fix works as expected and doesn't produce >>> any undesirable side effects, before this makes it into the final >>> Linux release. >> Yes, I did. See I indeed use 'bpf-next' in subject above. > Huh, strange, I actually tried to apply your patch to bpf-next/master > and it didn't apply cleanly. It did apply to bpf/master, though, which > is why I assumed you based it off of bpf/master. Interesting. 
The following is my git history: 7b71206057440d9559ecb9cd02d891f46927b272 (HEAD -> trace_syscall) bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing 41d0c4677feee1ea063e0f2c2af72dc953b1f1cc (origin/master, origin/HEAD, master) libbpf: Fix some typos in comments 72d8508ecd3b081dba03ec00930c6b07c1ad55d3 MAINTAINERS: BPF ARC JIT: Update my e-mail address bee109b7b3e50739b88252a219fa07ecd78ad628 bpf: Fix error message on kfunc arg type mismatch ... Not sure what is going on ... > >>>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c >>>>> index 9c581d6da843..063f51952d49 100644 >>>>> --- a/kernel/trace/trace_syscalls.c >>>>> +++ b/kernel/trace/trace_syscalls.c >>>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re >>>> let's also drop struct pt_regs * argument into >>>> perf_call_bpf_{enter,exit}(), they are not actually used anymore >>>> >>>>> int syscall_nr; >>>>> unsigned long args[SYSCALL_DEFINE_MAXARGS]; >>>>> } __aligned(8) param; >>>>> + struct pt_regs fake_regs; >>>>> int i; >>>>> >>>>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *)); >>>>> >>>>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */ >>>>> - *(struct pt_regs **)¶m = regs; >>>>> + memset(&fake_regs, 0, sizeof(fake_regs)); >>>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping >>>> 336 bytes, so these memset(0) calls are not free for sure. >>>> >>>> But we don't need to do this unnecessary work all the time. >>>> >>>> I initially was going to suggest to use get_bpf_raw_tp_regs() from >>>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already >>>> memset(0) and used to initialize these minimal "fake regs". >>>> >>>> But, it turns out we don't need to do even that. Note >>>> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument, >>>> and if you pass a valid pointer there, it will return "fake regs" >>>> struct to be used. 
We already use that functionality in >>>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall >>>> tracepoints), so this seems to be a perfect fit. >>>> >>>>> + perf_fetch_caller_regs(&fake_regs); >>>>> + *(struct pt_regs **)¶m = &fake_regs; >>>>> param.syscall_nr = rec->nr; >>>>> for (i = 0; i < sys_data->nb_args; i++) >>>>> param.args[i] = rec->args[i]; >>>>> -- >>>>> 2.43.5 >>>>> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
  2024-09-10 18:22 ` Yonghong Song
@ 2024-09-10 18:25 ` Andrii Nakryiko
  0 siblings, 0 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 18:25 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    kernel-team, Martin KaFai Lau, Salvatore Benedetto

On Tue, Sep 10, 2024 at 11:22 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> On 9/10/24 9:50 AM, Andrii Nakryiko wrote:
> > On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >> On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
> >>> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
> >>> <andrii.nakryiko@gmail.com> wrote:
> >>>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>>> [... patch description trimmed; see the original posting above ...]
> >>>>>
> >>>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
> >>>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
> >>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> >>>>> ---
> >>>>>  kernel/trace/trace_syscalls.c | 5 ++++-
> >>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>> Note, we need to solve the same for perf_call_bpf_exit().
> >>>>
> >>>> pw-bot: cr
> >>>>
> >>> BTW, we lived with this bug for years, so I suggest basing your fix on
> >>> top of bpf-next/master, not bpf/master, which will give people a bit of
> >>> time to validate that the fix works as expected and doesn't produce
> >>> any undesirable side effects before it makes it into the final
> >>> Linux release.
> >>
> >> Yes, I did; note that I indeed used 'bpf-next' in the subject above.
> >
> > Huh, strange, I actually tried to apply your patch to bpf-next/master
> > and it didn't apply cleanly. It did apply to bpf/master, though, which
> > is why I assumed you based it off of bpf/master.
>
> Interesting. The following is my git history:
>
> 7b71206057440d9559ecb9cd02d891f46927b272 (HEAD -> trace_syscall) bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
> 41d0c4677feee1ea063e0f2c2af72dc953b1f1cc (origin/master, origin/HEAD, master) libbpf: Fix some typos in comments
> 72d8508ecd3b081dba03ec00930c6b07c1ad55d3 MAINTAINERS: BPF ARC JIT: Update my e-mail address
> bee109b7b3e50739b88252a219fa07ecd78ad628 bpf: Fix error message on kfunc arg type mismatch
> ...
>
> Not sure what is going on ...

Doesn't matter, maybe my local bpf-next/master branch was screwed up.
Just send a v2 when you are ready and I'll try again :)

>
> >>>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> >>>>> index 9c581d6da843..063f51952d49 100644
> >>>>> --- a/kernel/trace/trace_syscalls.c
> >>>>> +++ b/kernel/trace/trace_syscalls.c
> >>>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
> >>>> let's also drop the struct pt_regs * argument to
> >>>> perf_call_bpf_{enter,exit}(); they are not actually used anymore
> >>>>
> >>>>>                 int syscall_nr;
> >>>>>                 unsigned long args[SYSCALL_DEFINE_MAXARGS];
> >>>>>         } __aligned(8) param;
> >>>>> +       struct pt_regs fake_regs;
> >>>>>         int i;
> >>>>>
> >>>>>         BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
> >>>>>
> >>>>>         /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
> >>>>> -       *(struct pt_regs **)&param = regs;
> >>>>> +       memset(&fake_regs, 0, sizeof(fake_regs));
> >>>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
> >>>> 336 bytes, so these memset(0) calls are not free for sure.
> >>>>
> >>>> But we don't need to do this unnecessary work all the time.
> >>>>
> >>>> I initially was going to suggest using get_bpf_raw_tp_regs() from
> >>>> kernel/trace/bpf_trace.c to get a temporary pt_regs that is already
> >>>> memset(0) and can be used to initialize these minimal "fake regs".
> >>>>
> >>>> But it turns out we don't need to do even that. Note that
> >>>> perf_trace_buf_alloc() has a `struct pt_regs **` second argument,
> >>>> and if you pass a valid pointer there, it will return a "fake regs"
> >>>> struct to be used. We already use that functionality in
> >>>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
> >>>> tracepoints), so this seems to be a perfect fit.
> >>>>
> >>>>> +       perf_fetch_caller_regs(&fake_regs);
> >>>>> +       *(struct pt_regs **)&param = &fake_regs;
> >>>>>         param.syscall_nr = rec->nr;
> >>>>>         for (i = 0; i < sys_data->nb_args; i++)
> >>>>>                 param.args[i] = rec->args[i];
> >>>>> --
> >>>>> 2.43.5
> >>>>>
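The pattern Andrii suggests above can be illustrated outside the kernel. The sketch below is a user-space analogue, not the kernel API (all names here are illustrative): the allocator hands back a record buffer plus, through a second-level pointer, a scratch regs struct that was zeroed once up front, so callers avoid a per-call memset of a large on-stack struct. That is the idea behind passing a valid `struct pt_regs **` to perf_trace_buf_alloc().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for struct pt_regs (the real one is arch-specific
 * and much larger: 168 bytes on x86-64, 336 on arm64). */
struct fake_regs { unsigned long ip, sp, flags; };

#define NR_CTX 4

/* Scratch regs live in static storage (zero-filled once at startup),
 * not on the caller's stack -- no memset(0) on every call. */
static struct fake_regs scratch_regs[NR_CTX];
static char scratch_buf[NR_CTX][256];

/* Analogue of perf_trace_buf_alloc(): return a per-context record buffer
 * and, if the caller passes a valid regs pointer, a ready-to-use scratch
 * regs struct through it. */
static void *buf_alloc(size_t size, struct fake_regs **regs, int ctx)
{
    if (ctx < 0 || ctx >= NR_CTX || size > sizeof(scratch_buf[0]))
        return NULL;
    if (regs)
        *regs = &scratch_regs[ctx];
    return scratch_buf[ctx];
}
```

In the real kernel code the scratch regs are per-cpu (which is why the atomic, preemption-disabled context matters); the fixed `ctx` index here just stands in for that.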
* Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
  2024-09-10  5:34 ` Andrii Nakryiko
  2024-09-10  5:42 ` Andrii Nakryiko
@ 2024-09-10 15:23 ` Yonghong Song
  1 sibling, 0 replies; 8+ messages in thread
From: Yonghong Song @ 2024-09-10 15:23 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    kernel-team, Martin KaFai Lau, Salvatore Benedetto

On 9/9/24 10:34 PM, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> [... patch description trimmed; see the original posting above ...]
>>
>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>  kernel/trace/trace_syscalls.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
> Note, we need to solve the same for perf_call_bpf_exit().

Sorry, I missed this one! Will add it in the next revision.

>
> pw-bot: cr
>
>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>> index 9c581d6da843..063f51952d49 100644
>> --- a/kernel/trace/trace_syscalls.c
>> +++ b/kernel/trace/trace_syscalls.c
>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
> let's also drop the struct pt_regs * argument to
> perf_call_bpf_{enter,exit}(); they are not actually used anymore

Ack.

>
>>                 int syscall_nr;
>>                 unsigned long args[SYSCALL_DEFINE_MAXARGS];
>>         } __aligned(8) param;
>> +       struct pt_regs fake_regs;
>>         int i;
>>
>>         BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>
>>         /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
>> -       *(struct pt_regs **)&param = regs;
>> +       memset(&fake_regs, 0, sizeof(fake_regs));
> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
> 336 bytes, so these memset(0) calls are not free for sure.

I calculated the size on x86-64 and felt it might be acceptable, but I
was not aware that arm64's is much larger, at 336 bytes. Indeed, 336
bytes on the stack is quite large.

>
> But we don't need to do this unnecessary work all the time.
>
> I initially was going to suggest using get_bpf_raw_tp_regs() from
> kernel/trace/bpf_trace.c to get a temporary pt_regs that is already
> memset(0) and can be used to initialize these minimal "fake regs".
>
> But it turns out we don't need to do even that. Note that
> perf_trace_buf_alloc() has a `struct pt_regs **` second argument,
> and if you pass a valid pointer there, it will return a "fake regs"
> struct to be used. We already use that functionality in
> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
> tracepoints), so this seems to be a perfect fit.

I double-checked: the perf_call_bpf_enter() call is in atomic process
context (preemption disabled), so we are safe to use
perf_trace_buf_alloc(), which uses per-cpu variables.

>
>> +       perf_fetch_caller_regs(&fake_regs);
>> +       *(struct pt_regs **)&param = &fake_regs;
>>         param.syscall_nr = rec->nr;
>>         for (i = 0; i < sys_data->nb_args; i++)
>>                 param.args[i] = rec->args[i];
>> --
>> 2.43.5
>>
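The heart of the fix is what perf_fetch_caller_regs() provides: a fake regs struct whose instruction pointer is a caller-side kernel address, so stack walking starts in kernel text instead of at the user-space ip saved at syscall entry. A rough user-space analogue using compiler builtins (a hedged sketch: `fetch_caller_regs` and `fake_regs` are illustrative names, and the real perf_fetch_caller_regs() is an arch-specific macro):

```c
#include <assert.h>

/* Illustrative stand-in for the few pt_regs fields the unwinder needs. */
struct fake_regs { unsigned long ip, sp; };

/* Analogue of perf_fetch_caller_regs(): record where we were called from
 * (the return address) and the current frame, rather than reusing the
 * user-space ip that the syscall-entry pt_regs carries. noinline keeps
 * __builtin_return_address(0) pointing into our caller. */
static void __attribute__((noinline)) fetch_caller_regs(struct fake_regs *regs)
{
    regs->ip = (unsigned long)__builtin_return_address(0);
    regs->sp = (unsigned long)__builtin_frame_address(0);
}
```

With the user-space ip, the kernel unwinder finds no kernel frames, which is why the bpftrace kstack() output was empty; starting from the tracer's own caller address yields the syscall_trace_enter/do_syscall_64 frames shown in the commit message.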
end of thread, other threads:[~2024-09-10 18:25 UTC | newest]

Thread overview: 8+ messages
2024-09-10  3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10  5:34 ` Andrii Nakryiko
2024-09-10  5:42 ` Andrii Nakryiko
2024-09-10 15:25 ` Yonghong Song
2024-09-10 16:50 ` Andrii Nakryiko
2024-09-10 18:22 ` Yonghong Song
2024-09-10 18:25 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song