From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 08:23:26 -0700 [thread overview]
Message-ID: <ab9d0432-7d71-4510-bd2f-5fb5834f9772@linux.dev> (raw)
In-Reply-To: <CAEf4BzbsYn-b7YiKZ0MPW9_VLzDq38Jv8UkocfMLVje_SAmENA@mail.gmail.com>
On 9/9/24 10:34 PM, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>> tracing the kernel stack is empty. For example, using the following
>> command line
>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>> the output will be
>> ===
>> Kernel Stack
>> ===
>>
>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>> tracing is from the one constructed during user->kernel transition.
>> The call stack looks like
>> perf_syscall_enter+0x88/0x7c0
>> trace_sys_enter+0x41/0x80
>> syscall_trace_enter+0x100/0x160
>> do_syscall_64+0x38/0xf0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> The ip address stored in pt_regs is from user space hence no kernel
>> stack is printed.
>>
>> To fix the issue, we need to use kernel address from pt_regs.
>> In kernel repo, there are already a few cases like this. For example,
>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>> instances are used to supply ip address or use ip address to construct
>> call stack.
>>
>> The patch follows the above example by using a fake pt_regs.
>> The pt_regs is stored in local stack since the syscall tracepoint
>> tracing is in process context and there are no possibility that
>> different concurrent syscall tracepoint tracing could mess up with each
>> other. This is similar to a perf_fetch_caller_regs() use case in
>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>> where a local pt_regs is used.
>>
>> With this patch, for the above bpftrace script, I got the following output
>> ===
>> Kernel Stack
>>
>> syscall_trace_enter+407
>> syscall_trace_enter+407
>> do_syscall_64+74
>> entry_SYSCALL_64_after_hwframe+75
>> ===
>>
>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> kernel/trace/trace_syscalls.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
> Note, we need to solve the same for perf_call_bpf_exit().
Sorry, missed this one! Will add in the next revision.
>
> pw-bot: cr
>
>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>> index 9c581d6da843..063f51952d49 100644
>> --- a/kernel/trace/trace_syscalls.c
>> +++ b/kernel/trace/trace_syscalls.c
>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
> let's also drop struct pt_regs * argument into
> perf_call_bpf_{enter,exit}(), they are not actually used anymore
Ack.
>
>> int syscall_nr;
>> unsigned long args[SYSCALL_DEFINE_MAXARGS];
>> } __aligned(8) param;
>> + struct pt_regs fake_regs;
>> int i;
>>
>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>
>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */
>> - *(struct pt_regs **)¶m = regs;
>> + memset(&fake_regs, 0, sizeof(fake_regs));
> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
> 336 bytes, so these memset(0) calls are not free for sure.
I calculated size on x86-64 and feels it might be acceptable.
But not aware that arm64 has much larger size like 336. Indeed
336 bytes on the stack is quite large.
>
> But we don't need to do this unnecessary work all the time.
>
> I initially was going to suggest to use get_bpf_raw_tp_regs() from
> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
> memset(0) and used to initialize these minimal "fake regs".
>
> But, it turns out we don't need to do even that. Note
> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
> and if you pass a valid pointer there, it will return "fake regs"
> struct to be used. We already use that functionality in
> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
> tracepoints), so this seems to be a perfect fit.
I double checked and perf_call_bpf_enter() call is in atomic
process context (preempt disabled), so we are safe to use
perf_trace_buf_alloc() which uses per-cpu variables.
>
>> + perf_fetch_caller_regs(&fake_regs);
>> + *(struct pt_regs **)¶m = &fake_regs;
>> param.syscall_nr = rec->nr;
>> for (i = 0; i < sys_data->nb_args; i++)
>> param.args[i] = rec->args[i];
>> --
>> 2.43.5
>>
prev parent reply other threads:[~2024-09-10 15:23 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10 5:34 ` Andrii Nakryiko
2024-09-10 5:42 ` Andrii Nakryiko
2024-09-10 15:25 ` Yonghong Song
2024-09-10 16:50 ` Andrii Nakryiko
2024-09-10 18:22 ` Yonghong Song
2024-09-10 18:25 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ab9d0432-7d71-4510-bd2f-5fb5834f9772@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=salvabenedetto@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.