From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 08:23:26 -0700 [thread overview]
Message-ID: <ab9d0432-7d71-4510-bd2f-5fb5834f9772@linux.dev> (raw)
In-Reply-To: <CAEf4BzbsYn-b7YiKZ0MPW9_VLzDq38Jv8UkocfMLVje_SAmENA@mail.gmail.com>
On 9/9/24 10:34 PM, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>> tracing the kernel stack is empty. For example, using the following
>> command line
>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>> the output will be
>> ===
>> Kernel Stack
>> ===
>>
>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>> tracing is from the one constructed during user->kernel transition.
>> The call stack looks like
>> perf_syscall_enter+0x88/0x7c0
>> trace_sys_enter+0x41/0x80
>> syscall_trace_enter+0x100/0x160
>> do_syscall_64+0x38/0xf0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> The ip address stored in pt_regs is from user space hence no kernel
>> stack is printed.
>>
>> To fix the issue, we need to use kernel address from pt_regs.
>> In kernel repo, there are already a few cases like this. For example,
>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>> instances are used to supply ip address or use ip address to construct
>> call stack.
>>
>> The patch follows the above example by using a fake pt_regs.
>> The pt_regs is stored in local stack since the syscall tracepoint
>> tracing is in process context and there are no possibility that
>> different concurrent syscall tracepoint tracing could mess up with each
>> other. This is similar to a perf_fetch_caller_regs() use case in
>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>> where a local pt_regs is used.
>>
>> With this patch, for the above bpftrace script, I got the following output
>> ===
>> Kernel Stack
>>
>> syscall_trace_enter+407
>> syscall_trace_enter+407
>> do_syscall_64+74
>> entry_SYSCALL_64_after_hwframe+75
>> ===
>>
>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> kernel/trace/trace_syscalls.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
> Note, we need to solve the same for perf_call_bpf_exit().
Sorry, missed this one! Will add in the next revision.
>
> pw-bot: cr
>
>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>> index 9c581d6da843..063f51952d49 100644
>> --- a/kernel/trace/trace_syscalls.c
>> +++ b/kernel/trace/trace_syscalls.c
>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
> let's also drop struct pt_regs * argument into
> perf_call_bpf_{enter,exit}(), they are not actually used anymore
Ack.
>
>> int syscall_nr;
>> unsigned long args[SYSCALL_DEFINE_MAXARGS];
>> } __aligned(8) param;
>> + struct pt_regs fake_regs;
>> int i;
>>
>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>
>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */
>> - *(struct pt_regs **)¶m = regs;
>> + memset(&fake_regs, 0, sizeof(fake_regs));
> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
> 336 bytes, so these memset(0) calls are not free for sure.
I calculated size on x86-64 and feels it might be acceptable.
But not aware that arm64 has much larger size like 336. Indeed
336 bytes on the stack is quite large.
>
> But we don't need to do this unnecessary work all the time.
>
> I initially was going to suggest to use get_bpf_raw_tp_regs() from
> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
> memset(0) and used to initialize these minimal "fake regs".
>
> But, it turns out we don't need to do even that. Note
> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
> and if you pass a valid pointer there, it will return "fake regs"
> struct to be used. We already use that functionality in
> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
> tracepoints), so this seems to be a perfect fit.
I double checked and perf_call_bpf_enter() call is in atomic
process context (preempt disabled), so we are safe to use
perf_trace_buf_alloc() which uses per-cpu variables.
>
>> + perf_fetch_caller_regs(&fake_regs);
>> + *(struct pt_regs **)¶m = &fake_regs;
>> param.syscall_nr = rec->nr;
>> for (i = 0; i < sys_data->nb_args; i++)
>> param.args[i] = rec->args[i];
>> --
>> 2.43.5
>>
prev parent reply other threads:[~2024-09-10 15:23 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10 5:34 ` Andrii Nakryiko
2024-09-10 5:42 ` Andrii Nakryiko
2024-09-10 15:25 ` Yonghong Song
2024-09-10 16:50 ` Andrii Nakryiko
2024-09-10 18:22 ` Yonghong Song
2024-09-10 18:25 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ab9d0432-7d71-4510-bd2f-5fb5834f9772@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=salvabenedetto@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox