From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 08:25:22 -0700
Message-ID: <84f2c314-980c-4e01-bcaa-dafb62a934f3@linux.dev>
In-Reply-To: <CAEf4BzZC3FyP06p-H8JhQVJqOTRfjLSfNpHBZn3hN2WRfypDsw@mail.gmail.com>
On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>>> tracing the kernel stack is empty. For example, using the following
>>> command line
>>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>>> the output will be
>>> ===
>>> Kernel Stack
>>> ===
>>>
>>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>>> tracing is from the one constructed during user->kernel transition.
>>> The call stack looks like
>>> perf_syscall_enter+0x88/0x7c0
>>> trace_sys_enter+0x41/0x80
>>> syscall_trace_enter+0x100/0x160
>>> do_syscall_64+0x38/0xf0
>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> The ip address stored in pt_regs is from user space hence no kernel
>>> stack is printed.
>>>
>>> To fix the issue, we need to use kernel address from pt_regs.
>>> In kernel repo, there are already a few cases like this. For example,
>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>>> instances are used to supply ip address or use ip address to construct
>>> call stack.
>>>
>>> The patch follows the above example by using a fake pt_regs.
>>> The pt_regs is stored in local stack since the syscall tracepoint
>>> tracing is in process context and there is no possibility that
>>> concurrent syscall tracepoint tracing could interfere with each
>>> other. This is similar to a perf_fetch_caller_regs() use case in
>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>>> where a local pt_regs is used.
>>>
>>> With this patch, for the above bpftrace script, I got the following output
>>> ===
>>> Kernel Stack
>>>
>>> syscall_trace_enter+407
>>> syscall_trace_enter+407
>>> do_syscall_64+74
>>> entry_SYSCALL_64_after_hwframe+75
>>> ===
>>>
>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>> kernel/trace/trace_syscalls.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>> Note, we need to solve the same for perf_call_bpf_exit().
>>
>> pw-bot: cr
>>
> BTW, we lived with this bug for years, so I suggest basing your fix on
> top of bpf-next/master, not bpf/master, which will give people a bit of
> time to validate that the fix works as expected and doesn't produce
> any undesirable side effects, before this makes it into the final
> Linux release.
Yes, I did. Note the 'bpf-next' tag in the subject line above.
>
>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>> index 9c581d6da843..063f51952d49 100644
>>> --- a/kernel/trace/trace_syscalls.c
>>> +++ b/kernel/trace/trace_syscalls.c
>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
>> let's also drop the struct pt_regs * argument of
>> perf_call_bpf_{enter,exit}(); it is not actually used anymore
>>
>>> int syscall_nr;
>>> unsigned long args[SYSCALL_DEFINE_MAXARGS];
>>> } __aligned(8) param;
>>> + struct pt_regs fake_regs;
>>> int i;
>>>
>>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>>
>>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
>>> - *(struct pt_regs **)&param = regs;
>>> + memset(&fake_regs, 0, sizeof(fake_regs));
>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
>> 336 bytes, so these memset(0) calls are not free for sure.
>>
>> But we don't need to do this unnecessary work all the time.
>>
>> I initially was going to suggest to use get_bpf_raw_tp_regs() from
>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
>> memset(0) and used to initialize these minimal "fake regs".
>>
>> But, it turns out we don't need to do even that. Note
>> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
>> and if you pass a valid pointer there, it will return "fake regs"
>> struct to be used. We already use that functionality in
>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
>> tracepoints), so this seems to be a perfect fit.
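To check my reading of this suggestion, here is a minimal userspace model, not kernel code. Everything prefixed with model_ is a hypothetical stand-in: model_buf_alloc() mimics an allocator that hands back a preallocated, already-zeroed regs slot through a `struct ... **` out-parameter, and model_fetch_caller_regs() mimics filling in only the few fields an unwinder needs. The four-context array and the 21-word struct are assumptions for illustration.

```c
/* Userspace model only; all model_* names are hypothetical stand-ins. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_CONTEXTS 4	/* assumed number of recursion contexts */

/* Stand-in for struct pt_regs (21 longs is 168 bytes on LP64). */
struct model_regs {
	unsigned long words[21];
};

/* Static storage is zero-initialized once, so no per-call memset is needed. */
static struct model_regs model_perf_regs[MODEL_NR_CONTEXTS];
static unsigned char model_bufs[MODEL_NR_CONTEXTS][64];

/* Returns the record buffer; if 'regs' is non-NULL, also hands out a regs slot. */
static void *model_buf_alloc(int rctx, struct model_regs **regs)
{
	if (rctx < 0 || rctx >= MODEL_NR_CONTEXTS)
		return NULL;
	if (regs)
		*regs = &model_perf_regs[rctx];
	return model_bufs[rctx];
}

/* Fills only the fields an unwinder would need (here: ip and sp). */
static void model_fetch_caller_regs(struct model_regs *regs,
				    unsigned long ip, unsigned long sp)
{
	regs->words[0] = ip;
	regs->words[1] = sp;
}

/* Variant in this patch: clear a local struct on every call. */
static unsigned long model_enter_local(unsigned long ip, unsigned long sp)
{
	struct model_regs fake_regs;

	memset(&fake_regs, 0, sizeof(fake_regs));
	model_fetch_caller_regs(&fake_regs, ip, sp);
	return fake_regs.words[0];
}

/* Suggested variant: reuse the slot returned by the allocator. */
static unsigned long model_enter_prealloc(int rctx, unsigned long ip,
					  unsigned long sp)
{
	struct model_regs *fake_regs = NULL;
	void *rec = model_buf_alloc(rctx, &fake_regs);

	if (!rec)
		return 0;
	model_fetch_caller_regs(fake_regs, ip, sp);
	return fake_regs->words[0];
}
```

Both variants report the same ip to the consumer; the second one just skips clearing the whole struct on each call, which is the cost being pointed out above.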
>>
>>> + perf_fetch_caller_regs(&fake_regs);
>>> + *(struct pt_regs **)&param = &fake_regs;
>>> param.syscall_nr = rec->nr;
>>> for (i = 0; i < sys_data->nb_args; i++)
>>> param.args[i] = rec->args[i];
>>> --
>>> 2.43.5
>>>
Thread overview: 8+ messages
2024-09-10 3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10 5:34 ` Andrii Nakryiko
2024-09-10 5:42 ` Andrii Nakryiko
2024-09-10 15:25 ` Yonghong Song [this message]
2024-09-10 16:50 ` Andrii Nakryiko
2024-09-10 18:22 ` Yonghong Song
2024-09-10 18:25 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song