From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 11:22:31 -0700 [thread overview]
Message-ID: <e9b9db08-7ad4-47e0-be4d-6cd85eed854e@linux.dev> (raw)
In-Reply-To: <CAEf4BzahXi9t+Y883iCTDrAkcr2DEy0he-NW+jg9yT3TXH6NUA@mail.gmail.com>
On 9/10/24 9:50 AM, Andrii Nakryiko wrote:
> On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
>>> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>>>>> tracing the kernel stack is empty. For example, using the following
>>>>> command line
>>>>> bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>>>>> the output will be
>>>>> ===
>>>>> Kernel Stack
>>>>> ===
>>>>>
>>>>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>>>>> tracing is from the one constructed during user->kernel transition.
>>>>> The call stack looks like
>>>>> perf_syscall_enter+0x88/0x7c0
>>>>> trace_sys_enter+0x41/0x80
>>>>> syscall_trace_enter+0x100/0x160
>>>>> do_syscall_64+0x38/0xf0
>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> The ip address stored in pt_regs is from user space hence no kernel
>>>>> stack is printed.
>>>>>
>>>>> To fix the issue, we need to use kernel address from pt_regs.
>>>>> In kernel repo, there are already a few cases like this. For example,
>>>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>>>>> instances are used to supply ip address or use ip address to construct
>>>>> call stack.
>>>>>
>>>>> The patch follows the above example by using a fake pt_regs.
>>>>> The pt_regs is stored in local stack since the syscall tracepoint
>>>>> tracing is in process context and there are no possibility that
>>>>> different concurrent syscall tracepoint tracing could mess up with each
>>>>> other. This is similar to a perf_fetch_caller_regs() use case in
>>>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>>>>> where a local pt_regs is used.
>>>>>
>>>>> With this patch, for the above bpftrace script, I got the following output
>>>>> ===
>>>>> Kernel Stack
>>>>>
>>>>> syscall_trace_enter+407
>>>>> syscall_trace_enter+407
>>>>> do_syscall_64+74
>>>>> entry_SYSCALL_64_after_hwframe+75
>>>>> ===
>>>>>
>>>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>>>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> ---
>>>>> kernel/trace/trace_syscalls.c | 5 ++++-
>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>> Note, we need to solve the same for perf_call_bpf_exit().
>>>>
>>>> pw-bot: cr
>>>>
>>> BTW, we lived with this bug for years, so I suggest basing your fix on
>>> top of bpf-next/master, no bpf/master, which will give people a bit of
>>> time to validate that the fix works as expected and doesn't produce
>>> any undesirable side effects, before this makes it into the final
>>> Linux release.
>> Yes, I did. See I indeed use 'bpf-next' in subject above.
> Huh, strange, I actually tried to apply your patch to bpf-next/master
> and it didn't apply cleanly. It did apply to bpf/master, though, which
> is why I assumed you based it off of bpf/master.
Interesting. The following is my git history:
7b71206057440d9559ecb9cd02d891f46927b272 (HEAD -> trace_syscall) bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
41d0c4677feee1ea063e0f2c2af72dc953b1f1cc (origin/master, origin/HEAD, master) libbpf: Fix some typos in comments
72d8508ecd3b081dba03ec00930c6b07c1ad55d3 MAINTAINERS: BPF ARC JIT: Update my e-mail address
bee109b7b3e50739b88252a219fa07ecd78ad628 bpf: Fix error message on kfunc arg type mismatch
...
Not sure what is going on ...
>
>>>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>>>> index 9c581d6da843..063f51952d49 100644
>>>>> --- a/kernel/trace/trace_syscalls.c
>>>>> +++ b/kernel/trace/trace_syscalls.c
>>>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
>>>> let's also drop struct pt_regs * argument into
>>>> perf_call_bpf_{enter,exit}(), they are not actually used anymore
>>>>
>>>>> int syscall_nr;
>>>>> unsigned long args[SYSCALL_DEFINE_MAXARGS];
>>>>> } __aligned(8) param;
>>>>> + struct pt_regs fake_regs;
>>>>> int i;
>>>>>
>>>>> BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>>>>
>>>>> /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. ¶m) */
>>>>> - *(struct pt_regs **)¶m = regs;
>>>>> + memset(&fake_regs, 0, sizeof(fake_regs));
>>>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
>>>> 336 bytes, so these memset(0) calls are not free for sure.
>>>>
>>>> But we don't need to do this unnecessary work all the time.
>>>>
>>>> I initially was going to suggest to use get_bpf_raw_tp_regs() from
>>>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
>>>> memset(0) and used to initialize these minimal "fake regs".
>>>>
>>>> But, it turns out we don't need to do even that. Note
>>>> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
>>>> and if you pass a valid pointer there, it will return "fake regs"
>>>> struct to be used. We already use that functionality in
>>>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
>>>> tracepoints), so this seems to be a perfect fit.
>>>>
>>>>> + perf_fetch_caller_regs(&fake_regs);
>>>>> + *(struct pt_regs **)¶m = &fake_regs;
>>>>> param.syscall_nr = rec->nr;
>>>>> for (i = 0; i < sys_data->nb_args; i++)
>>>>> param.args[i] = rec->args[i];
>>>>> --
>>>>> 2.43.5
>>>>>
next prev parent reply other threads:[~2024-09-10 18:22 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10 5:34 ` Andrii Nakryiko
2024-09-10 5:42 ` Andrii Nakryiko
2024-09-10 15:25 ` Yonghong Song
2024-09-10 16:50 ` Andrii Nakryiko
2024-09-10 18:22 ` Yonghong Song [this message]
2024-09-10 18:25 ` Andrii Nakryiko
2024-09-10 15:23 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e9b9db08-7ad4-47e0-be4d-6cd85eed854e@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=salvabenedetto@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox