Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing


From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
	Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 11:22:31 -0700
Message-ID: <e9b9db08-7ad4-47e0-be4d-6cd85eed854e@linux.dev>
In-Reply-To: <CAEf4BzahXi9t+Y883iCTDrAkcr2DEy0he-NW+jg9yT3TXH6NUA@mail.gmail.com>


On 9/10/24 9:50 AM, Andrii Nakryiko wrote:
> On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
>>> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>>>>> tracing the kernel stack is empty. For example, using the following
>>>>> command line
>>>>>     bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>>>>> the output will be
>>>>> ===
>>>>>     Kernel Stack
>>>>> ===
>>>>>
>>>>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>>>>> tracing is from the one constructed during user->kernel transition.
>>>>> The call stack looks like
>>>>>     perf_syscall_enter+0x88/0x7c0
>>>>>     trace_sys_enter+0x41/0x80
>>>>>     syscall_trace_enter+0x100/0x160
>>>>>     do_syscall_64+0x38/0xf0
>>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> The ip address stored in that pt_regs is from user space, hence no kernel
>>>>> stack is printed.
>>>>>
>>>>> To fix the issue, we need a pt_regs that holds a kernel address.
>>>>> In the kernel repo, there are already a few cases like this. For example,
>>>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>>>>> instances are used to supply ip address or use ip address to construct
>>>>> call stack.
>>>>>
>>>>> The patch follows the above example by using a fake pt_regs.
>>>>> The pt_regs is stored on the local stack since syscall tracepoint
>>>>> tracing runs in process context and there is no possibility that
>>>>> concurrent syscall tracepoint tracing instances could interfere with each
>>>>> other. This is similar to a perf_fetch_caller_regs() use case in
>>>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>>>>> where a local pt_regs is used.
>>>>>
>>>>> With this patch, for the above bpftrace script, I got the following output
>>>>> ===
>>>>>     Kernel Stack
>>>>>
>>>>>           syscall_trace_enter+407
>>>>>           syscall_trace_enter+407
>>>>>           do_syscall_64+74
>>>>>           entry_SYSCALL_64_after_hwframe+75
>>>>> ===
>>>>>
>>>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>>>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> ---
>>>>>    kernel/trace/trace_syscalls.c | 5 ++++-
>>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>> Note, we need to solve the same for perf_call_bpf_exit().
>>>>
>>>> pw-bot: cr
>>>>
>>> BTW, we lived with this bug for years, so I suggest basing your fix on
>>> top of bpf-next/master, not bpf/master, which will give people a bit of
>>> time to validate that the fix works as expected and doesn't produce
>>> any undesirable side effects, before this makes it into the final
>>> Linux release.
>> Yes, I did. Note that I used 'bpf-next' in the subject above.
> Huh, strange, I actually tried to apply your patch to bpf-next/master
> and it didn't apply cleanly. It did apply to bpf/master, though, which
> is why I assumed you based it off of bpf/master.

Interesting. The following is my git history:

7b71206057440d9559ecb9cd02d891f46927b272 (HEAD -> trace_syscall) bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
41d0c4677feee1ea063e0f2c2af72dc953b1f1cc (origin/master, origin/HEAD, master) libbpf: Fix some typos in comments
72d8508ecd3b081dba03ec00930c6b07c1ad55d3 MAINTAINERS: BPF ARC JIT: Update my e-mail address
bee109b7b3e50739b88252a219fa07ecd78ad628 bpf: Fix error message on kfunc arg type mismatch
...

Not sure what is going on ...
   

>
>>>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>>>> index 9c581d6da843..063f51952d49 100644
>>>>> --- a/kernel/trace/trace_syscalls.c
>>>>> +++ b/kernel/trace/trace_syscalls.c
>>>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
>>>> let's also drop the struct pt_regs * argument of
>>>> perf_call_bpf_{enter,exit}(); it is not actually used anymore
>>>>
>>>>>                   int syscall_nr;
>>>>>                   unsigned long args[SYSCALL_DEFINE_MAXARGS];
>>>>>           } __aligned(8) param;
>>>>> +       struct pt_regs fake_regs;
>>>>>           int i;
>>>>>
>>>>>           BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>>>>
>>>>>           /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
>>>>> -       *(struct pt_regs **)&param = regs;
>>>>> +       memset(&fake_regs, 0, sizeof(fake_regs));
>>>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
>>>> 336 bytes, so these memset(0) calls are not free for sure.
>>>>
>>>> But we don't need to do this unnecessary work all the time.
>>>>
>>>> I initially was going to suggest using get_bpf_raw_tp_regs() from
>>>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
>>>> memset(0) and use it to initialize these minimal "fake regs".
>>>>
>>>> But it turns out we don't even need to do that. Note that
>>>> perf_trace_buf_alloc() takes a `struct pt_regs **` second argument,
>>>> and if you pass a valid pointer there, it will return a "fake regs"
>>>> struct to be used. We already use that functionality in
>>>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
>>>> tracepoints), so this seems to be a perfect fit.
>>>>
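
A minimal userspace sketch of the out-parameter pattern suggested above,
not kernel code. It assumes the allocator hands back a pointer to shared,
already-zeroed scratch registers so the caller does not have to memset()
a local pt_regs on every call. All names here (mock_regs,
mock_trace_buf_alloc, mock_enter, MOCK_BUF_SIZE) are hypothetical
stand-ins; the real perf_trace_buf_alloc() takes further arguments and
manages per-context buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct mock_regs {
	unsigned long ip;
	unsigned long sp;
};

#define MOCK_BUF_SIZE 256

/* Static storage is zero-initialized once, so no memset() per call. */
static struct mock_regs scratch_regs;
static unsigned char scratch_buf[MOCK_BUF_SIZE];

/*
 * Models an allocator with a 'regs' out-parameter: returns the record
 * buffer and, if 'regs' is non-NULL, also hands back a pointer to the
 * shared scratch registers.
 */
static void *mock_trace_buf_alloc(size_t size, struct mock_regs **regs)
{
	if (size > sizeof(scratch_buf))
		return NULL;
	if (regs)
		*regs = &scratch_regs;
	return scratch_buf;
}

/* Caller side: fill in only the field the stack walker needs. */
static struct mock_regs *mock_enter(unsigned long kernel_ip)
{
	struct mock_regs *fake_regs = NULL;
	void *rec = mock_trace_buf_alloc(64, &fake_regs);

	assert(rec != NULL && fake_regs != NULL);
	fake_regs->ip = kernel_ip;
	return fake_regs;
}
```

The sketch only models the cost argument: the caller writes the few fields
it cares about and never clears the whole structure itself.
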
>>>>> +       perf_fetch_caller_regs(&fake_regs);
>>>>> +       *(struct pt_regs **)&param = &fake_regs;
>>>>>           param.syscall_nr = rec->nr;
>>>>>           for (i = 0; i < sys_data->nb_args; i++)
>>>>>                   param.args[i] = rec->args[i];
>>>>> --
>>>>> 2.43.5
>>>>>
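
For readers following the hunk above, here is a minimal userspace sketch
of the "regs pointer is the first member of the ctx" convention that the
patch relies on. It is an illustration under stated assumptions, not the
kernel implementation: mock_regs, mock_param, MOCK_MAXARGS and the helper
names are hypothetical, and memcpy() is used in place of the kernel's
pointer cast to keep the sketch free of aliasing problems in portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_MAXARGS 6 /* stand-in for SYSCALL_DEFINE_MAXARGS */

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct mock_regs {
	unsigned long ip;
	unsigned long sp;
};

/* Mirrors the shape of 'param' in perf_call_bpf_enter(). */
struct mock_param {
	unsigned long long ent; /* slot overwritten with the regs pointer */
	int syscall_nr;
	unsigned long args[MOCK_MAXARGS];
} __attribute__((aligned(8)));

/* Same intent as the BUILD_BUG_ON() in the patch context. */
static_assert(sizeof(((struct mock_param *)0)->ent) >= sizeof(void *),
	      "first slot must be able to hold a pointer");

/* Store the regs pointer at offset 0 of the context. */
static void param_set_regs(struct mock_param *param, struct mock_regs *regs)
{
	memcpy(param, &regs, sizeof(regs));
}

/* What a consumer does: load the pointer from offset 0, then read ip. */
static unsigned long param_ip(const void *ctx)
{
	struct mock_regs *regs;

	memcpy(&regs, ctx, sizeof(regs));
	return regs->ip;
}

/* Models the patched flow: local fake regs carrying a kernel-side ip. */
static unsigned long mock_call_bpf_enter(unsigned long kernel_ip, int nr)
{
	struct mock_param param;
	struct mock_regs fake_regs;

	memset(&fake_regs, 0, sizeof(fake_regs));
	fake_regs.ip = kernel_ip; /* stand-in for perf_fetch_caller_regs() */

	memset(&param, 0, sizeof(param));
	param_set_regs(&param, &fake_regs);
	param.syscall_nr = nr;

	return param_ip(&param);
}
```

The consumer sees whatever ip the stored regs carry, which is why
swapping the user-space pt_regs for one filled in on the kernel side
changes what the stack walker starts from.
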

Thread overview: 8+ messages
2024-09-10  3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10  5:34 ` Andrii Nakryiko
2024-09-10  5:42   ` Andrii Nakryiko
2024-09-10 15:25     ` Yonghong Song
2024-09-10 16:50       ` Andrii Nakryiko
2024-09-10 18:22         ` Yonghong Song [this message]
2024-09-10 18:25           ` Andrii Nakryiko
2024-09-10 15:23   ` Yonghong Song
