Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing

BPF List

From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
	Salvatore Benedetto <salvabenedetto@meta.com>
Subject: Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Date: Tue, 10 Sep 2024 08:25:22 -0700
Message-ID: <84f2c314-980c-4e01-bcaa-dafb62a934f3@linux.dev>
In-Reply-To: <CAEf4BzZC3FyP06p-H8JhQVJqOTRfjLSfNpHBZn3hN2WRfypDsw@mail.gmail.com>


On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
>>> tracing the kernel stack is empty. For example, using the following
>>> command line
>>>    bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>>> the output will be
>>> ===
>>>    Kernel Stack
>>> ===
>>>
>>> Further analysis shows that pt_regs used for bpf syscall tracepoint
>>> tracing is from the one constructed during user->kernel transition.
>>> The call stack looks like
>>>    perf_syscall_enter+0x88/0x7c0
>>>    trace_sys_enter+0x41/0x80
>>>    syscall_trace_enter+0x100/0x160
>>>    do_syscall_64+0x38/0xf0
>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> The ip stored in that pt_regs is a user-space address, hence no kernel
>>> stack is printed.
>>>
>>> To fix the issue, we need a pt_regs that holds a kernel address.
>>> In the kernel repo, there are already a few cases like this. For example,
>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
>>> instances are used to supply ip address or use ip address to construct
>>> call stack.
>>>
>>> The patch follows the above example by using a fake pt_regs.
>>> The pt_regs is stored on the local stack since syscall tracepoint
>>> tracing runs in process context and there is no possibility that
>>> concurrent syscall tracepoint tracing instances could interfere with
>>> each other. This is similar to a perf_fetch_caller_regs() use case in
>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
>>> where a local pt_regs is used.
>>>
>>> With this patch, for the above bpftrace script, I got the following output
>>> ===
>>>    Kernel Stack
>>>
>>>          syscall_trace_enter+407
>>>          syscall_trace_enter+407
>>>          do_syscall_64+74
>>>          entry_SYSCALL_64_after_hwframe+75
>>> ===
>>>
>>> Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>>   kernel/trace/trace_syscalls.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>> Note, we need to solve the same for perf_call_bpf_exit().
>>
>> pw-bot: cr
>>
> BTW, we lived with this bug for years, so I suggest basing your fix on
> top of bpf-next/master, not bpf/master, which will give people a bit of
> time to validate that the fix works as expected and doesn't produce
> any undesirable side effects, before this makes it into the final
> Linux release.

Yes, I did. Note that I already used 'bpf-next' in the subject above.

>
>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>> index 9c581d6da843..063f51952d49 100644
>>> --- a/kernel/trace/trace_syscalls.c
>>> +++ b/kernel/trace/trace_syscalls.c
>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
>> let's also drop the struct pt_regs * argument of
>> perf_call_bpf_{enter,exit}(); it is not actually used anymore
>>
>>>                  int syscall_nr;
>>>                  unsigned long args[SYSCALL_DEFINE_MAXARGS];
>>>          } __aligned(8) param;
>>> +       struct pt_regs fake_regs;
>>>          int i;
>>>
>>>          BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>>>
>>>          /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
>>> -       *(struct pt_regs **)&param = regs;
>>> +       memset(&fake_regs, 0, sizeof(fake_regs));
>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
>> 336 bytes, so these memset(0) calls are not free for sure.
>>
>> But we don't need to do this unnecessary work all the time.
>>
>> I was initially going to suggest using get_bpf_raw_tp_regs() from
>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
>> memset(0) and used to initialize these minimal "fake regs".
>>
>> But it turns out we don't even need to do that. Note that
>> perf_trace_buf_alloc() takes a `struct pt_regs **` as its second
>> argument, and if you pass a valid pointer there, it will return a
>> "fake regs" struct to be used. We already use that functionality in
>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
>> tracepoints), so this seems to be a perfect fit.
>>
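
For illustration only, here is a rough user-space sketch of the idea
behind reusing a pre-zeroed scratch pt_regs instead of calling memset()
on a stack copy for every event. It is not kernel code: struct
model_regs, scratch_regs[], model_get_scratch_regs() and
model_fetch_caller_regs() are hypothetical names, the 21-word layout only
mimics the 168-byte x86-64 size mentioned above, and the per-CPU aspect
of the real buffers is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CONTEXTS 4	/* assumed number of nesting levels */

/* Hypothetical stand-in for x86-64 struct pt_regs: 21 words = 168 bytes on LP64. */
struct model_regs {
	unsigned long gprs[16];
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};

/* Static storage is zero-initialized once, so no per-call memset() is needed. */
static struct model_regs scratch_regs[MODEL_NR_CONTEXTS];

/* Hand out the scratch slot for a nesting level, or NULL if out of range. */
static struct model_regs *model_get_scratch_regs(int ctx)
{
	if (ctx < 0 || ctx >= MODEL_NR_CONTEXTS)
		return NULL;
	return &scratch_regs[ctx];
}

/* Overwrite only the few fields a stack walker would start from. */
static void model_fetch_caller_regs(struct model_regs *regs,
				    unsigned long ip, unsigned long sp)
{
	assert(regs != NULL);
	regs->ip = ip;
	regs->sp = sp;
	regs->cs = 0;
	regs->flags = 0;
}
```

Because every call rewrites the same small set of fields and never
touches the rest, the remaining words stay zero across reuse, which is
why the up-front zeroing can be paid once rather than per event.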
>>> +       perf_fetch_caller_regs(&fake_regs);
>>> +       *(struct pt_regs **)&param = &fake_regs;
>>>          param.syscall_nr = rec->nr;
>>>          for (i = 0; i < sys_data->nb_args; i++)
>>>                  param.args[i] = rec->args[i];
>>> --
>>> 2.43.5
>>>
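
As a small user-space sketch of the context layout used in the hunk
above (the BPF program expects the regs pointer in the first slot of the
ctx, followed by the syscall number and arguments): struct mini_regs,
struct model_enter_ctx, model_fill_ctx() and model_ctx_regs() are
hypothetical names, not kernel APIs, and memcpy() stands in for the
kernel's pointer cast so the example stays within ISO C aliasing rules.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAXARGS 6	/* mirrors SYSCALL_DEFINE_MAXARGS */

/* Hypothetical, trimmed-down register snapshot. */
struct mini_regs {
	unsigned long ip;
	unsigned long sp;
};

/* Models the 'param' layout: the first slot doubles as the regs pointer. */
struct model_enter_ctx {
	unsigned long long ent;	/* placeholder for the trace entry header */
	int syscall_nr;
	unsigned long args[MODEL_MAXARGS];
} __attribute__((aligned(8)));

/* Counterpart of the BUILD_BUG_ON() in the quoted hunk. */
_Static_assert(sizeof(((struct model_enter_ctx *)0)->ent) >= sizeof(void *),
	       "first slot must be able to hold a pointer");

/* Build the ctx: regs pointer first, then syscall number and arguments. */
static void model_fill_ctx(struct model_enter_ctx *ctx, struct mini_regs *regs,
			   int nr, const unsigned long *args, int nb_args)
{
	int i;

	assert(nb_args >= 0 && nb_args <= MODEL_MAXARGS);
	memset(ctx, 0, sizeof(*ctx));
	memcpy(ctx, &regs, sizeof(regs));	/* store the pointer in slot 0 */
	ctx->syscall_nr = nr;
	for (i = 0; i < nb_args; i++)
		ctx->args[i] = args[i];
}

/* Read the regs pointer back the way a consumer of the ctx would. */
static struct mini_regs *model_ctx_regs(const struct model_enter_ctx *ctx)
{
	struct mini_regs *regs;

	memcpy(&regs, ctx, sizeof(regs));
	return regs;
}
```

The point of the sketch is only that whichever pt_regs pointer lands in
the first slot, the caller's original one or a "fake" one describing the
current kernel context, is what the consumer ends up unwinding from.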


Thread overview: 8+ messages
2024-09-10  3:43 [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing Yonghong Song
2024-09-10  5:34 ` Andrii Nakryiko
2024-09-10  5:42   ` Andrii Nakryiko
2024-09-10 15:25     ` Yonghong Song [this message]
2024-09-10 16:50       ` Andrii Nakryiko
2024-09-10 18:22         ` Yonghong Song
2024-09-10 18:25           ` Andrii Nakryiko
2024-09-10 15:23   ` Yonghong Song
