From: Alan Maguire <alan.maguire@oracle.com>
To: Eduard Zingerman <eddyz87@gmail.com>,
Timofei Pushkin <pushkin.td@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: Question: CO-RE-enabled PT_REGS macros give strange results
Date: Wed, 26 Jul 2023 14:46:15 +0100 [thread overview]
Message-ID: <308bfec7-38d7-9dcd-3130-5602658db47f@oracle.com> (raw)
In-Reply-To: <49c9170f7dd0d3e78a12570ae422bce553a1e236.camel@gmail.com>
On 26/07/2023 01:03, Eduard Zingerman wrote:
> On Tue, 2023-07-25 at 15:04 +0100, Alan Maguire wrote:
>> On 25/07/2023 00:00, Alan Maguire wrote:
>>> On 24/07/2023 16:04, Timofei Pushkin wrote:
>>>> On Mon, Jul 24, 2023 at 3:36 PM Alan Maguire <alan.maguire@oracle.com> wrote:
>>>>>
>>>>> On 24/07/2023 11:32, Timofei Pushkin wrote:
>>>>>> Dear BPF community,
>>>>>>
>>>>>> I'm developing a perf_event BPF program which reads some register
>>>>>> values (frame and instruction pointers in particular) from the context
>>>>>> provided to it. I found that CO-RE-enabled PT_REGS macros give results
>>>>>> different from the results of the usual PT_REGS macros. I run the
>>>>>> program on the same system I compiled it on, and so I cannot
>>>>>> understand why the results differ and which ones should I use?
>>>>>>
>>>>>> From my tests, the results of the usual macros are the correct ones
>>>>>> (e.g. I can symbolize the instruction pointers I get this way), but
>>>>>> since I try to follow the CO-RE principle, it seems like I should be
>>>>>> using the CO-RE-enabled variants instead.
>>>>>>
>>>>>> I did some experiments and found out that it is the
>>>>>> bpf_probe_read_kernel part of the CO-RE-enabled PT_REGS macros that
>>>>>> change the results and not __builtin_preserve_access_index. But I
>>>>>> still don't get why exactly it changes the results.
>>>>>>
>>>>>
>>>>> Can you provide the exact usage of the BPF CO-RE macros that isn't
>>>>> working, and the equivalent non-CO-RE version that is? Also if you
>>>>
>>>> As a minimal example, I wrote the following little BPF program which
>>>> prints instruction pointers obtained with non-CO-RE and CO-RE macros:
>>>>
>>>> volatile const pid_t target_pid;
>>>>
>>>> SEC("perf_event")
>>>> int do_test(struct bpf_perf_event_data *ctx) {
>>>> pid_t pid = bpf_get_current_pid_tgid();
>>>> if (pid != target_pid) return 0;
>>>>
>>>> unsigned long p = PT_REGS_IP(&ctx->regs);
>>>> unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
>>>> bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> From user space, I set the target PID and attach the program to CPU
>>>> clock perf events (error checking and cleanup omitted for brevity):
>>>>
>>>> int main(int argc, char *argv[]) {
>>>> // Load the program also setting the target PID
>>>> struct test_program_bpf *skel = test_program_bpf__open();
>>>> skel->rodata->target_pid = (pid_t) strtol(argv[1], NULL, 10);
>>>> test_program_bpf__load(skel);
>>>>
>>>> // Attach to perf events
>>>> struct perf_event_attr attr = {
>>>> .type = PERF_TYPE_SOFTWARE,
>>>> .size = sizeof(struct perf_event_attr),
>>>> .config = PERF_COUNT_SW_CPU_CLOCK,
>>>> .sample_freq = 1,
>>>> .freq = true
>>>> };
>>>> for (int cpu_i = 0; cpu_i < libbpf_num_possible_cpus(); cpu_i++) {
>>>> int perf_fd = syscall(SYS_perf_event_open, &attr, -1, cpu_i, -1, 0);
>>>> bpf_program__attach_perf_event(skel->progs.do_test, perf_fd);
>>>> }
>>>>
>>>> // Wait for Ctrl-C
>>>> pause();
>>>> return 0;
>>>> }
>>>>
>>>> As an experiment, I launched a simple C program with an endless loop
>>>> in main and started the BPF program above with its target PID set to
>>>> the PID of this simple C program. Then by checking the virtual memory
>>>> mapped for the C program (with "cat /proc/<PID>/maps"), I found out
>>>> that its .text section got mapped into 55ca2577b000-55ca2577c000
>>>> address space. When I checked the output of the BPF program, I got
>>>> "non-CO-RE: 55ca2577b131, CO-RE: ffffa58810527e48". As you can see,
>>>> the non-CO-RE result maps into the .text section of the launched C
>>>> program (as it should since this is the value of the instruction
>>>> pointer), while the CO-RE result does not.
>>>>
>>>> Alternatively, if I replace PT_REGS_IP and PT_REGS_IP_CORE with the
>>>> equivalents for the stack pointer (PT_REGS_SP and PT_REGS_SP_CORE), I
>>>> get results that correspond to the stack address space from the
>>>> non-CO-RE macro, but I always get 0 from the CO-RE macro.
>>>>
>>>>> can provide details on the platform you're running on that will
>>>>> help narrow down the issue. Thanks!
>>>>
>>>> Sure. I'm running Ubuntu 22.04.1, kernel version 5.19.0-46-generic,
>>>> the architecture is x86_64, clang 14.0.0 is used to compile BPF
>>>> programs with flags -g -O2 -D__TARGET_ARCH_x86.
>>>>
>>>
>>> Thanks for the additional details! I've reproduced this on
>>> bpf-next with LLVM 15; I'm seeing the same issues with the CO-RE
>>> macros, and with BPF_CORE_READ(). However with extra libbpf debugging
>>> I do see that we pick up the right type id/index for the ip field in
>>> pt_regs:
>>>
>>> libbpf: prog 'do_test': relo #4: matching candidate #0 <byte_off> [216]
>>> struct pt_regs.ip (0:16 @ offset 128)
>>>
>>> One thing I noticed - perhaps this will ring some bells for someone -
>>> if I use __builtin_preserve_access_index() I get the same (correct)
>>> value for ip as is retrieved with PT_REGS_IP():
>>>
>>> __builtin_preserve_access_index(({
>>> p_core = ctx->regs.ip;
>>> }));
>>>
>>> I'll check with latest LLVM to see if the issue persists there.
>>>
>>
>>
>> The problem occurs with latest bpf-next + latest LLVM too. Perf event
>> programs fix up context accesses to the "struct bpf_perf_event_data *"
>> context, so accessing ctx->regs in your program becomes accessing the
>> "struct bpf_perf_event_data_kern *" regs, which is a pointer to
>> struct pt_regs. So I _think_ that's why the
>>
>> __builtin_preserve_access_index(({
>> p_core = ctx->regs.ip;
>> }));
>>
>>
>> ...works; ctx->regs is fixed up to point at the right place, then
>> CO-RE does its thing with the results. Contrast this with
>>
>> bpf_probe_read_kernel(&ip, sizeof(ip), &ctx->regs.ip);
>>
>> In the latter case, the fixups don't seem to happen and we get a
>> bogus address which appears to be consistently 218 bytes after the ctx
>> pointer. I've confirmed that a basic bpf_probe_read_kernel()
>> exposes the issue (and gives the same wrong address as a CO-RE-wrapped
>> bpf_probe_read_kernel()).
>>
>> I tried some permutations like defining
>>
>> struct pt_regs *regs = &ctx->regs;
>>
>> ...to see if that helps, but I think in that case the accesses aren't
>> caught by the verifier because we use the & operator on the ctx->regs.
>>
>> Not sure how smart the verifier can be about context accesses like this;
>> can someone who understands that code better than me take a look at this?
>
> Hi Alan,
>
> Your analysis is correct: verifier applies rewrites to instructions
> that read/write from/to certain context fields, including
> `struct bpf_perf_event_data`.
>
> This is done by function verifier.c:convert_ctx_accesses().
> This function handles BPF_LDX, BPF_STX and BPF_ST instructions, but it
> does not handle calls to helpers like bpf_probe_read_kernel().
>
> So, when code generated for PT_REGS_IP(&ctx->regs) is processed, the
> correct access sequence is inserted by function
> bpf_trace.c:pe_prog_convert_ctx_access() (see below).
>
> But code generated for `PT_REGS_IP_CORE(&ctx->regs)` is not modified.
>
Ah, makes sense. Would you consider it a bug that helper parameters
don't get context conversions applied, or are there additional
complexities here that mean that's not doable? (I'm wondering if
we should fix versus document this?). I would have thought the
only difference is the destination register, but the verifier is
a mysterious land to me..
> It looks like `PT_REGS_IP_CORE` macro should not be defined through
> bpf_probe_read_kernel(). I'll dig through commit history tomorrow to
> understand why is it defined like that now.
> help
If I recall the rationale was to allow the macros to work for both
BPF programs that can do direct dereference (fentry, fexit, tp_btf etc)
and for kprobe-style that need to use bpf_probe_read_kernel().
Not sure if it would be worth having variants that are purely
dereference-based, since we can just use PT_REGS_IP() due to
the __builtin_preserve_access_index attributes applied in vmlinux.h.
Thanks!
Alan
> Thanks,
> Eduard
>
> ---
> Below is annotated example, inpatient reader might skip it
>
> For the following test program:
>
> #include "vmlinux.h"
> ...
> SEC("perf_event")
> int do_test(struct bpf_perf_event_data *ctx) {
> unsigned long p = PT_REGS_IP(&ctx->regs);
> unsigned long p_core = PT_REGS_IP_CORE(&ctx->regs);
> bpf_printk("non-CO-RE: %lx, CO-RE: %lx", p, p_core);
> return 0;
> }
>
> Generated BPF code looks as follows:
>
> $ llvm-objdump --no-show-raw-insn -rd bpf.linked.o
> ...
> 0000000000000000 <do_test>:
> # Third argument for bpf_probe_read_kernel: offset of bpf_perf_event_data::regs.ip
> 0: r2 = 0x80
> 0000000000000000: CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
> 1: r3 = r1
> 2: r3 += r2
> # The "non CO-RE" version of PT_REGS_IP is, in fact, CO-RE
> # because `struct bpf_perf_event_data` has preserve_access_index
> # tag in the vmlinux.h.
> # Here the regs.ip is stored in r6 to be used after the call
> # to bpf_probe_read_kernel() (from PT_REGS_IP_CORE).
> 3: r6 = *(u64 *)(r1 + 0x80)
> 0000000000000018: CO-RE <byte_off> [2] struct bpf_perf_event_data::regs.ip (0:0:16)
> # First argument for bpf_probe_read_kernel: a place on stack to put read result to.
> 4: r1 = r10
> 5: r1 += -0x8
> # Second argument for bpf_probe_read_kernel: the size of the field to read.
> 6: w2 = 0x8
> # Call to bpf_probe_read_kernel()
> 7: call 0x71
> # Fourth parameter of bpf_printk: p_core read from stack
> # (was written by call to bpf_probe_read_kernel)
> 8: r4 = *(u64 *)(r10 - 0x8)
> # First parameter of bpf_printk: control string
> 9: r1 = 0x0 ll
> 0000000000000048: R_BPF_64_64 .rodata
> # Second parameter of bpf_printk: size of the control string
> 11: w2 = 0x1b
> # Third parameter of bpf_printk: p (see addr 3)
> 12: r3 = r6
> # Call to bpf_printk
> 13: call 0x6
> ; return 0;
> 14: w0 = 0x0
> 15: exit
>
> I get the following BPF after all verifier rewrites are applied
> (including verifier.c:convert_ctx_accesses()):
>
> # ./tools/bpf/bpftool/bpftool prog dump xlated id 114
> int do_test(struct bpf_perf_event_data * ctx):
> ; int do_test(struct bpf_perf_event_data *ctx) {
> 0: (b7) r2 = 128 | CO-RE replacement, 128 is a valid offset for
> | bpf_perf_event_data::regs.ip in my kernel
> 1: (bf) r3 = r1
> 2: (0f) r3 += r2
>
> 3: (79) r6 = *(u64 *)(r1 +0) | This is an expantion of the
> 4: (79) r6 = *(u64 *)(r6 +128) | r6 = *(u64 *)(r1 + 0x80)
> 5: (bf) r1 = r10 | Created by bpf_trace.c:pe_prog_convert_ctx_access()
>
> 6: (07) r1 += -8
> 7: (b4) w2 = 8
> 8: (85) call bpf_probe_read_kernel#-91984
> 9: (79) r4 = *(u64 *)(r10 -8)
> 10: (18) r1 = map[id:59][0]+0
> 12: (b4) w2 = 27
> 13: (bf) r3 = r6
> 14: (85) call bpf_trace_printk#-85520
> 15: (b4) w0 = 0
> 16: (95) exit
>
>
next prev parent reply other threads:[~2023-07-26 13:46 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-24 10:32 Question: CO-RE-enabled PT_REGS macros give strange results Timofei Pushkin
2023-07-24 12:36 ` Alan Maguire
2023-07-24 15:04 ` Timofei Pushkin
2023-07-24 23:00 ` Alan Maguire
2023-07-25 14:04 ` Alan Maguire
2023-07-26 0:03 ` Eduard Zingerman
2023-07-26 13:46 ` Alan Maguire [this message]
2023-07-26 20:03 ` Eduard Zingerman
2023-07-26 23:39 ` Eduard Zingerman
2023-07-28 3:03 ` Yonghong Song
2023-07-28 12:30 ` Eduard Zingerman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=308bfec7-38d7-9dcd-3130-5602658db47f@oracle.com \
--to=alan.maguire@oracle.com \
--cc=bpf@vger.kernel.org \
--cc=eddyz87@gmail.com \
--cc=pushkin.td@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox