From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>,
Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack
Date: Wed, 9 Oct 2024 08:56:08 -0700 [thread overview]
Message-ID: <5be197d5-dec6-4d65-9908-1bfb6267d091@linux.dev> (raw)
In-Reply-To: <CAADnVQ+T5AD8J_p3U5vpTs=5nqpypuQeGBE+wezB7mnh8Axo0Q@mail.gmail.com>
On 10/9/24 7:56 AM, Alexei Starovoitov wrote:
> On Tue, Oct 8, 2024 at 11:31 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/8/24 7:06 PM, Alexei Starovoitov wrote:
>>> On Tue, Oct 8, 2024 at 3:10 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> We need to scrap this idea.
>>>> Let's go back to push/pop r11 around calls :(
>>> I didn't give up :)
>>>
>>> Here is a new idea that seems to work:
>>>
>>> [ 131.472066] dump_stack_lvl+0x53/0x70
>>> [ 131.472066] bpf_task_storage_get+0x3e/0x2f0
>>> [ 131.472066] ? bpf_task_storage_get+0x231/0x2f0
>>> [ 131.472066] bpf_prog_ed7a5f33cc9fefab_foo+0x30/0x32
>>> [ 131.472066] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d
>>> ...
>>> [ 131.417145] dump_stack_lvl+0x53/0x70
>>> [ 131.417145] bpf_task_storage_get+0x3e/0x2f0
>>> [ 131.417145] ? selinux_netlbl_socket_post_create+0xab/0x150
>>> [ 131.417145] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x60/0x6d
>>>
>>>
>>> The stack dump works fine out of main prog and out of subprog.
>>>
>>> The key difference it to pretend to have stack_depth=0,
>>> so there is no adjustment to %rsp,
>>> but point %rbp to per-cpu private stack and grow it _up_.
>>>
>>> For the main prog %rbp points to the bottom of priv stack
>>> plus stack_depth it needs,
>>> so all bpf insns that do r10-off access the bottom of that priv stack.
>>> When subprog is called it does 'add %rbp, its_stack_depth' and
>>> in turn it's using memory above the bottom of the priv stack.
>>>
>>> That seems to work, but exceptions and tailcalls are broken.
>>> I ran out of time today to debug.
>>> Pls see the attached patch.
>> The core part of the code is below:
>>
>> EMIT1(0x55); /* push rbp */ - EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp
>> */ + if (tail_call_reachable || !bpf_prog->aux->priv_stack_ptr) { +
>> EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ + } else { + if
>> (!is_subprog) { + /* mov rsp, pcpu_priv_stack_bottom */ + void __percpu
>> *priv_frame_ptr = + bpf_prog->aux->priv_stack_ptr +
>> round_up(stack_depth, 8); + + /* movabs sp, priv_frame_ptr */ +
>> emit_mov_imm64(&prog, AUX_REG, (long) priv_frame_ptr >> 32, + (u32)
>> (long) priv_frame_ptr); + + /* add <aux_reg>, gs:[<off>] */ +
>> EMIT2(0x65, 0x4c); + EMIT3(0x03, 0x1c, 0x25); + EMIT((u32)(unsigned
>> long)&this_cpu_off, 4); + /* mov rbp, aux_reg */ + EMIT3(0x4c, 0x89,
>> 0xdd); + } else { + /* add rbp, stack_depth */ + EMIT3_off32(0x48, 0x81,
>> 0xC5, round_up(stack_depth, 8)); + } + }
> your mailer garbled the diff.
Sorry, I just copy-paste from your attached code. It shows properly
when I send email. I guess, I need to ensure I use proper format
in my editor.
>
>> So for main program, we have
>>
>> push rbp rbp = per_cpu_ptr(priv_stack_ptr + stack_size) ... What will
>> happen we have an interrupt like below? push rbp rbp =
>> per_cpu_ptr(priv_stack_ptr + stack_size) <=== interrupt happens here ...
>> If we need to dump the stack trace at interrupt point then unwinder may
>> have difficulty to find the proper stack trace since *rbp is a arbitrary
>> value and *(rbp + 8) will not have proper func return address. Does this
>> make sense?
> Hard to read above... but I think you're saying that rbp will point
Sorry again. Formating issue again.
> to priv stack, irq happens and unwinder cannot work ?
> Yes. I was also expecting it to break, but orc unwinder
> with fallback to fp somehow did it correctly. See above stack dumps.
> For the top frame the unwinder starts from SP, so it's fine,
> but for the subprog 'foo' above the 'push rbp' pushes the
> addr of priv stack, so the chain should be broken,
> but the printed stack is correct, so I'm puzzled why it worked :)
We still have issues here. With 'rbp = ...' approach, I got
stack:
[ 53.429814] Call Trace:
[ 53.430177] <TASK>
[ 53.430498] dump_stack_lvl+0x52/0x70
[ 53.431067] bpf_task_storage_get+0x41/0x120
[ 53.431680] bpf_prog_71392c3ef5437fd9_foo+0x30/0x32
[ 53.432404] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d
[ 53.433241] ? bpf_trampoline_6442549714+0x68/0x10d
[ 53.433879] ? bpf_lsm_socket_post_create+0x9/0x20
[ 53.434512] ? security_socket_post_create+0x6e/0xd0
[ 53.435166] ? __sock_create+0x19e/0x2d0
[ 53.435686] ? __sys_socket+0x56/0xd0
[ 53.436176] ? __x64_sys_socket+0x19/0x30
[ 53.436702] ? do_syscall_64+0x58/0xf0
[ 53.437201] ? clear_bhb_loop+0x45/0xa0
[ 53.437746] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 53.438488] </TASK>
With the original kernel plus the following hack:
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -255,6 +255,7 @@ BPF_CALL_5(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
return (unsigned long)NULL;
+ dump_stack();
bpf_task_storage_lock();
data = __bpf_task_storage_get(map, task, value, flags,
gfp_flags, true);
I got stack trace:
[ 32.146519] Call Trace:
[ 32.146979] <TASK>
[ 32.147356] dump_stack_lvl+0x52/0x70
[ 32.147984] bpf_task_storage_get+0x41/0x120
[ 32.148741] bpf_prog_3c50a12b50fe949a_socket_post_create+0x5d/0xaa
[ 32.149844] bpf_trampoline_6442512791+0x68/0x10d
[ 32.150679] bpf_lsm_socket_post_create+0x9/0x20
[ 32.151451] security_socket_post_create+0x6e/0xd0
[ 32.152320] __sock_create+0x19e/0x2d0
[ 32.153059] __sys_socket+0x56/0xd0
[ 32.153779] __x64_sys_socket+0x19/0x30
[ 32.154561] do_syscall_64+0x58/0xf0
[ 32.155225] ? clear_bhb_loop+0x45/0xa0
[ 32.155970] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 32.156864] RIP: 0033:0x7f580d11385b
[ 32.157554] Code: 8b 54 24 08 64 48 2b 14 25 28 00 00 00 75 05 48 83 c4 18 c3 67 e8 65 d0 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 29 00 00 00 0f 05 <48> 3d 8
[ 32.160990] RSP: 002b:00007f58005ffea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
[ 32.162500] RAX: ffffffffffffffda RBX: 00007f5800600cdc RCX: 00007f580d11385b
[ 32.163907] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000002
[ 32.165292] RBP: 00007f58005ffed0 R08: 0000000000000000 R09: 00007f58006006c0
[ 32.166608] R10: 0000000000000008 R11: 0000000000000246 R12: ffffffffffffff80
[ 32.167898] R13: 0000000000000002 R14: 00007ffc461fba30 R15: 00007f57ffe00000
[ 32.169119] </TASK>
The difference is after bpf prog, the kernel stack trace
does not have '?' while with private stack and 'rbp = priv_stack_ptr'
approach, we have '?'.
The reason is that for private stack, when unwinder find the 'rbp', it
is not able to find the previous frame return address and previous proper 'rbp'.
next prev parent reply other threads:[~2024-10-09 15:56 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-26 23:45 [PATCH bpf-next v3 0/5] bpf: Support private stack for bpf progs Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 1/5] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 2/5] bpf: Collect stack depth information Yonghong Song
2024-09-30 14:42 ` Alexei Starovoitov
2024-09-30 16:23 ` Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 3/5] bpf: Mark each subprog with proper pstack states Yonghong Song
2024-09-30 14:49 ` Alexei Starovoitov
2024-09-30 16:26 ` Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack Yonghong Song
2024-09-27 4:58 ` Leon Hwang
2024-09-27 15:24 ` Yonghong Song
2024-09-29 8:31 ` kernel test robot
2024-09-30 16:29 ` Yonghong Song
2024-09-29 13:02 ` kernel test robot
2024-09-30 16:31 ` Yonghong Song
2024-09-29 13:34 ` kernel test robot
2024-09-30 15:03 ` Alexei Starovoitov
2024-09-30 16:33 ` Yonghong Song
2024-10-01 4:31 ` Kumar Kartikeya Dwivedi
2024-10-01 4:37 ` Kumar Kartikeya Dwivedi
2024-10-01 18:49 ` Alexei Starovoitov
2024-10-01 19:53 ` yet another approach Was: " Alexei Starovoitov
2024-10-01 20:50 ` Kumar Kartikeya Dwivedi
2024-10-01 21:28 ` Alexei Starovoitov
2024-10-02 0:22 ` Kumar Kartikeya Dwivedi
2024-10-02 1:26 ` Alexei Starovoitov
2024-10-02 2:16 ` Kumar Kartikeya Dwivedi
2024-10-02 6:28 ` Yonghong Song
2024-10-02 6:48 ` Yonghong Song
2024-10-03 6:17 ` Yonghong Song
2024-10-03 13:39 ` Kumar Kartikeya Dwivedi
2024-10-03 17:35 ` Alexei Starovoitov
2024-10-03 18:53 ` Yonghong Song
2024-10-03 20:44 ` Yonghong Song
2024-10-03 20:47 ` Kumar Kartikeya Dwivedi
2024-10-03 20:54 ` Yonghong Song
2024-10-03 22:32 ` Alexei Starovoitov
2024-10-04 5:22 ` Yonghong Song
2024-10-04 19:27 ` Yonghong Song
2024-10-04 19:52 ` Alexei Starovoitov
2024-10-05 2:03 ` Yonghong Song
2024-10-08 22:10 ` Alexei Starovoitov
2024-10-09 2:06 ` Alexei Starovoitov
2024-10-09 6:31 ` Yonghong Song
2024-10-09 14:56 ` Alexei Starovoitov
2024-10-09 15:56 ` Yonghong Song [this message]
2024-10-09 16:36 ` Kumar Kartikeya Dwivedi
2024-10-09 16:38 ` Kumar Kartikeya Dwivedi
2024-10-09 17:37 ` Kumar Kartikeya Dwivedi
2024-10-09 6:12 ` Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 5/5] selftests/bpf: Add private stack tests Yonghong Song
2024-09-30 13:40 ` Jiri Olsa
2024-09-30 15:05 ` Alexei Starovoitov
2024-09-30 16:35 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5be197d5-dec6-4d65-9908-1bfb6267d091@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=memxor@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox