Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

BPF List
 help / color / mirror / Atom feed

From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
Date: Mon, 22 Jul 2024 22:30:05 -0700	[thread overview]
Message-ID: <7dce9923-2a18-4b41-8b40-420e1cfa82e8@linux.dev> (raw)
In-Reply-To: <CAADnVQJDE24HQD7KYRRu1Nsz9965op=62dhx7HqW2QZRzHGBKQ@mail.gmail.com>


On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> The main motivation for private stack comes from nested
>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>>   - each cgroup will its own associated bpf program,
>>>>   - bpf program with parent cgroup will call bpf programs
>>>>     in immediate child cgroups.
>>>>
>>>> Let us say we have the following cgroup hierarchy:
>>>>    root_cg (prog0):
>>>>      cg1 (prog1):
>>>>        cg11 (prog11):
>>>>          cg111 (prog111)
>>>>          cg112 (prog112)
>>>>        cg12 (prog12):
>>>>          cg121 (prog121)
>>>>          cg122 (prog122)
>>>>      cg2 (prog2):
>>>>        cg21 (prog21)
>>>>        cg22 (prog22)
>>>>        cg23 (prog23)
>>>>
>>>> In the above example, prog0 will call a kfunc which will
>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>> then the information is summarized and sent back to prog0.
>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>> and the result will be summarized and sent back to prog1, etc.
>>>>
>>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>>> The each bpf program (including its subprograms) has maximum
>>>> 512B stack size to avoid potential stack overflow.
>>>> And nested bpf programs increase the risk of stack overflow.
>>>> To avoid potential stack overflow caused by bpf programs,
>>>> this patch implemented a private stack so bpf program stack
>>>> space is allocated dynamically when the program is jited.
>>>> Such private stack is applied to tracing programs like
>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>> tracing.
>>>>
>>>> But more than one instance of the same bpf program may
>>>> run in the system. To make things simple, percpu private
>>>> stack is allocated for each program, so if the same program
>>>> is running on different cpus concurrently, we won't have
>>>> any issue. Note that the kernel already have logic to prevent
>>>> the recursion for the same bpf program on the same cpu
>>>> (kprobe, fentry, etc.).
>>>>
>>>> The patch implemented a percpu private stack based approach
>>>> for x86 arch.
>>>>    - The stack size will be 0 and any stack access is from
>>>>      jit-time allocated percpu storage.
>>>>    - In the beginning of jit, r9 is used to save percpu
>>>>      private stack pointer.
>>>>    - Each rbp in the bpf asm insn is replaced by r9.
>>>>    - For each call, push r9 before the call and pop r9
>>>>      after the call to preserve r9 value.
>>>>
>>>> Compared to previous RFC patch [1], this patch added
>>>> some conditions to enable private stack, e.g., verifier
>>>> calculated stack size, prog type, etc. The new patch
>>>> also added a performance test to compare private stack
>>>> vs. no private stack.
>>>>
>>>> The following are some code example to illustrate the idea
>>>> for selftest cgroup_skb_sk_lookup:
>>>>
>>>>     the existing code                        the private-stack approach code
>>>>     endbr64                                  endbr64
>>>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>>>     xchg   ax,ax                             xchg   ax,ax
>>>>     push   rbp                               push   rbp
>>>>     mov    rbp,rsp                           mov    rbp,rsp
>>>>     endbr64                                  endbr64
>>>>     sub    rsp,0x68
>>>>     push   rbx                               push   rbx
>>>>     ...                                      ...
>>>>     ...                                      mov    r9d,0x8c1c860
>>>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>>>     ...                                      ...
>>>>     mov    rdx,rbp                           mov    rdx, r9
>>>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>>>     ...                                      ...
>>>>     mov    ecx,0x28                          mov    ecx,0x28
>>>>                                              push   r9
>>>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>>>                                              pop    r9
>>>>     mov    rdi,rax                           mov    rdi,rax
>>>>     ...                                      ...
>>>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>>>     ...                                      ...
>>>>
>>> Eduard nerd-sniped me today with this a bit... :)
>>>
>>> I have a few questions and suggestions.
>>>
>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>> its own per-CPU private stack allocation. Is that intentional? That
>>> seems a bit unnecessary. It also prevents any sort of actual
>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>> today, verifier seems to reject obvious limited recursion cases, but
>>> still, eventually we might need/want to support that, and this will be
>>> just another hurdle to overcome (so it's best to avoid adding it in
>>> the first place).
>>>
>>> I'm sure Eduard is going to try something like below and it will
>>> probably break badly (I haven't tried, sorry):
>>>
>>> int entry(void *ctx);
>>>
>>> struct {
>>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>          __uint(max_entries, 1);
>>>          __uint(key_size, sizeof(__u32));
>>>          __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>>          .values = {
>>>                  [0] = (void *)&entry,
>>>          },
>>> };
>>>
>>> static __noinline int subprog1(void)
>>> {
>>>      <some state on the stack>
>>>
>>>      /* here entry will replace subprog1, and so we'll have
>>>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>>       */
>>>      bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>>      return 0;
>>> }
>>>
>>>
>>> SEC("raw_tp/sys_enter")
>>> int entry(void *ctx)
>>> {
>>>       <some state on the stack>
>>>
>>>       subprog1();
>>> }
>>>
>>> And we effectively have limited recursion where the entry's stack
>>> state is clobbered, no?
>>>
>>> So it seems like we need to support recursion.
>>>
>> How come everyone just completely ignored the main point of my entire
>> email and a real problem that has to be solved?...
>>
>> Anyways, I did write a below program:
>>
>> $ cat minimal.bpf.c
>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /* Copyright (c) 2020 Facebook */
>> #include <linux/bpf.h>
>> #include <bpf/bpf_helpers.h>
>>
>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>
>> int my_pid = 0;
>>
>> int handle_tp(void *ctx);
>>
>> struct {
>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>          __uint(max_entries, 1);
>>          __uint(key_size, sizeof(__u32));
>>          __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>>          .values = {
>>                  [0] = (void *)&handle_tp,
>>          },
>> };
>>
>> static __noinline int subprog(void *ctx)
>> {
>>          static int cnt;
>>
>>          cnt++;
>>
>>          bpf_printk("SUBPROG - BEFORE %d", cnt);
>>
>>          bpf_tail_call(ctx, &prog_array_init, 0);
>>
>>          bpf_printk("SUBPROG - AFTER %d", cnt);
>>
>>      return 0;
>> }
>>
>> SEC("tp/syscalls/sys_enter_write")
>> int handle_tp(void *ctx)
>> {
>>          static int cnt;
>>          int pid = bpf_get_current_pid_tgid() >> 32;
>>
>>          if (pid != my_pid)
>>                  return 0;
>>
>>          cnt++;
>>
>>          bpf_printk("ENTRY - BEFORE %d", cnt);
>>
>>          subprog(ctx);
>>
>>          bpf_printk("ENTRY - AFTER %d", cnt);
>>
>>          return 0;
>> }
>>
>>
>> And triggered one write syscall, getting the log above. You can see
>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>> to the tail call limit). And we do indeed get lots of entry program
>> recurrence.
>>
>> We *need to support recursion* is my main point.
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.

Right, stack_depth argument in emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
should be 0 if private stack is used. Will fix in next revision.

next prev parent reply	other threads:[~2024-07-23  5:30 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44   ` Yonghong Song
2024-07-18 21:59     ` Kumar Kartikeya Dwivedi
2024-07-19  3:01       ` Yonghong Song
2024-07-19  0:36     ` Alexei Starovoitov
2024-07-19  2:21       ` Yonghong Song
2024-07-20  0:14   ` bot+bpf-ci
2024-07-20  1:08   ` Alexei Starovoitov
2024-07-22 16:33     ` Yonghong Song
2024-07-20  3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43   ` Yonghong Song
2024-07-24  5:08     ` Yonghong Song
2024-07-24 16:54       ` Alexei Starovoitov
2024-07-24 17:56         ` Yonghong Song
2024-07-22 20:57   ` Andrii Nakryiko
2024-07-23  1:05     ` Alexei Starovoitov
2024-07-23  3:26       ` Andrii Nakryiko
2024-07-24  3:17         ` Alexei Starovoitov
2024-07-24  4:06           ` Andrii Nakryiko
2024-07-24  4:46             ` Yonghong Song
2024-07-24  4:32           ` Yonghong Song
2024-07-23  5:30       ` Yonghong Song [this message]
2024-07-23  7:02         ` Yonghong Song
2024-07-22  3:33 ` Eduard Zingerman
2024-07-22 16:54   ` Yonghong Song
2024-07-22 17:53     ` Eduard Zingerman
2024-07-22 17:51   ` Alexei Starovoitov
2024-07-22 18:22     ` Eduard Zingerman
2024-07-22 20:08       ` Alexei Starovoitov
2024-07-24 21:28   ` Yonghong Song
2024-07-25  4:55     ` Alexei Starovoitov
2024-07-25 17:20       ` Eduard Zingerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7dce9923-2a18-4b41-8b40-420e1cfa82e8@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox