Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
Date: Mon, 22 Jul 2024 22:30:05 -0700	[thread overview]
Message-ID: <7dce9923-2a18-4b41-8b40-420e1cfa82e8@linux.dev> (raw)
In-Reply-To: <CAADnVQJDE24HQD7KYRRu1Nsz9965op=62dhx7HqW2QZRzHGBKQ@mail.gmail.com>


On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> The main motivation for private stack comes from nested
>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>>   - each cgroup will its own associated bpf program,
>>>>   - bpf program with parent cgroup will call bpf programs
>>>>     in immediate child cgroups.
>>>>
>>>> Let us say we have the following cgroup hierarchy:
>>>>    root_cg (prog0):
>>>>      cg1 (prog1):
>>>>        cg11 (prog11):
>>>>          cg111 (prog111)
>>>>          cg112 (prog112)
>>>>        cg12 (prog12):
>>>>          cg121 (prog121)
>>>>          cg122 (prog122)
>>>>      cg2 (prog2):
>>>>        cg21 (prog21)
>>>>        cg22 (prog22)
>>>>        cg23 (prog23)
>>>>
>>>> In the above example, prog0 will call a kfunc which will
>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>> then the information is summarized and sent back to prog0.
>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>> and the result will be summarized and sent back to prog1, etc.
>>>>
>>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>>> The each bpf program (including its subprograms) has maximum
>>>> 512B stack size to avoid potential stack overflow.
>>>> And nested bpf programs increase the risk of stack overflow.
>>>> To avoid potential stack overflow caused by bpf programs,
>>>> this patch implemented a private stack so bpf program stack
>>>> space is allocated dynamically when the program is jited.
>>>> Such private stack is applied to tracing programs like
>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>> tracing.
>>>>
>>>> But more than one instance of the same bpf program may
>>>> run in the system. To make things simple, percpu private
>>>> stack is allocated for each program, so if the same program
>>>> is running on different cpus concurrently, we won't have
>>>> any issue. Note that the kernel already have logic to prevent
>>>> the recursion for the same bpf program on the same cpu
>>>> (kprobe, fentry, etc.).
>>>>
>>>> The patch implemented a percpu private stack based approach
>>>> for x86 arch.
>>>>    - The stack size will be 0 and any stack access is from
>>>>      jit-time allocated percpu storage.
>>>>    - In the beginning of jit, r9 is used to save percpu
>>>>      private stack pointer.
>>>>    - Each rbp in the bpf asm insn is replaced by r9.
>>>>    - For each call, push r9 before the call and pop r9
>>>>      after the call to preserve r9 value.
>>>>
>>>> Compared to previous RFC patch [1], this patch added
>>>> some conditions to enable private stack, e.g., verifier
>>>> calculated stack size, prog type, etc. The new patch
>>>> also added a performance test to compare private stack
>>>> vs. no private stack.
>>>>
>>>> The following are some code example to illustrate the idea
>>>> for selftest cgroup_skb_sk_lookup:
>>>>
>>>>     the existing code                        the private-stack approach code
>>>>     endbr64                                  endbr64
>>>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>>>     xchg   ax,ax                             xchg   ax,ax
>>>>     push   rbp                               push   rbp
>>>>     mov    rbp,rsp                           mov    rbp,rsp
>>>>     endbr64                                  endbr64
>>>>     sub    rsp,0x68
>>>>     push   rbx                               push   rbx
>>>>     ...                                      ...
>>>>     ...                                      mov    r9d,0x8c1c860
>>>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>>>     ...                                      ...
>>>>     mov    rdx,rbp                           mov    rdx, r9
>>>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>>>     ...                                      ...
>>>>     mov    ecx,0x28                          mov    ecx,0x28
>>>>                                              push   r9
>>>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>>>                                              pop    r9
>>>>     mov    rdi,rax                           mov    rdi,rax
>>>>     ...                                      ...
>>>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>>>     ...                                      ...
>>>>
>>> Eduard nerd-sniped me today with this a bit... :)
>>>
>>> I have a few questions and suggestions.
>>>
>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>> its own per-CPU private stack allocation. Is that intentional? That
>>> seems a bit unnecessary. It also prevents any sort of actual
>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>> today, verifier seems to reject obvious limited recursion cases, but
>>> still, eventually we might need/want to support that, and this will be
>>> just another hurdle to overcome (so it's best to avoid adding it in
>>> the first place).
>>>
>>> I'm sure Eduard is going to try something like below and it will
>>> probably break badly (I haven't tried, sorry):
>>>
>>> int entry(void *ctx);
>>>
>>> struct {
>>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>          __uint(max_entries, 1);
>>>          __uint(key_size, sizeof(__u32));
>>>          __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>>          .values = {
>>>                  [0] = (void *)&entry,
>>>          },
>>> };
>>>
>>> static __noinline int subprog1(void)
>>> {
>>>      <some state on the stack>
>>>
>>>      /* here entry will replace subprog1, and so we'll have
>>>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>>       */
>>>      bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>>      return 0;
>>> }
>>>
>>>
>>> SEC("raw_tp/sys_enter")
>>> int entry(void *ctx)
>>> {
>>>       <some state on the stack>
>>>
>>>       subprog1();
>>> }
>>>
>>> And we effectively have limited recursion where the entry's stack
>>> state is clobbered, no?
>>>
>>> So it seems like we need to support recursion.
>>>
>> How come everyone just completely ignored the main point of my entire
>> email and a real problem that has to be solved?...
>>
>> Anyways, I did write a below program:
>>
>> $ cat minimal.bpf.c
>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /* Copyright (c) 2020 Facebook */
>> #include <linux/bpf.h>
>> #include <bpf/bpf_helpers.h>
>>
>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>
>> int my_pid = 0;
>>
>> int handle_tp(void *ctx);
>>
>> struct {
>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>          __uint(max_entries, 1);
>>          __uint(key_size, sizeof(__u32));
>>          __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>>          .values = {
>>                  [0] = (void *)&handle_tp,
>>          },
>> };
>>
>> static __noinline int subprog(void *ctx)
>> {
>>          static int cnt;
>>
>>          cnt++;
>>
>>          bpf_printk("SUBPROG - BEFORE %d", cnt);
>>
>>          bpf_tail_call(ctx, &prog_array_init, 0);
>>
>>          bpf_printk("SUBPROG - AFTER %d", cnt);
>>
>>      return 0;
>> }
>>
>> SEC("tp/syscalls/sys_enter_write")
>> int handle_tp(void *ctx)
>> {
>>          static int cnt;
>>          int pid = bpf_get_current_pid_tgid() >> 32;
>>
>>          if (pid != my_pid)
>>                  return 0;
>>
>>          cnt++;
>>
>>          bpf_printk("ENTRY - BEFORE %d", cnt);
>>
>>          subprog(ctx);
>>
>>          bpf_printk("ENTRY - AFTER %d", cnt);
>>
>>          return 0;
>> }
>>
>>
>> And triggered one write syscall, getting the log above. You can see
>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>> to the tail call limit). And we do indeed get lots of entry program
>> recurrence.
>>
>> We *need to support recursion* is my main point.
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.

Right, stack_depth argument in emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
should be 0 if private stack is used. Will fix in next revision.

next prev parent reply	other threads:[~2024-07-23  5:30 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44   ` Yonghong Song
2024-07-18 21:59     ` Kumar Kartikeya Dwivedi
2024-07-19  3:01       ` Yonghong Song
2024-07-19  0:36     ` Alexei Starovoitov
2024-07-19  2:21       ` Yonghong Song
2024-07-20  0:14   ` bot+bpf-ci
2024-07-20  1:08   ` Alexei Starovoitov
2024-07-22 16:33     ` Yonghong Song
2024-07-20  3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43   ` Yonghong Song
2024-07-24  5:08     ` Yonghong Song
2024-07-24 16:54       ` Alexei Starovoitov
2024-07-24 17:56         ` Yonghong Song
2024-07-22 20:57   ` Andrii Nakryiko
2024-07-23  1:05     ` Alexei Starovoitov
2024-07-23  3:26       ` Andrii Nakryiko
2024-07-24  3:17         ` Alexei Starovoitov
2024-07-24  4:06           ` Andrii Nakryiko
2024-07-24  4:46             ` Yonghong Song
2024-07-24  4:32           ` Yonghong Song
2024-07-23  5:30       ` Yonghong Song [this message]
2024-07-23  7:02         ` Yonghong Song
2024-07-22  3:33 ` Eduard Zingerman
2024-07-22 16:54   ` Yonghong Song
2024-07-22 17:53     ` Eduard Zingerman
2024-07-22 17:51   ` Alexei Starovoitov
2024-07-22 18:22     ` Eduard Zingerman
2024-07-22 20:08       ` Alexei Starovoitov
2024-07-24 21:28   ` Yonghong Song
2024-07-25  4:55     ` Alexei Starovoitov
2024-07-25 17:20       ` Eduard Zingerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7dce9923-2a18-4b41-8b40-420e1cfa82e8@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.