From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
Date: Tue, 23 Jul 2024 22:08:51 -0700 [thread overview]
Message-ID: <f12db0b4-bcd4-4fb3-a0cf-35c96c2b549c@linux.dev> (raw)
In-Reply-To: <036e4320-1e22-4066-bfa5-42b1fa290a39@linux.dev>
On 7/22/24 9:43 AM, Yonghong Song wrote:
>
> On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song
>> <yonghong.song@linux.dev> wrote:
>>> The main motivation for private stack comes from nested
>>> scheduler in sched-ext from Tejun. The basic idea is that
>>> - each cgroup will its own associated bpf program,
>>> - bpf program with parent cgroup will call bpf programs
>>> in immediate child cgroups.
>>>
>>> Let us say we have the following cgroup hierarchy:
>>> root_cg (prog0):
>>> cg1 (prog1):
>>> cg11 (prog11):
>>> cg111 (prog111)
>>> cg112 (prog112)
>>> cg12 (prog12):
>>> cg121 (prog121)
>>> cg122 (prog122)
>>> cg2 (prog2):
>>> cg21 (prog21)
>>> cg22 (prog22)
>>> cg23 (prog23)
>>>
>>> In the above example, prog0 will call a kfunc which will
>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>> then the information is summarized and sent back to prog0.
>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>> and the result will be summarized and sent back to prog1, etc.
>>>
>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>> The each bpf program (including its subprograms) has maximum
>>> 512B stack size to avoid potential stack overflow.
>>> And nested bpf programs increase the risk of stack overflow.
>>> To avoid potential stack overflow caused by bpf programs,
>>> this patch implemented a private stack so bpf program stack
>>> space is allocated dynamically when the program is jited.
>>> Such private stack is applied to tracing programs like
>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>> tracing.
>>>
>>> But more than one instance of the same bpf program may
>>> run in the system. To make things simple, percpu private
>>> stack is allocated for each program, so if the same program
>>> is running on different cpus concurrently, we won't have
>>> any issue. Note that the kernel already have logic to prevent
>>> the recursion for the same bpf program on the same cpu
>>> (kprobe, fentry, etc.).
>>>
>>> The patch implemented a percpu private stack based approach
>>> for x86 arch.
>>> - The stack size will be 0 and any stack access is from
>>> jit-time allocated percpu storage.
>>> - In the beginning of jit, r9 is used to save percpu
>>> private stack pointer.
>>> - Each rbp in the bpf asm insn is replaced by r9.
>>> - For each call, push r9 before the call and pop r9
>>> after the call to preserve r9 value.
>>>
>>> Compared to previous RFC patch [1], this patch added
>>> some conditions to enable private stack, e.g., verifier
>>> calculated stack size, prog type, etc. The new patch
>>> also added a performance test to compare private stack
>>> vs. no private stack.
>>>
>>> The following are some code example to illustrate the idea
>>> for selftest cgroup_skb_sk_lookup:
>>>
>>> the existing code the private-stack
>>> approach code
>>> endbr64 endbr64
>>> nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR
>>> [rax+rax*1+0x0]
>>> xchg ax,ax xchg ax,ax
>>> push rbp push rbp
>>> mov rbp,rsp mov rbp,rsp
>>> endbr64 endbr64
>>> sub rsp,0x68
>>> push rbx push rbx
>>> ... ...
>>> ... mov r9d,0x8c1c860
>>> ... add r9,QWORD PTR
>>> gs:0x21a00
>>> ... ...
>>> mov rdx,rbp mov rdx, r9
>>> add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4
>>> ... ...
>>> mov ecx,0x28 mov ecx,0x28
>>> push r9
>>> call 0xffffffffe305e474 call 0xffffffffe305e524
>>> pop r9
>>> mov rdi,rax mov rdi,rax
>>> ... ...
>>> movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR
>>> [r9-0x46]
>>> ... ...
>>>
>> Eduard nerd-sniped me today with this a bit... :)
>>
>> I have a few questions and suggestions.
>>
>> So it seems like each *subprogram* (not the entire BPF program) gets
>> its own per-CPU private stack allocation. Is that intentional? That
>
> Currently yes. The reason is the same prog could be run on different
> cpus at the same time.
>
>> seems a bit unnecessary. It also prevents any sort of actual
>> recursion. Not sure if it's possible to write recursive BPF subprogram
>> today, verifier seems to reject obvious limited recursion cases, but
>> still, eventually we might need/want to support that, and this will be
>> just another hurdle to overcome (so it's best to avoid adding it in
>> the first place).
>>
>> I'm sure Eduard is going to try something like below and it will
>> probably break badly (I haven't tried, sorry):
>>
>> int entry(void *ctx);
>>
>> struct {
>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>> __uint(max_entries, 1);
>> __uint(key_size, sizeof(__u32));
>> __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>> .values = {
>> [0] = (void *)&entry,
>> },
>> };
>>
>> static __noinline int subprog1(void)
>> {
>> <some state on the stack>
>>
>> /* here entry will replace subprog1, and so we'll have
>> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>> */
>> bpf_tail_call(ctx, &prog_array_init, 0);
>>
>> return 0;
>> }
>>
>>
>> SEC("raw_tp/sys_enter")
>> int entry(void *ctx)
>> {
>> <some state on the stack>
>>
>> subprog1();
>> }
>>
>> And we effectively have limited recursion where the entry's stack
>> state is clobbered, no?
>>
>> So it seems like we need to support recursion.
>>
>>
>> So, the question I have is. Why not do the following:
>> a) only setup r9 *once* in entry program's prologue (before tail call
>> jump target)
>> b) before each call we can adjust r9 with current prog/subprog's
>> maximum *own* stack, something like:
>>
>> push r9;
>> r9 += 128; // 128 is subprog's stack usage
>> call <some-subprog>
>> pop r9;
>>
>> The idea being that on tail call or in subprog call we assume r9 is
>> already pointing to the right place. We can probably also figure out
>> how to avoid push/pop r9 if we make sure that subprogram always
>> restores r9 (taking tail calls into account and all that, of course)?
>>
>> Is this feasible?
>
> This is possible. I actually hacked such an idea easily. The basic
> idea is push frame pointer as an additional argument to the bpf
> static sub-prog. This is a little bit complicated. It will probably
> save some stack size but I am not sure how much it is.
Discussed with Andrii. I think the following approach should work.
For each non-static prog, the private stack is allocated including
that non-static prog and the called static progs. For example,
main_prog
static_prog_1
static_prog_11
global_prog
static_prog_12
static_prog_2
So in verifier we calculate stack size for
main_prog
static_prog_1
static_prog_11
static_prog_2
and
global_prog
static_prog_12
Let us say the stack size for main_prog like below for each (sub)prog
main_prog // stack size 100
static_prog_1 // stack size 100
static_prog_11 // stack size 100
static_prog_2 // static size 100
so total static size is 300 so the private stack size will be 300.
So R9 is calculated like below
main_prog
R9 = ... // for tailcall reachable, R9 may be original R9 + offset
// for non-tailcall reachable, R9 equals the original R9 (based on jit-time allocation).
... R9 ...
R9 += 100
static_prog_1
... R9 ...
R9 += 100
static_prog_11
... R9 ...
R9 -= 100
R9 -= 100
... R9 ...
R9 += 100
static_prog_2
... R9 ...
R9 -= 100
Similary, we can calculate R9 offset for
global_prog
static_prog_12
as well.
next prev parent reply other threads:[~2024-07-24 5:09 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44 ` Yonghong Song
2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
2024-07-19 3:01 ` Yonghong Song
2024-07-19 0:36 ` Alexei Starovoitov
2024-07-19 2:21 ` Yonghong Song
2024-07-20 0:14 ` bot+bpf-ci
2024-07-20 1:08 ` Alexei Starovoitov
2024-07-22 16:33 ` Yonghong Song
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43 ` Yonghong Song
2024-07-24 5:08 ` Yonghong Song [this message]
2024-07-24 16:54 ` Alexei Starovoitov
2024-07-24 17:56 ` Yonghong Song
2024-07-22 20:57 ` Andrii Nakryiko
2024-07-23 1:05 ` Alexei Starovoitov
2024-07-23 3:26 ` Andrii Nakryiko
2024-07-24 3:17 ` Alexei Starovoitov
2024-07-24 4:06 ` Andrii Nakryiko
2024-07-24 4:46 ` Yonghong Song
2024-07-24 4:32 ` Yonghong Song
2024-07-23 5:30 ` Yonghong Song
2024-07-23 7:02 ` Yonghong Song
2024-07-22 3:33 ` Eduard Zingerman
2024-07-22 16:54 ` Yonghong Song
2024-07-22 17:53 ` Eduard Zingerman
2024-07-22 17:51 ` Alexei Starovoitov
2024-07-22 18:22 ` Eduard Zingerman
2024-07-22 20:08 ` Alexei Starovoitov
2024-07-24 21:28 ` Yonghong Song
2024-07-25 4:55 ` Alexei Starovoitov
2024-07-25 17:20 ` Eduard Zingerman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f12db0b4-bcd4-4fb3-a0cf-35c96c2b549c@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox