* [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
@ 2024-07-18 20:51 Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-18 20:51 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
The main motivation for the private stack comes from the nested
scheduler in sched-ext from Tejun. The basic idea is that
- each cgroup will have its own associated bpf program, and
- the bpf program of a parent cgroup will call the bpf programs
  of its immediate child cgroups.
Let us say we have the following cgroup hierarchy:
root_cg (prog0):
    cg1 (prog1):
        cg11 (prog11):
            cg111 (prog111)
            cg112 (prog112)
        cg12 (prog12):
            cg121 (prog121)
            cg122 (prog122)
    cg2 (prog2):
        cg21 (prog21)
        cg22 (prog22)
        cg23 (prog23)
In the above example, prog0 will call a kfunc which will
call prog1 and prog2 to get sched info for cg1 and cg2, and
the information is then summarized and sent back to prog0.
Similarly, prog11 and prog12 will be invoked in the kfunc
and the results will be summarized and sent back to prog1, etc.
Currently, the x86 kernel allocates an 8KB stack for each thread.
Each bpf program (including its subprograms) is limited to a
maximum stack size of 512B to avoid potential stack overflow,
and nested bpf programs increase the risk of stack overflow.
To avoid potential stack overflow caused by bpf programs,
this patch implements a private stack so that bpf program stack
space is allocated dynamically when the program is jited. The
private stack applies to tracing programs like kprobe/uprobe,
perf_event, tracepoint, raw tracepoint and tracing.
But more than one instance of the same bpf program may
run in the system. To keep things simple, a percpu private
stack is allocated for each program, so if the same program
is running on different cpus concurrently, we won't have
any issue. Note that the kernel already has logic to prevent
recursion of the same bpf program on the same cpu
(kprobe, fentry, etc.).
This patch implements a percpu private stack based approach
for the x86 arch:
- The stack size will be 0 and any stack access goes to the
  jit-time allocated percpu storage.
- At the beginning of the jited code, r9 is loaded with the
  percpu private stack pointer.
- Each rbp in a bpf asm insn is replaced by r9.
- For each call, push r9 before the call and pop r9 after
  the call to preserve the r9 value.
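Conceptually, the two prologue instructions just compute a
this_cpu_ptr()-style address at the high end of the percpu region.
A minimal C sketch of what r9 ends up holding (the function name is
made up for illustration; raw_cpu_read()/this_cpu_off are the
existing x86 percpu primitives the jited code mirrors):

    /* Sketch: the address materialized in r9 by the jit prologue.
     * private_stack_ptr is the __alloc_percpu() region allocated at
     * jit time; the frame pointer sits round_up(stack_depth, 8)
     * above it so that negative offsets off r9 behave like negative
     * offsets off rbp.
     */
    static void *bpf_private_frame_ptr(void __percpu *private_stack_ptr,
                                       u32 stack_depth)
    {
            /* movabs r9, imm64 */
            unsigned long frame = (unsigned long)private_stack_ptr +
                                  round_up(stack_depth, 8);

            /* add r9, QWORD PTR gs:[this_cpu_off] */
            return (void *)(frame + raw_cpu_read(this_cpu_off));
    }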
Compared to the previous RFC patch [1], this patch adds some
conditions for enabling the private stack, e.g., the verifier-
calculated stack size, prog type, etc. The new patch also adds
a performance test comparing private stack vs. no private stack.
The following code example illustrates the idea, using the
selftest cgroup_skb_sk_lookup:
   the existing code                 the private-stack approach code

   endbr64                           endbr64
   nop    DWORD PTR [rax+rax*1+0x0]  nop    DWORD PTR [rax+rax*1+0x0]
   xchg   ax,ax                      xchg   ax,ax
   push   rbp                        push   rbp
   mov    rbp,rsp                    mov    rbp,rsp
   endbr64                           endbr64
   sub    rsp,0x68
   push   rbx                        push   rbx
   ...                               ...
   ...                               mov    r9d,0x8c1c860
   ...                               add    r9,QWORD PTR gs:0x21a00
   ...                               ...
   mov    rdx,rbp                    mov    rdx,r9
   add    rdx,0xffffffffffffffb4     add    rdx,0xffffffffffffffb4
   ...                               ...
   mov    ecx,0x28                   mov    ecx,0x28
                                     push   r9
   call   0xffffffffe305e474         call   0xffffffffe305e524
                                     pop    r9
   mov    rdi,rax                    mov    rdi,rax
   ...                               ...
   movzx  rdi,BYTE PTR [rbp-0x46]    movzx  rdi,BYTE PTR [r9-0x46]
   ...                               ...
So the number of insns increases by 1 + num_of_calls * 2, where
num_of_calls counts the calls in the final jited binary. Compared
to the cost of the function call itself, the push/pop overhead
should be minimal in most common cases.
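As a quick sanity check against the two columns above:

    prologue: +2 (movabs r9, add r9) - 1 ('sub rsp' no longer needed) = +1
    per call: +2 ('push r9' / 'pop r9')
    total:    1 + num_of_calls * 2, e.g., 3 calls => 7 extra insns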
Our original use case is the sched-ext nested scheduler; that
work will be done in the future.
[1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
include/linux/bpf.h | 2 ++
kernel/bpf/core.c | 20 ++++++++++++
kernel/bpf/syscall.c | 1 +
4 files changed, 82 insertions(+), 4 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d25d81c8ecc0..60f5d86fb6aa 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1309,6 +1309,22 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+static void emit_private_frame_ptr(u8 **pprog, void *private_frame_ptr)
+{
+ u8 *prog = *pprog;
+
+ /* movabs r9, private_frame_ptr */
+ emit_mov_imm64(&prog, X86_REG_R9, (long) private_frame_ptr >> 32,
+ (u32) (long) private_frame_ptr);
+
+ /* add <r9>, gs:[<off>] */
+ EMIT2(0x65, 0x4c);
+ EMIT3(0x03, 0x0c, 0x25);
+ EMIT((u32)(unsigned long)&this_cpu_off, 4);
+
+ *pprog = prog;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
@@ -1324,18 +1340,25 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int insn_cnt = bpf_prog->len;
bool seen_exit = false;
u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ u32 stack_depth = bpf_prog->aux->stack_depth;
+ void __percpu *private_frame_ptr = NULL;
u64 arena_vm_start, user_vm_start;
int i, excnt = 0;
int ilen, proglen = 0;
u8 *prog = temp;
int err;
+ if (bpf_prog->private_stack_ptr) {
+ private_frame_ptr = bpf_prog->private_stack_ptr + round_up(stack_depth, 8);
+ stack_depth = 0;
+ }
+
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
detect_reg_usage(insn, insn_cnt, callee_regs_used);
- emit_prologue(&prog, bpf_prog->aux->stack_depth,
+ emit_prologue(&prog, stack_depth,
bpf_prog_was_classic(bpf_prog), tail_call_reachable,
bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
/* Exception callback will clobber callee regs for its own use, and
@@ -1357,6 +1380,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
emit_mov_imm64(&prog, X86_REG_R12,
arena_vm_start >> 32, (u32) arena_vm_start);
+ if (private_frame_ptr)
+ emit_private_frame_ptr(&prog, private_frame_ptr);
+
ilen = prog - temp;
if (rw_image)
memcpy(rw_image + proglen, temp, ilen);
@@ -1376,6 +1402,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 *func;
int nops;
+ if (private_frame_ptr) {
+ if (src_reg == BPF_REG_FP)
+ src_reg = X86_REG_R9;
+
+ if (dst_reg == BPF_REG_FP)
+ dst_reg = X86_REG_R9;
+ }
+
switch (insn->code) {
/* ALU */
case BPF_ALU | BPF_ADD | BPF_X:
@@ -2007,6 +2041,7 @@ st: if (is_imm8(insn->off))
emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
/* Restore R0 after clobbering RAX */
emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
+
break;
}
@@ -2031,14 +2066,20 @@ st: if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
if (tail_call_reachable) {
- RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
+ RESTORE_TAIL_CALL_CNT(stack_depth);
ip += 7;
}
if (!imm32)
return -EINVAL;
+ if (private_frame_ptr) {
+ EMIT2(0x41, 0x51); /* push r9 */
+ ip += 2;
+ }
ip += x86_call_depth_emit_accounting(&prog, func, ip);
if (emit_call(&prog, func, ip))
return -EINVAL;
+ if (private_frame_ptr)
+ EMIT2(0x41, 0x59); /* pop r9 */
break;
}
@@ -2048,13 +2089,13 @@ st: if (is_imm8(insn->off))
&bpf_prog->aux->poke_tab[imm32 - 1],
&prog, image + addrs[i - 1],
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
ctx);
else
emit_bpf_tail_call_indirect(bpf_prog,
&prog,
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
image + addrs[i - 1],
ctx);
break;
@@ -3218,6 +3259,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
{
struct bpf_binary_header *rw_header = NULL;
struct bpf_binary_header *header = NULL;
+ void __percpu *private_stack_ptr = NULL;
struct bpf_prog *tmp, *orig_prog = prog;
struct x64_jit_data *jit_data;
int proglen, oldproglen = 0;
@@ -3284,6 +3326,15 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
ctx.cleanup_addr = proglen;
skip_init_addrs:
+ if (bpf_enable_private_stack(prog) && !prog->private_stack_ptr) {
+ private_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 8, GFP_KERNEL);
+ if (!private_stack_ptr) {
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ prog->private_stack_ptr = private_stack_ptr;
+ }
+
/*
* JITed image shrinks with every pass and the loop iterates
* until the image stops shrinking. Very large BPF programs
@@ -3309,6 +3360,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
prog->jited = 0;
prog->jited_len = 0;
}
+ if (private_stack_ptr) {
+ free_percpu(private_stack_ptr);
+ prog->private_stack_ptr = NULL;
+ }
goto out_addrs;
}
if (image) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4f1d4a97b9d1..19a3f5355363 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1563,6 +1563,7 @@ struct bpf_prog {
const struct bpf_insn *insn);
struct bpf_prog_aux *aux; /* Auxiliary fields */
struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ void __percpu *private_stack_ptr;
/* Instructions for interpreter */
union {
DECLARE_FLEX_ARRAY(struct sock_filter, insns);
@@ -1819,6 +1820,7 @@ static inline void bpf_module_put(const void *data, struct module *owner)
module_put(owner);
}
int bpf_struct_ops_link_create(union bpf_attr *attr);
+bool bpf_enable_private_stack(struct bpf_prog *prog);
#ifdef CONFIG_NET
/* Define it here to avoid the use of forward declaration */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7ee62e38faf0..f69eb0c5fe03 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2813,6 +2813,26 @@ void bpf_prog_free(struct bpf_prog *fp)
}
EXPORT_SYMBOL_GPL(bpf_prog_free);
+bool bpf_enable_private_stack(struct bpf_prog *prog)
+{
+ if (prog->aux->stack_depth <= 64)
+ return false;
+
+ switch (prog->aux->prog->type) {
+ case BPF_PROG_TYPE_KPROBE:
+ case BPF_PROG_TYPE_TRACEPOINT:
+ case BPF_PROG_TYPE_PERF_EVENT:
+ case BPF_PROG_TYPE_RAW_TRACEPOINT:
+ return true;
+ case BPF_PROG_TYPE_TRACING:
+ if (prog->expected_attach_type != BPF_TRACE_ITER)
+ return true;
+ fallthrough;
+ default:
+ return false;
+ }
+}
+
/* RNG for unprivileged user space with separated state from prandom_u32(). */
static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 869265852d51..89162ddb4747 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2244,6 +2244,7 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
kvfree(aux->func_info);
kfree(aux->func_info_aux);
+ free_percpu(aux->prog->private_stack_ptr);
free_uid(aux->user);
security_bpf_prog_free(aux->prog);
bpf_prog_free(aux->prog);
--
2.43.0
* [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
@ 2024-07-18 20:52 ` Yonghong Song
2024-07-18 21:44 ` Yonghong Song
` (2 more replies)
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 3:33 ` Eduard Zingerman
2 siblings, 3 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-18 20:52 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
This patch intends to show some benchmark results comparing a bpf
program with vs. without private stack. The patch is not intended
to land since it hacks the existing kernel interface in order to
do a proper comparison. The bpf program is similar to
7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks")
where a raw_tp program is triggered with bpf_prog_test_run_opts() and
the raw_tp program has a loop around the helper bpf_get_numa_node_id(),
which triggers a fentry prog to run. The fentry prog calls three
do-nothing functions to maximally expose the cost of the private stack.
The following is the jited code for bpf prog in progs/private_stack.c
without private stack. The number of batch iterations is 4096.
subprog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 31 c0 xor eax,eax
15: c9 leave
16: c3 ret
main prog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 48 bf 00 e0 57 00 00 movabs rdi,0xffffc9000057e000
1a: c9 ff ff
1d: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
21: 48 83 c6 01 add rsi,0x1
25: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
29: e8 6e 00 00 00 call 0x9c
2e: e8 69 00 00 00 call 0x9c
33: e8 64 00 00 00 call 0x9c
38: 31 c0 xor eax,eax
3a: c9 leave
3b: c3 ret
The following are the jited progs with private stack:
subprog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
1a: 60 00 00
1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
24: 02 00
26: 31 c0 xor eax,eax
28: c9 leave
29: c3 ret
main prog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
1a: 60 00 00
1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
24: 02 00
26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
2d: c9 ff ff
30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
34: 48 83 c6 01 add rsi,0x1
38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
3c: 41 51 push r9
3e: e8 46 23 51 e1 call 0xffffffffe1512389
43: 41 59 pop r9
45: 41 51 push r9
47: e8 3d 23 51 e1 call 0xffffffffe1512389
4c: 41 59 pop r9
4e: 41 51 push r9
50: e8 34 23 51 e1 call 0xffffffffe1512389
55: 41 59 pop r9
57: 31 c0 xor eax,eax
59: c9 leave
5a: c3 ret
From the above, it is clear that for both subprog and main prog
we have some r9-related overhead, including retrieving the stack
pointer in the jit prologue code:
    movabs r9,0x607e08c1a688
    add    r9,QWORD PTR gs:0x21a00
plus 'push r9' and 'pop r9' around subprog calls.
I did some benchmarking on an intel box (Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz)
which has 20 cores and 80 cpus. The number of hits is in units
of loop iterations.
The following are two benchmark results; a few other tries show
similar results in terms of variation.
$ ./benchs/run_bench_private_stack.sh
no-private-stack-1: 2.152 ± 0.004M/s (drops 0.000 ± 0.000M/s)
private-stack-1: 2.226 ± 0.003M/s (drops 0.000 ± 0.000M/s)
no-private-stack-8: 89.086 ± 0.674M/s (drops 0.000 ± 0.000M/s)
private-stack-8: 90.023 ± 0.117M/s (drops 0.000 ± 0.000M/s)
no-private-stack-64: 1545.383 ± 3.574M/s (drops 0.000 ± 0.000M/s)
private-stack-64: 1534.630 ± 2.063M/s (drops 0.000 ± 0.000M/s)
no-private-stack-512: 14591.591 ± 15.202M/s (drops 0.000 ± 0.000M/s)
private-stack-512: 14323.796 ± 13.165M/s (drops 0.000 ± 0.000M/s)
no-private-stack-2048: 58680.977 ± 46.116M/s (drops 0.000 ± 0.000M/s)
private-stack-2048: 58614.699 ± 22.031M/s (drops 0.000 ± 0.000M/s)
no-private-stack-4096: 119974.497 ± 90.985M/s (drops 0.000 ± 0.000M/s)
private-stack-4096: 114841.949 ± 59.514M/s (drops 0.000 ± 0.000M/s)
$ ./benchs/run_bench_private_stack.sh
no-private-stack-1: 2.246 ± 0.002M/s (drops 0.000 ± 0.000M/s)
private-stack-1: 2.232 ± 0.005M/s (drops 0.000 ± 0.000M/s)
no-private-stack-8: 91.446 ± 0.055M/s (drops 0.000 ± 0.000M/s)
private-stack-8: 90.120 ± 0.069M/s (drops 0.000 ± 0.000M/s)
no-private-stack-64: 1578.374 ± 1.508M/s (drops 0.000 ± 0.000M/s)
private-stack-64: 1514.909 ± 3.898M/s (drops 0.000 ± 0.000M/s)
no-private-stack-512: 14767.811 ± 22.399M/s (drops 0.000 ± 0.000M/s)
private-stack-512: 14232.382 ± 227.217M/s (drops 0.000 ± 0.000M/s)
no-private-stack-2048: 58342.372 ± 81.519M/s (drops 0.000 ± 0.000M/s)
private-stack-2048: 54503.335 ± 160.199M/s (drops 0.000 ± 0.000M/s)
no-private-stack-4096: 117262.975 ± 179.802M/s (drops 0.000 ± 0.000M/s)
private-stack-4096: 114643.523 ± 146.956M/s (drops 0.000 ± 0.000M/s)
It is clear that private-stack is worse than no-private-stack by up to
close to 5 percent. This can be roughly estimated from the above jited
code (no-private-stack vs. private-stack).
Although the benchmark shows up to a 5% potential slowdown with the
private stack, in reality the kernel enables the private stack only for
stack sizes beyond 64 bytes, which means the bpf prog will be doing some
useful work. If the bpf prog uses any helpers/kfuncs, the push/pop r9
overhead should be minimal compared to the overhead of the helper/kfunc
itself. If the prog does not use a lot of helpers/kfuncs, there is no
push/pop r9 and the performance should be reasonable too.
With 4096 loop iterations per program run, I got
$ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
18.47% bench [k]
17.29% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
13.33% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
11.86% bench [kernel.vmlinux] [k] migrate_enable
11.60% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
11.42% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
7.87% bench [kernel.vmlinux] [k] migrate_disable
3.71% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
3.67% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
0.04% bench bench [.] btf_validate_type
$ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
18.94% bench [k]
16.88% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
15.77% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
11.70% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
11.48% bench [kernel.vmlinux] [k] migrate_enable
11.30% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
5.85% bench [kernel.vmlinux] [k] migrate_disable
3.69% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
3.56% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
0.06% bench bench [.] bpf_prog_test_run_opts
NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
I suspect the top 18.47%/18.94% entries in the perf runs are probably due to the fentry prog
bench_trigger_fentry_batch, considering that even the subprog func1 takes 13.33%/16.88% of the time.
Overall the bpf progs, including the trampoline, take more than 50% of the time.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 5 +-
include/linux/bpf.h | 3 +-
include/uapi/linux/bpf.h | 3 +
kernel/bpf/core.c | 3 +-
kernel/bpf/syscall.c | 4 +-
kernel/bpf/verifier.c | 1 +
tools/include/uapi/linux/bpf.h | 3 +
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 6 +
.../bpf/benchs/bench_private_stack.c | 149 ++++++++++++++++++
.../bpf/benchs/run_bench_private_stack.sh | 11 ++
.../selftests/bpf/progs/private_stack.c | 37 +++++
12 files changed, 222 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_private_stack.c
create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh
create mode 100644 tools/testing/selftests/bpf/progs/private_stack.c
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 60f5d86fb6aa..3f12f7a957ba 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -3327,7 +3327,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
skip_init_addrs:
if (bpf_enable_private_stack(prog) && !prog->private_stack_ptr) {
- private_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 8, GFP_KERNEL);
+ if (prog->aux->stack_depth == 0)
+ private_stack_ptr = __alloc_percpu_gfp(8, 8, GFP_KERNEL);
+ else
+ private_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 8, GFP_KERNEL);
if (!private_stack_ptr) {
prog = orig_prog;
goto out_addrs;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 19a3f5355363..2f8708465c19 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1551,7 +1551,8 @@ struct bpf_prog {
call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
call_get_func_ip:1, /* Do we call get_func_ip() */
tstamp_type_access:1, /* Accessed __sk_buff->tstamp_type */
- sleepable:1; /* BPF program is sleepable */
+ sleepable:1, /* BPF program is sleepable */
+ disable_private_stack:1; /* Disable private stack */
enum bpf_prog_type type; /* Type of BPF program */
enum bpf_attach_type expected_attach_type; /* For some prog types */
u32 len; /* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 35bcf52dbc65..98af8ea8a4d6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1409,6 +1409,9 @@ enum {
/* Do not translate kernel bpf_arena pointers to user pointers */
BPF_F_NO_USER_CONV = (1U << 18),
+
+/* Disable private stack */
+ BPF_F_DISABLE_PRIVATE_STACK = (1U << 19),
};
/* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f69eb0c5fe03..b5d33cf87695 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2815,14 +2815,13 @@ EXPORT_SYMBOL_GPL(bpf_prog_free);
bool bpf_enable_private_stack(struct bpf_prog *prog)
{
- if (prog->aux->stack_depth <= 64)
+ if (prog->disable_private_stack)
return false;
switch (prog->aux->prog->type) {
case BPF_PROG_TYPE_KPROBE:
case BPF_PROG_TYPE_TRACEPOINT:
case BPF_PROG_TYPE_PERF_EVENT:
- case BPF_PROG_TYPE_RAW_TRACEPOINT:
return true;
case BPF_PROG_TYPE_TRACING:
if (prog->expected_attach_type != BPF_TRACE_ITER)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 89162ddb4747..bb2b632c9c2c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2715,7 +2715,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
BPF_F_XDP_HAS_FRAGS |
BPF_F_XDP_DEV_BOUND_ONLY |
BPF_F_TEST_REG_INVARIANTS |
- BPF_F_TOKEN_FD))
+ BPF_F_TOKEN_FD |
+ BPF_F_DISABLE_PRIVATE_STACK))
return -EINVAL;
bpf_prog_load_fixup_attach_type(attr);
@@ -2828,6 +2829,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
prog->expected_attach_type = attr->expected_attach_type;
prog->sleepable = !!(attr->prog_flags & BPF_F_SLEEPABLE);
+ prog->disable_private_stack = !!(attr->prog_flags & BPF_F_DISABLE_PRIVATE_STACK);
prog->aux->attach_btf = attach_btf;
prog->aux->attach_btf_id = attr->attach_btf_id;
prog->aux->dst_prog = dst_prog;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8da132a1ef28..397b2b9eed24 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19442,6 +19442,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
goto out_free;
func[i]->is_func = 1;
func[i]->sleepable = prog->sleepable;
+ func[i]->disable_private_stack = prog->disable_private_stack;
func[i]->aux->func_idx = i;
/* Below members will be freed only at prog->aux */
func[i]->aux->btf = prog->aux->btf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 35bcf52dbc65..98af8ea8a4d6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1409,6 +1409,9 @@ enum {
/* Do not translate kernel bpf_arena pointers to user pointers */
BPF_F_NO_USER_CONV = (1U << 18),
+
+/* Disable private stack */
+ BPF_F_DISABLE_PRIVATE_STACK = (1U << 19),
};
/* Flags for BPF_PROG_QUERY. */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index dd49c1d23a60..44a6a43da71c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -733,6 +733,7 @@ $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.ske
$(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
$(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
+$(OUTPUT)/bench_private_stack.o: $(OUTPUT)/private_stack.skel.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -753,6 +754,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_local_storage_create.o \
$(OUTPUT)/bench_htab_mem.o \
$(OUTPUT)/bench_bpf_crypto.o \
+ $(OUTPUT)/bench_private_stack.o \
#
$(call msg,BINARY,,$@)
$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 627b74ae041b..4f4867cd80f9 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -282,6 +282,7 @@ extern struct argp bench_local_storage_create_argp;
extern struct argp bench_htab_mem_argp;
extern struct argp bench_trigger_batch_argp;
extern struct argp bench_crypto_argp;
+extern struct argp bench_private_stack_argp;
static const struct argp_child bench_parsers[] = {
{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -296,6 +297,7 @@ static const struct argp_child bench_parsers[] = {
{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
+ { &bench_private_stack_argp, 0, "bpf private stack benchmark", 0 },
{},
};
@@ -542,6 +544,8 @@ extern const struct bench bench_local_storage_create;
extern const struct bench bench_htab_mem;
extern const struct bench bench_crypto_encrypt;
extern const struct bench bench_crypto_decrypt;
+extern const struct bench bench_no_private_stack;
+extern const struct bench bench_private_stack;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -596,6 +600,8 @@ static const struct bench *benchs[] = {
&bench_htab_mem,
&bench_crypto_encrypt,
&bench_crypto_decrypt,
+ &bench_no_private_stack,
+ &bench_private_stack,
};
static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_private_stack.c b/tools/testing/selftests/bpf/benchs/bench_private_stack.c
new file mode 100644
index 000000000000..5ae1e9bfe706
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_private_stack.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include <argp.h>
+#include "bench.h"
+#include "private_stack.skel.h"
+
+static struct ctx {
+ struct private_stack *skel;
+} ctx;
+
+static struct {
+ long nr_batch_iters;
+} args = {
+ .nr_batch_iters = 0,
+};
+
+enum {
+ ARG_NR_BATCH_ITERS = 3000,
+};
+
+static const struct argp_option opts[] = {
+ { "nr-batch-iters", ARG_NR_BATCH_ITERS, "NR_BATCH_ITERS",
+ 0, "nr batch iters" },
+ {},
+};
+
+static error_t private_stack_parse_arg(int key, char *arg, struct argp_state *state)
+{
+ long ret;
+
+ switch (key) {
+ case ARG_NR_BATCH_ITERS:
+ ret = strtoul(arg, NULL, 10);
+ if (ret < 1)
+ argp_usage(state);
+ args.nr_batch_iters = ret;
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+
+ return 0;
+}
+
+const struct argp bench_private_stack_argp = {
+ .options = opts,
+ .parser = private_stack_parse_arg,
+};
+
+static void private_stack_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr,
+ "The private stack benchmarks do not support consumer\n");
+ exit(1);
+ }
+}
+
+static void common_setup(bool disable_private_stack)
+{
+ struct private_stack *skel;
+ struct bpf_link *link;
+ __u32 old_flags;
+ int err;
+
+ skel = private_stack__open();
+	if (!skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+ ctx.skel = skel;
+
+ if (disable_private_stack) {
+ old_flags = bpf_program__flags(skel->progs.bench_trigger_fentry_batch);
+ bpf_program__set_flags(skel->progs.bench_trigger_fentry_batch, old_flags | BPF_F_DISABLE_PRIVATE_STACK);
+ }
+
+ skel->rodata->batch_iters = args.nr_batch_iters;
+
+ err = private_stack__load(skel);
+ if (err) {
+ fprintf(stderr, "failed to load program\n");
+ exit(1);
+ }
+
+ link = bpf_program__attach(skel->progs.bench_trigger_fentry_batch);
+ if (!link) {
+ fprintf(stderr, "failed to attach program bench_trigger_fentry_batch\n");
+ exit(1);
+ }
+}
+
+static void no_private_stack_setup(void)
+{
+ common_setup(true);
+}
+
+static void private_stack_setup(void)
+{
+ common_setup(false);
+}
+
+static void private_stack_measure(struct bench_res *res)
+{
+ struct private_stack *skel = ctx.skel;
+ unsigned long total_hits = 0;
+ static unsigned long last_hits;
+
+ total_hits = skel->bss->hits * skel->rodata->batch_iters;
+ res->hits = total_hits - last_hits;
+ res->drops = 0;
+ res->false_hits = 0;
+ last_hits = total_hits;
+}
+
+static void *private_stack_producer(void *unused)
+{
+ struct private_stack *skel = ctx.skel;
+ int fd;
+
+ fd = bpf_program__fd(skel->progs.trigger_driver);
+ while (true)
+ bpf_prog_test_run_opts(fd, NULL);
+
+ return NULL;
+}
+
+const struct bench bench_no_private_stack = {
+ .name = "no-private-stack",
+ .argp = &bench_private_stack_argp,
+ .validate = private_stack_validate,
+ .setup = no_private_stack_setup,
+ .producer_thread = private_stack_producer,
+ .measure = private_stack_measure,
+ .report_progress = hits_drops_report_progress,
+ .report_final = hits_drops_report_final,
+};
+
+const struct bench bench_private_stack = {
+ .name = "private-stack",
+ .argp = &bench_private_stack_argp,
+ .validate = private_stack_validate,
+ .setup = private_stack_setup,
+ .producer_thread = private_stack_producer,
+ .measure = private_stack_measure,
+ .report_progress = hits_drops_report_progress,
+ .report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh b/tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh
new file mode 100755
index 000000000000..692a5f9676a7
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for b in 1 8 64 512 2048 4096; do
+ summarize "no-private-stack-${b}: " "$($RUN_BENCH --nr-batch-iters=${b} no-private-stack)"
+ summarize "private-stack-${b}: " "$($RUN_BENCH --nr-batch-iters=${b} private-stack)"
+done
diff --git a/tools/testing/selftests/bpf/progs/private_stack.c b/tools/testing/selftests/bpf/progs/private_stack.c
new file mode 100644
index 000000000000..81d2efad5890
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/private_stack.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+unsigned long hits = 0;
+const volatile int batch_iters = 0;
+
+SEC("raw_tp")
+int trigger_driver(void *ctx)
+{
+ int i;
+
+ for (i = 0; i < batch_iters; i++)
+ (void)bpf_get_numa_node_id(); /* attach point for benchmarking */
+
+ return 0;
+}
+
+__attribute__((weak)) int func1(void) {
+ return 0;
+}
+
+SEC("fentry/bpf_get_numa_node_id")
+int bench_trigger_fentry_batch(void *ctx)
+{
+ hits++;
+ (void)func1();
+ (void)func1();
+ (void)func1();
+ return 0;
+}
+
--
2.43.0
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
@ 2024-07-18 21:44 ` Yonghong Song
2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
2024-07-19 0:36 ` Alexei Starovoitov
2024-07-20 0:14 ` bot+bpf-ci
2024-07-20 1:08 ` Alexei Starovoitov
2 siblings, 2 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-18 21:44 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
On 7/18/24 1:52 PM, Yonghong Song wrote:
> This patch intends to show some benchmark results comparing a bpf
> program with vs. without private stack.
> [...]
>
> NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
I tried with perf built with latest bpf-next and with no-private-stack, the issue still
exists. Will debug more.
[...]
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 21:44 ` Yonghong Song
@ 2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
2024-07-19 3:01 ` Yonghong Song
2024-07-19 0:36 ` Alexei Starovoitov
1 sibling, 1 reply; 33+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-07-18 21:59 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On Thu, 18 Jul 2024 at 23:44, Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 7/18/24 1:52 PM, Yonghong Song wrote:
> > This patch intends to show some benchmark results comparing a bpf
> > program with vs. without private stack.
> > [...]
> >
> > NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
>
> I tried with perf built with latest bpf-next and with no-private-stack, the issue still
> exists. Will debug more.
>
Just as an aside, but if this doesn't work, I think you can have a
better signal-to-noise ratio if you try enabling the private stack for
XDP programs and just set up two machines, with a client sending
traffic to the other, and run xdp-bench [0] on the server. I think you
should observe measurable differences in throughput for
nanosecond-scale changes, especially in programs like drop which do
very little.
[0]: https://github.com/xdp-project/xdp-tools
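Something like the following on the server side would do (a sketch;
eth0 stands in for whatever interface receives the generated traffic):

    # from the xdp-tools repo: attach an XDP program that drops every
    # packet on eth0 and report packets-per-second
    ./xdp-bench drop eth0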
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 21:44 ` Yonghong Song
2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
@ 2024-07-19 0:36 ` Alexei Starovoitov
2024-07-19 2:21 ` Yonghong Song
1 sibling, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-19 0:36 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On Thu, Jul 18, 2024 at 2:44 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> > $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
> > 18.94% bench [k]
> > 16.88% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
> > 15.77% bench bpf_trampoline_6442522961 [k]
...
> > NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
>
> I tried with perf built with latest bpf-next and with no-private-stack, the issue still
> exists. Will debug more.
Try this fix:
https://lore.kernel.org/all/20240714065533.1112616-1-houtao@huaweicloud.com/
btw you were cc-ed on it. your @ fb goes to spam ? ;)
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-19 0:36 ` Alexei Starovoitov
@ 2024-07-19 2:21 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-19 2:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/18/24 5:36 PM, Alexei Starovoitov wrote:
> On Thu, Jul 18, 2024 at 2:44 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
>>> 18.94% bench [k]
>>> 16.88% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
>>> 15.77% bench bpf_trampoline_6442522961 [k]
> ...
>
>>> NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
>> I tried with perf built with latest bpf-next and with no-private-stack, the issue still
>> exists. Will debug more.
> Try this fix:
> https://lore.kernel.org/all/20240714065533.1112616-1-houtao@huaweicloud.com/
It does fix the problem. The 'perf report' for private-stack flavor:
18.94% bench bpf_prog_71f1b7d5309b304a_bench_trigger_fentry_batch [k] bpf_prog_71f1b7d5309b304a_bench_trigger_fentry_batch
16.50% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
15.75% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
11.72% bench [kernel.vmlinux] [k] migrate_enable
11.63% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
11.37% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
6.17% bench [kernel.vmlinux] [k] migrate_disable
3.59% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
3.51% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
0.05% bench bench [.] bpf_prog_test_run_opts
>
> btw you were cc-ed on it. your @ fb goes to spam ? ;)
It is in my inbox. Sadly I skipped it since it is for the perf system and I was
focused on several of my own patches in the last few days.
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
@ 2024-07-19 3:01 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-19 3:01 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On 7/18/24 2:59 PM, Kumar Kartikeya Dwivedi wrote:
> On Thu, 18 Jul 2024 at 23:44, Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 7/18/24 1:52 PM, Yonghong Song wrote:
>>> This patch intends to show some benchmark results comparing a bpf
>>> program with vs. without private stack. The patch is not intended
>>> to land since it hacks existing kernel interface in order to
>>> do proper comparison. The bpf program is similar to
>>> 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks")
>>> where a raw_tp program is triggered with bpf_prog_test_run_opts() and
>>> the raw_tp program has a loop of helper bpf_get_numa_node_id() which
>>> will enable a fentry prog to run. The fentry prog calls three
>>> do-nothing functions to maximumly expose the cost of private stack.
>>>
>>> The following is the jited code for bpf prog in progs/private_stack.c
>>> without private stack. The number of batch iterations is 4096.
>>>
>>> subprog:
>>> 0: f3 0f 1e fa endbr64
>>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>>> 9: 66 90 xchg ax,ax
>>> b: 55 push rbp
>>> c: 48 89 e5 mov rbp,rsp
>>> f: f3 0f 1e fa endbr64
>>> 13: 31 c0 xor eax,eax
>>> 15: c9 leave
>>> 16: c3 ret
>>>
>>> main prog:
>>> 0: f3 0f 1e fa endbr64
>>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>>> 9: 66 90 xchg ax,ax
>>> b: 55 push rbp
>>> c: 48 89 e5 mov rbp,rsp
>>> f: f3 0f 1e fa endbr64
>>> 13: 48 bf 00 e0 57 00 00 movabs rdi,0xffffc9000057e000
>>> 1a: c9 ff ff
>>> 1d: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
>>> 21: 48 83 c6 01 add rsi,0x1
>>> 25: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
>>> 29: e8 6e 00 00 00 call 0x9c
>>> 2e: e8 69 00 00 00 call 0x9c
>>> 33: e8 64 00 00 00 call 0x9c
>>> 38: 31 c0 xor eax,eax
>>> 3a: c9 leave
>>> 3b: c3 ret
>>>
>>> The following are the jited progs with private stack:
>>>
>>> subprog:
>>> 0: f3 0f 1e fa endbr64
>>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>>> 9: 66 90 xchg ax,ax
>>> b: 55 push rbp
>>> c: 48 89 e5 mov rbp,rsp
>>> f: f3 0f 1e fa endbr64
>>> 13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
>>> 1a: 60 00 00
>>> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
>>> 24: 02 00
>>> 26: 31 c0 xor eax,eax
>>> 28: c9 leave
>>> 29: c3 ret
>>>
>>> main prog:
>>> 0: f3 0f 1e fa endbr64
>>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>>> 9: 66 90 xchg ax,ax
>>> b: 55 push rbp
>>> c: 48 89 e5 mov rbp,rsp
>>> f: f3 0f 1e fa endbr64
>>> 13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
>>> 1a: 60 00 00
>>> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
>>> 24: 02 00
>>> 26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
>>> 2d: c9 ff ff
>>> 30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
>>> 34: 48 83 c6 01 add rsi,0x1
>>> 38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
>>> 3c: 41 51 push r9
>>> 3e: e8 46 23 51 e1 call 0xffffffffe1512389
>>> 43: 41 59 pop r9
>>> 45: 41 51 push r9
>>> 47: e8 3d 23 51 e1 call 0xffffffffe1512389
>>> 4c: 41 59 pop r9
>>> 4e: 41 51 push r9
>>> 50: e8 34 23 51 e1 call 0xffffffffe1512389
>>> 55: 41 59 pop r9
>>> 57: 31 c0 xor eax,eax
>>> 59: c9 leave
>>> 5a: c3 ret
>>>
>>> From the above, it is clear for subprog and main prog,
>>> we have some r9 related overhead including retriving the stack
>>> in the jit prelog code:
>>> movabs r9,0x607e08c1a688
>>> add r9,QWORD PTR gs:0x21a00
>>> and 'push r9' and 'pop r9' around subprog calls.
>>>
>>> I did some benchmarking on an intel box (Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz)
>>> which has 20 cores and 80 cpus. The number of hits are in the unit
>>> of loop iterations.
>>>
>>> The following are two benchmark results and a few other tries show
>>> similar results in terms of variation.
>>> $ ./benchs/run_bench_private_stack.sh
>>> no-private-stack-1: 2.152 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-1: 2.226 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-8: 89.086 ± 0.674M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-8: 90.023 ± 0.117M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-64: 1545.383 ± 3.574M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-64: 1534.630 ± 2.063M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-512: 14591.591 ± 15.202M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-512: 14323.796 ± 13.165M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-2048: 58680.977 ± 46.116M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-2048: 58614.699 ± 22.031M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-4096: 119974.497 ± 90.985M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-4096: 114841.949 ± 59.514M/s (drops 0.000 ± 0.000M/s)
>>> $ ./benchs/run_bench_private_stack.sh
>>> no-private-stack-1: 2.246 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-1: 2.232 ± 0.005M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-8: 91.446 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-8: 90.120 ± 0.069M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-64: 1578.374 ± 1.508M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-64: 1514.909 ± 3.898M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-512: 14767.811 ± 22.399M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-512: 14232.382 ± 227.217M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-2048: 58342.372 ± 81.519M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-2048: 54503.335 ± 160.199M/s (drops 0.000 ± 0.000M/s)
>>> no-private-stack-4096: 117262.975 ± 179.802M/s (drops 0.000 ± 0.000M/s)
>>> private-stack-4096: 114643.523 ± 146.956M/s (drops 0.000 ± 0.000M/s)
>>>
>>> It is clear that private-stack is worse than no-private-stack by up to
>>> close to 5 percent. This can be roughly estimated from the above jited
>>> code comparison of no-private-stack vs. private-stack.
>>>
>>> Although the benchmark shows up to 5% potential slowdown with private stack,
>>> in reality the kernel enables the private stack only for stack size >= 64,
>>> which means the bpf prog will do some useful work. If the bpf prog uses any
>>> helper/kfunc, the push/pop r9 overhead should be minimal compared to the
>>> overhead of the helper/kfunc itself. If the prog uses few or no
>>> helpers/kfuncs, there is little or no push/pop r9 overhead and the
>>> performance should be reasonable too.
>>>
>>> With 4096 loop iterations per program run, I got
>>> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
>>> 18.47% bench [k]
>>> 17.29% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
>>> 13.33% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
>>> 11.86% bench [kernel.vmlinux] [k] migrate_enable
>>> 11.60% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
>>> 11.42% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
>>> 7.87% bench [kernel.vmlinux] [k] migrate_disable
>>> 3.71% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
>>> 3.67% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
>>> 0.04% bench bench [.] btf_validate_type
>>>
>>> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
>>> 18.94% bench [k]
>>> 16.88% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
>>> 15.77% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
>>> 11.70% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
>>> 11.48% bench [kernel.vmlinux] [k] migrate_enable
>>> 11.30% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
>>> 5.85% bench [kernel.vmlinux] [k] migrate_disable
>>> 3.69% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
>>> 3.56% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
>>> 0.06% bench bench [.] bpf_prog_test_run_opts
>>>
>>> NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
>> I tried perf built with the latest bpf-next and with no-private-stack; the issue still
>> exists. Will debug more.
>>
> Just as an aside, but if this doesn't work, I think you can get a
> better signal-to-noise ratio if you try enabling the private stack for
> XDP programs: just set up two machines, with a client sending
> traffic to the other, and run xdp-bench [0] on the server. I think you
> should observe measurable differences in throughput for
> nanosecond-scale changes, especially in programs like drop, which do
> very little.
>
> [0]: https://github.com/xdp-project/xdp-tools
Currently the private stack cannot be used for xdp progs since xdp is
really performance critical and we do not want even the slightest
performance loss there at this point. So the private stack focuses on
tracing-related programs, as they can easily end up nested when quite a
few tracing progs are active at the same time. If we did apply the
private stack to xdp programs, I would expect some performance loss; how
much depends on the bpf prog itself (the bpf prog code and its
helpers/kfuncs, see below). The example in this patch shows overall
performance degradation of around 5%. But considering that around 50% of
the cycles are spent outside the bpf programs, the degradation of the
bpf progs themselves might be around 10% compared to non-private-stack
bpf programs (roughly, 50% bpf share x 10% bpf slowdown ~= 5% overall).
But this is an extreme example. In reality, the stack size needs to be
>= 64 bytes, so the bpf program must do some meaningful work, and it may
also call some helpers/kfuncs. The helpers/kfuncs will introduce
push/pop operations, but if a helper/kfunc does some meaningful work,
then the relative performance hit from the additional push/pop should
be small.
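As a concrete illustration, a prog like the sketch below would cross the
64-byte threshold and pay the push/pop r9 cost only around its helper call
(a minimal sketch; the section name and helper choice are arbitrary, not
taken from the patch):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  char LICENSE[] SEC("license") = "GPL";

  SEC("tp/syscalls/sys_enter_write")
  int prog(void *ctx)
  {
          /* 64 bytes of verifier-visible stack, enough to trigger
           * the private stack
           */
          char comm[64];

          /* helper call: jited as push r9; call <helper>; pop r9 */
          bpf_get_current_comm(comm, sizeof(comm));
          return comm[0];     /* keep the buffer live */
  }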
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44 ` Yonghong Song
@ 2024-07-20 0:14 ` bot+bpf-ci
2024-07-20 1:08 ` Alexei Starovoitov
2 siblings, 0 replies; 33+ messages in thread
From: bot+bpf-ci @ 2024-07-20 0:14 UTC (permalink / raw)
To: yonghong.song; +Cc: bpf, kernel-ci
Dear patch submitter,
CI has tested the following submission:
Status: FAILURE
Name: [bpf-next,v2,2/2,no_merge] selftests/bpf: Benchmark runtime performance with private stack
Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=872349&state=*
Matrix: https://github.com/kernel-patches/bpf/actions/runs/10015580432
Failed jobs:
test_maps-s390x-gcc: https://github.com/kernel-patches/bpf/actions/runs/10015580432/job/27687366542
Please note: this email is coming from an unmonitored mailbox. If you have
questions or feedback, please reach out to the Meta Kernel CI team at
kernel-ci@meta.com.
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44 ` Yonghong Song
2024-07-20 0:14 ` bot+bpf-ci
@ 2024-07-20 1:08 ` Alexei Starovoitov
2024-07-22 16:33 ` Yonghong Song
2 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-20 1:08 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> The following are the jited progs with private stack:
>
> subprog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 31 c0 xor eax,eax
> 28: c9 leave
> 29: c3 ret
Thanks for doing the benchmarking.
It's clear now that the worst case overhead is ~5%.
Could you do one more benchmark such that the 'main prog'
below stays as-is, with the r9 setup and push/pop r9,
but the subprog above has no 'movabs r9 + add r9'?
This would simulate the case when a big function with a large stack
triggers private-stack use, but it calls a subprog without
a private stack.
I think we should see a different overhead.
Obviously the subprog won't have the two extra insns that set up r9,
which would lead to something like ~4% slowdown vs 5%,
but I feel the overhead of the pure push/pop r9 around calls
will be lower as well, because r9 is not written to inside the subprog.
The CPU HW should be able to execute such push/pop faster.
I'm curious what it is.
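To be concrete, the subprog would then look like this (the dump above
minus the two r9 setup insns; offsets omitted, just a sketch):

  endbr64
  nop    DWORD PTR [rax+rax*1+0x0]
  xchg   ax,ax
  push   rbp
  mov    rbp,rsp
  endbr64
  xor    eax,eax
  leave
  ret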
> main prog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
> 2d: c9 ff ff
> 30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
> 34: 48 83 c6 01 add rsi,0x1
> 38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
> 3c: 41 51 push r9
> 3e: e8 46 23 51 e1 call 0xffffffffe1512389
> 43: 41 59 pop r9
> 45: 41 51 push r9
> 47: e8 3d 23 51 e1 call 0xffffffffe1512389
> 4c: 41 59 pop r9
> 4e: 41 51 push r9
> 50: e8 34 23 51 e1 call 0xffffffffe1512389
> 55: 41 59 pop r9
> 57: 31 c0 xor eax,eax
> 59: c9 leave
> 5a: c3 ret
>
Also pls share 'perf annotate' of JIT-ed asm.
I wonder where the hotspots are in the code.
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
@ 2024-07-20 3:28 ` Andrii Nakryiko
2024-07-22 16:43 ` Yonghong Song
2024-07-22 20:57 ` Andrii Nakryiko
2024-07-22 3:33 ` Eduard Zingerman
2 siblings, 2 replies; 33+ messages in thread
From: Andrii Nakryiko @ 2024-07-20 3:28 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> The main motivation for private stack comes from nested
> scheduler in sched-ext from Tejun. The basic idea is that
> - each cgroup will its own associated bpf program,
> - bpf program with parent cgroup will call bpf programs
> in immediate child cgroups.
>
> Let us say we have the following cgroup hierarchy:
> root_cg (prog0):
> cg1 (prog1):
> cg11 (prog11):
> cg111 (prog111)
> cg112 (prog112)
> cg12 (prog12):
> cg121 (prog121)
> cg122 (prog122)
> cg2 (prog2):
> cg21 (prog21)
> cg22 (prog22)
> cg23 (prog23)
>
> In the above example, prog0 will call a kfunc which will
> call prog1 and prog2 to get sched info for cg1 and cg2 and
> then the information is summarized and sent back to prog0.
> Similarly, prog11 and prog12 will be invoked in the kfunc
> and the result will be summarized and sent back to prog1, etc.
>
> Currently, for each thread, the x86 kernel allocate 8KB stack.
> The each bpf program (including its subprograms) has maximum
> 512B stack size to avoid potential stack overflow.
> And nested bpf programs increase the risk of stack overflow.
> To avoid potential stack overflow caused by bpf programs,
> this patch implemented a private stack so bpf program stack
> space is allocated dynamically when the program is jited.
> Such private stack is applied to tracing programs like
> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> tracing.
>
> But more than one instance of the same bpf program may
> run in the system. To make things simple, percpu private
> stack is allocated for each program, so if the same program
> is running on different cpus concurrently, we won't have
> any issue. Note that the kernel already have logic to prevent
> the recursion for the same bpf program on the same cpu
> (kprobe, fentry, etc.).
>
> The patch implemented a percpu private stack based approach
> for x86 arch.
> - The stack size will be 0 and any stack access is from
> jit-time allocated percpu storage.
> - In the beginning of jit, r9 is used to save percpu
> private stack pointer.
> - Each rbp in the bpf asm insn is replaced by r9.
> - For each call, push r9 before the call and pop r9
> after the call to preserve r9 value.
>
> Compared to previous RFC patch [1], this patch added
> some conditions to enable private stack, e.g., verifier
> calculated stack size, prog type, etc. The new patch
> also added a performance test to compare private stack
> vs. no private stack.
>
> The following are some code example to illustrate the idea
> for selftest cgroup_skb_sk_lookup:
>
> the existing code the private-stack approach code
> endbr64 endbr64
> nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0]
> xchg ax,ax xchg ax,ax
> push rbp push rbp
> mov rbp,rsp mov rbp,rsp
> endbr64 endbr64
> sub rsp,0x68
> push rbx push rbx
> ... ...
> ... mov r9d,0x8c1c860
> ... add r9,QWORD PTR gs:0x21a00
> ... ...
> mov rdx,rbp mov rdx, r9
> add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4
> ... ...
> mov ecx,0x28 mov ecx,0x28
> push r9
> call 0xffffffffe305e474 call 0xffffffffe305e524
> pop r9
> mov rdi,rax mov rdi,rax
> ... ...
> movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46]
> ... ...
>
Eduard nerd-sniped me today with this a bit... :)
I have a few questions and suggestions.
So it seems like each *subprogram* (not the entire BPF program) gets
its own per-CPU private stack allocation. Is that intentional? That
seems a bit unnecessary. It also prevents any sort of actual
recursion. I'm not sure if it's possible to write a recursive BPF
subprogram today; the verifier seems to reject obvious limited
recursion cases. But still, eventually we might need/want to support
that, and this would be just another hurdle to overcome (so it's best
to avoid adding it in the first place).
I'm sure Eduard is going to try something like below and it will
probably break badly (I haven't tried, sorry):
int entry(void *ctx);
struct {
__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
__uint(max_entries, 1);
__uint(key_size, sizeof(__u32));
__array(values, int (void *));
} prog_array_init SEC(".maps") = {
.values = {
[0] = (void *)&entry,
},
};
static __noinline int subprog1(void *ctx)
{
<some state on the stack>
/* here entry will replace subprog1, and so we'll have
* entry -> entry -> entry -> ..... <tail call limit> -> subprog1
*/
bpf_tail_call(ctx, &prog_array_init, 0);
return 0;
}
SEC("raw_tp/sys_enter")
int entry(void *ctx)
{
<some state on the stack>
subprog1(ctx);
}
And we effectively have limited recursion where the entry's stack
state is clobbered, no?
So it seems like we need to support recursion.
So, the question I have is: why not do the following:
a) only setup r9 *once* in entry program's prologue (before tail call
jump target)
b) before each call we can adjust r9 with current prog/subprog's
maximum *own* stack, something like:
push r9;
r9 += 128; // 128 is subprog's stack usage
call <some-subprog>
pop r9;
The idea being that on tail call or in subprog call we assume r9 is
already pointing to the right place. We can probably also figure out
how to avoid push/pop r9 if we make sure that the subprogram always
restores r9 (taking tail calls into account and all that, of course)?
Is this feasible?
Another question I have is whether it would be possible to just plainly
set rbp to the private stack and keep using rbp in such a way that stack
traces are preserved. I.e., save the return address on the private stack
so unwinders can correctly jump back to the kernel's stack?
How stupid is what I propose above?
> So the number of insns is increased by 1 + num_of_calls * 2.
> Here the number of calls are those calls in the final jited binary.
> Comparing function call itself, the push/pop overhead should be
> minimum in most common cases.
>
> Our original use case is for sched-ext nested scheduler. This will be done
> in the future.
>
> [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
> include/linux/bpf.h | 2 ++
> kernel/bpf/core.c | 20 ++++++++++++
> kernel/bpf/syscall.c | 1 +
> 4 files changed, 82 insertions(+), 4 deletions(-)
>
[...]
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
@ 2024-07-22 3:33 ` Eduard Zingerman
2024-07-22 16:54 ` Yonghong Song
` (2 more replies)
2 siblings, 3 replies; 33+ messages in thread
From: Eduard Zingerman @ 2024-07-22 3:33 UTC (permalink / raw)
To: Yonghong Song, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
Hi Yonghong,
In general I think that the changes in this patch are logical and make sense.
I have a suggestion regarding testing JIT-related changes.
We currently lack a convenient way to verify jit behaviour modulo
runtime tests. I think we should have the capability to write tests like below:
SEC("tp")
__jited_x86("f: endbr64")
__jited_x86("13: movabs $0x.*,%r9")
__jited_x86("1d: add %gs:0x.*,%r9")
__jited_x86("26: mov $0x1,%edi")
__jited_x86("2b: mov %rdi,-0x8(%r9)")
__jited_x86("2f: mov -0x8(%r9),%rdi")
__jited_x86("33: xor %eax,%eax")
__jited_x86("35: lock xchg %rax,-0x8(%r9)")
__jited_x86("3a: lock xadd %rax,-0x8(%r9)")
__naked void stack_access_insns(void)
{
asm volatile (
"r1 = 1;"
"*(u64 *)(r10 - 8) = r1;"
"r1 = *(u64 *)(r10 - 8);"
"r0 = 0;"
"r0 = xchg_64(r10 - 8, r0);"
"r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
"exit;"
::: __clobber_all);
}
In the following branch I explored a way to add such capability:
https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
Besides testing the exact translation, such tests also provide a good
starting point for people trying to figure out how some jit features work.
The below two commits are the gist of the feature:
8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
an alternative would be to:
a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
test_progs executable;
b. call libbfd (binutils dis-assembler) directly from the bpftool.
Currently bpftool can use two dis-assemblers: libbfd and llvm library,
depending on the build environment. For CI builds libbfd is used.
I don't know if llvm and libbfd always produce same output for
identical binary code. Imo, if people are Ok with adding libbfd
dependency to test_progs, option (b) is the best. If folks on the
mailing list agree with this, I can work on updating the patches.
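For reference, the popen() approach boils down to something like the
sketch below (simplified, not the exact code from the branch; assumes
<stdio.h>):

  static int get_jited_disasm(int prog_id, char *buf, size_t buf_sz)
  {
          char cmd[64];
          size_t off = 0, n;
          FILE *f;

          snprintf(cmd, sizeof(cmd), "bpftool prog dump jited id %d", prog_id);
          f = popen(cmd, "r");
          if (!f)
                  return -1;
          /* slurp the disassembly text printed by bpftool */
          while ((n = fread(buf + off, 1, buf_sz - 1 - off, f)) > 0)
                  off += n;
          buf[off] = '\0';
          return pclose(f) == 0 ? 0 : -1;
  }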
-------------
Aside from testing, I agree with Andrii regarding rbp usage;
it seems like it should be possible to do the following in the prologue:
movabs $0x...,%rsp
add %gs:0x...,%rsp
push %rbp
and there would be no need to modify the translation for instructions
accessing r10; plus, the debugger stack unwinding logic should still work.
Or am I mistaken?
Thanks,
Eduard
* Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
2024-07-20 1:08 ` Alexei Starovoitov
@ 2024-07-22 16:33 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-22 16:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/19/24 6:08 PM, Alexei Starovoitov wrote:
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> The following are the jited progs with private stack:
>>
>> subprog:
>> 0: f3 0f 1e fa endbr64
>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>> 9: 66 90 xchg ax,ax
>> b: 55 push rbp
>> c: 48 89 e5 mov rbp,rsp
>> f: f3 0f 1e fa endbr64
>> 13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
>> 1a: 60 00 00
>> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
>> 24: 02 00
>> 26: 31 c0 xor eax,eax
>> 28: c9 leave
>> 29: c3 ret
> Thanks for doing the benchmarking.
> It's clear now that worst case overhead is ~5%.
> Could you do one more benchmark such that the 'main prog'
> below stays as-is with setup of r9 and push/pop r9,
> but in the subprog above there is no 'movabs r9 + add r9' ?
> To simulate the case when a big function with a large stack
> triggers private-stack use, but it calls a subprog without
> a private stack.
> I think we should see a different overhead.
> Obviously subprog won't have these two extra insns that setup r9
> which would lead to something like ~4% slowdown vs 5%,
> but I feel the overhead of pure push/pop r9 around calls
> will be lower as well, because r9 is not written into inside subprog.
> The CPU HW should be able to execute such push/pop faster.
> I'm curious what it is.
Sure. Let me do an experiment with this.
>
>> main prog:
>> 0: f3 0f 1e fa endbr64
>> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
>> 9: 66 90 xchg ax,ax
>> b: 55 push rbp
>> c: 48 89 e5 mov rbp,rsp
>> f: f3 0f 1e fa endbr64
>> 13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
>> 1a: 60 00 00
>> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
>> 24: 02 00
>> 26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
>> 2d: c9 ff ff
>> 30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
>> 34: 48 83 c6 01 add rsi,0x1
>> 38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
>> 3c: 41 51 push r9
>> 3e: e8 46 23 51 e1 call 0xffffffffe1512389
>> 43: 41 59 pop r9
>> 45: 41 51 push r9
>> 47: e8 3d 23 51 e1 call 0xffffffffe1512389
>> 4c: 41 59 pop r9
>> 4e: 41 51 push r9
>> 50: e8 34 23 51 e1 call 0xffffffffe1512389
>> 55: 41 59 pop r9
>> 57: 31 c0 xor eax,eax
>> 59: c9 leave
>> 5a: c3 ret
>>
> Also pls share 'perf annotate' of JIT-ed asm.
> I wonder where the hotspots are in the code.
Okay, will do.
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
@ 2024-07-22 16:43 ` Yonghong Song
2024-07-24 5:08 ` Yonghong Song
2024-07-22 20:57 ` Andrii Nakryiko
1 sibling, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2024-07-22 16:43 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> The main motivation for private stack comes from nested
>> scheduler in sched-ext from Tejun. The basic idea is that
>> - each cgroup will its own associated bpf program,
>> - bpf program with parent cgroup will call bpf programs
>> in immediate child cgroups.
>>
>> Let us say we have the following cgroup hierarchy:
>> root_cg (prog0):
>> cg1 (prog1):
>> cg11 (prog11):
>> cg111 (prog111)
>> cg112 (prog112)
>> cg12 (prog12):
>> cg121 (prog121)
>> cg122 (prog122)
>> cg2 (prog2):
>> cg21 (prog21)
>> cg22 (prog22)
>> cg23 (prog23)
>>
>> In the above example, prog0 will call a kfunc which will
>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>> then the information is summarized and sent back to prog0.
>> Similarly, prog11 and prog12 will be invoked in the kfunc
>> and the result will be summarized and sent back to prog1, etc.
>>
>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>> The each bpf program (including its subprograms) has maximum
>> 512B stack size to avoid potential stack overflow.
>> And nested bpf programs increase the risk of stack overflow.
>> To avoid potential stack overflow caused by bpf programs,
>> this patch implemented a private stack so bpf program stack
>> space is allocated dynamically when the program is jited.
>> Such private stack is applied to tracing programs like
>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>> tracing.
>>
>> But more than one instance of the same bpf program may
>> run in the system. To make things simple, percpu private
>> stack is allocated for each program, so if the same program
>> is running on different cpus concurrently, we won't have
>> any issue. Note that the kernel already have logic to prevent
>> the recursion for the same bpf program on the same cpu
>> (kprobe, fentry, etc.).
>>
>> The patch implemented a percpu private stack based approach
>> for x86 arch.
>> - The stack size will be 0 and any stack access is from
>> jit-time allocated percpu storage.
>> - In the beginning of jit, r9 is used to save percpu
>> private stack pointer.
>> - Each rbp in the bpf asm insn is replaced by r9.
>> - For each call, push r9 before the call and pop r9
>> after the call to preserve r9 value.
>>
>> Compared to previous RFC patch [1], this patch added
>> some conditions to enable private stack, e.g., verifier
>> calculated stack size, prog type, etc. The new patch
>> also added a performance test to compare private stack
>> vs. no private stack.
>>
>> The following are some code example to illustrate the idea
>> for selftest cgroup_skb_sk_lookup:
>>
>> the existing code the private-stack approach code
>> endbr64 endbr64
>> nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0]
>> xchg ax,ax xchg ax,ax
>> push rbp push rbp
>> mov rbp,rsp mov rbp,rsp
>> endbr64 endbr64
>> sub rsp,0x68
>> push rbx push rbx
>> ... ...
>> ... mov r9d,0x8c1c860
>> ... add r9,QWORD PTR gs:0x21a00
>> ... ...
>> mov rdx,rbp mov rdx, r9
>> add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4
>> ... ...
>> mov ecx,0x28 mov ecx,0x28
>> push r9
>> call 0xffffffffe305e474 call 0xffffffffe305e524
>> pop r9
>> mov rdi,rax mov rdi,rax
>> ... ...
>> movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46]
>> ... ...
>>
> Eduard nerd-sniped me today with this a bit... :)
>
> I have a few questions and suggestions.
>
> So it seems like each *subprogram* (not the entire BPF program) gets
> its own per-CPU private stack allocation. Is that intentional? That
Currently yes. The reason is that the same prog could be running on
different cpus at the same time.
> seems a bit unnecessary. It also prevents any sort of actual
> recursion. Not sure if it's possible to write recursive BPF subprogram
> today, verifier seems to reject obvious limited recursion cases, but
> still, eventually we might need/want to support that, and this will be
> just another hurdle to overcome (so it's best to avoid adding it in
> the first place).
>
> I'm sure Eduard is going to try something like below and it will
> probably break badly (I haven't tried, sorry):
>
> int entry(void *ctx);
>
> struct {
> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> __uint(max_entries, 1);
> __uint(key_size, sizeof(__u32));
> __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
> .values = {
> [0] = (void *)&entry,
> },
> };
>
> static __noinline int subprog1(void)
> {
> <some state on the stack>
>
> /* here entry will replace subprog1, and so we'll have
> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> */
> bpf_tail_call(ctx, &prog_array_init, 0);
>
> return 0;
> }
>
>
> SEC("raw_tp/sys_enter")
> int entry(void *ctx)
> {
> <some state on the stack>
>
> subprog1();
> }
>
> And we effectively have limited recursion where the entry's stack
> state is clobbered, no?
>
> So it seems like we need to support recursion.
>
>
> So, the question I have is. Why not do the following:
> a) only setup r9 *once* in entry program's prologue (before tail call
> jump target)
> b) before each call we can adjust r9 with current prog/subprog's
> maximum *own* stack, something like:
>
> push r9;
> r9 += 128; // 128 is subprog's stack usage
> call <some-subprog>
> pop r9;
>
> The idea being that on tail call or in subprog call we assume r9 is
> already pointing to the right place. We can probably also figure out
> how to avoid push/pop r9 if we make sure that subprogram always
> restores r9 (taking tail calls into account and all that, of course)?
>
> Is this feasible?
This is possible. I actually hacked up such an idea quickly. The basic
idea is to pass the frame pointer as an additional argument to the bpf
static subprog. This is a little bit complicated. It will probably
save some stack size, but I am not sure how much.
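Roughly, at the source level the idea would look like the following
(a hypothetical sketch, not what the verifier/jit does today):

  /* today: stack slots are addressed off the subprog's own frame (r10) */
  static __noinline int subprog(void *ctx);

  /* with the idea above: the caller passes its private frame pointer,
   * and r10-based accesses inside the subprog are rewritten to use
   * priv_fp instead
   */
  static __noinline int subprog(void *ctx, void *priv_fp);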
>
> Another question I have is whether it would be possible to just plain
> set rbp to private stack and keep using rbp in such a way that stack
> traces are preserved? I.e., save return address on private stack to
> unwinders can correctly jump back to kernel's stack?
I also tried this approach earlier. But it is very tricky: we need to
modify rbp/rsp, and additional jit code will be added.
If an interrupt happens, we will not be able to get a reliable stack trace.
>
> How stupid is what I propose above?
>
>
>> So the number of insns is increased by 1 + num_of_calls * 2.
>> Here the number of calls are those calls in the final jited binary.
>> Comparing function call itself, the push/pop overhead should be
>> minimum in most common cases.
>>
>> Our original use case is for sched-ext nested scheduler. This will be done
>> in the future.
>>
>> [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
>> include/linux/bpf.h | 2 ++
>> kernel/bpf/core.c | 20 ++++++++++++
>> kernel/bpf/syscall.c | 1 +
>> 4 files changed, 82 insertions(+), 4 deletions(-)
>>
> [...]
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 3:33 ` Eduard Zingerman
@ 2024-07-22 16:54 ` Yonghong Song
2024-07-22 17:53 ` Eduard Zingerman
2024-07-22 17:51 ` Alexei Starovoitov
2024-07-24 21:28 ` Yonghong Song
2 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2024-07-22 16:54 UTC (permalink / raw)
To: Eduard Zingerman, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
On 7/21/24 8:33 PM, Eduard Zingerman wrote:
> Hi Yonghong,
>
> In general I think that changes in this patch are logical and make sense.
> I have a suggestion regarding testing JIT related changes.
>
> We currently lack a convenient way to verify jit behaviour modulo
> runtime tests. I think we should have a capability to write tests like below:
>
> SEC("tp")
> __jited_x86("f: endbr64")
> __jited_x86("13: movabs $0x.*,%r9")
> __jited_x86("1d: add %gs:0x.*,%r9")
> __jited_x86("26: mov $0x1,%edi")
> __jited_x86("2b: mov %rdi,-0x8(%r9)")
> __jited_x86("2f: mov -0x8(%r9),%rdi")
> __jited_x86("33: xor %eax,%eax")
> __jited_x86("35: lock xchg %rax,-0x8(%r9)")
> __jited_x86("3a: lock xadd %rax,-0x8(%r9)")
> __naked void stack_access_insns(void)
> {
> asm volatile (
> "r1 = 1;"
> "*(u64 *)(r10 - 8) = r1;"
> "r1 = *(u64 *)(r10 - 8);"
> "r0 = 0;"
> "r0 = xchg_64(r10 - 8, r0);"
> "r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
> "exit;"
> ::: __clobber_all);
> }
>
> In the following branch I explored a way to add such capability:
> https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
>
> Beside testing exact translation, such tests also provide good
> starting point for people trying to figure out how some jit features work.
>
> The below two commits are the gist of the feature:
> 8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
> 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
>
> For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> an alternative would be to:
> a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
> test_progs executable;
> b. call libbfd (binutils dis-assembler) directly from the bpftool.
>
> Currently bpftool can use two dis-assemblers: libbfd and llvm library,
> depending on the build environment. For CI builds libbfd is used.
> I don't know if llvm and libbfd always produce same output for
> identical binary code. Imo, if people are Ok with adding libbfd
> dependency to test_progs, option (b) is the best. If folks on the
> mailing list agree with this, I can work on updating the patches.
I think this is a good idea in the long term.
Let me try with your patch.
>
> -------------
>
> Aside from testing I agree with Andrii regarding rbp usage,
> it seems like it should be possible to do the following in prologue:
>
> movabs $0x...,%rsp
> add %gs:0x...,%rsp
> push %rbp
>
> and there would be no need to modify translation for instructions
> accessing r10, plus debugger stack unrolling logic should still work?.
> Or am I mistaken?
This may not work. The 'push %rbp' does not change the %rbp value, which is
still the original %rbp.
>
> Thanks,
> Eduard
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 3:33 ` Eduard Zingerman
2024-07-22 16:54 ` Yonghong Song
@ 2024-07-22 17:51 ` Alexei Starovoitov
2024-07-22 18:22 ` Eduard Zingerman
2024-07-24 21:28 ` Yonghong Song
2 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-22 17:51 UTC (permalink / raw)
To: Eduard Zingerman
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Sun, Jul 21, 2024 at 8:33 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Aside from testing I agree with Andrii regarding rbp usage,
> it seems like it should be possible to do the following in prologue:
>
> movabs $0x...,%rsp
> add %gs:0x...,%rsp
> push %rbp
>
> and there would be no need to modify translation for instructions
> accessing r10, plus debugger stack unrolling logic should still work?.
> Or am I mistaken?
It's not that simple.
Above sequence violates -mno-red-zone.
The part of the fix may look like:
movabs .., rax
add %gs.., rax
mov rbp, qword ptr [rax - ...]
mov rax, rsp
mov rax, rbp
sub rsp, ...
it's probably correct from the mno-red-zone pov, and
the end result is maybe correct for stack unwind,
but if an irq happens in the middle it won't crash,
though unwind will not work.
The main reason to use r9 is to have valid unwind
at any point of the prog.
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 16:54 ` Yonghong Song
@ 2024-07-22 17:53 ` Eduard Zingerman
0 siblings, 0 replies; 33+ messages in thread
From: Eduard Zingerman @ 2024-07-22 17:53 UTC (permalink / raw)
To: Yonghong Song, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
On Mon, 2024-07-22 at 09:54 -0700, Yonghong Song wrote:
[...]
> > For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> > an alternative would be to:
> > a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
> > test_progs executable;
> > b. call libbfd (binutils dis-assembler) directly from the bpftool.
> >
> > Currently bpftool can use two dis-assemblers: libbfd and llvm library,
> > depending on the build environment. For CI builds libbfd is used.
> > I don't know if llvm and libbfd always produce same output for
> > identical binary code. Imo, if people are Ok with adding libbfd
> > dependency to test_progs, option (b) is the best. If folks on the
> > mailing list agree with this, I can work on updating the patches.
>
> I think this is a good idea in the long time.
> Let me try with your patch.
What do you think about a direct dependency on libbfd for test_progs:
should I update the disassembly function, or is popen'ing bpftool fine?
I'd prefer the libbfd dependency, tbh.
[...]
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 17:51 ` Alexei Starovoitov
@ 2024-07-22 18:22 ` Eduard Zingerman
2024-07-22 20:08 ` Alexei Starovoitov
0 siblings, 1 reply; 33+ messages in thread
From: Eduard Zingerman @ 2024-07-22 18:22 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Mon, 2024-07-22 at 10:51 -0700, Alexei Starovoitov wrote:
[...]
> It's not that simple.
> Above sequence violates -mno-red-zone.
> The part of the fix may look like:
> movabs .., rax
> add %gs.., rax
> mov rbp, qword ptr [rax - ...]
>
> mov rax, rsp
> mox rax, rbp
> sub rsp, ...
>
> it's probably correct from mno-red-zone pov and
> end result is maybe correct for stack unwind,
> but if irq happens in the middle it won't crash,
> but unwind will not work.
> The main reason to use r9 is to have valid unwind
> at any point of the prog.
Oh, I see, bad things would happen if this sequence:
movabs $0x...,%rsp
add %gs:0x...,%rsp
would be split by an interrupt.
However, I don't understand why 'push %rbp' violates the red zone.
In any case, the interrupt argument is sufficient,
thank you for explaining.
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 18:22 ` Eduard Zingerman
@ 2024-07-22 20:08 ` Alexei Starovoitov
0 siblings, 0 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-22 20:08 UTC (permalink / raw)
To: Eduard Zingerman
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Mon, Jul 22, 2024 at 11:22 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2024-07-22 at 10:51 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> > It's not that simple.
> > Above sequence violates -mno-red-zone.
> > The part of the fix may look like:
> > movabs .., rax
> > add %gs.., rax
> > mov rbp, qword ptr [rax - ...]
> >
> > mov rax, rsp
> > mox rax, rbp
> > sub rsp, ...
> >
> > it's probably correct from mno-red-zone pov and
> > end result is maybe correct for stack unwind,
> > but if irq happens in the middle it won't crash,
> > but unwind will not work.
> > The main reason to use r9 is to have valid unwind
> > at any point of the prog.
>
> Oh, I see, bad things would happen if this sequence:
>
> movabs $0x...,%rsp
> add %gs:0x...,%rsp
>
> Would be split by an interrupt.
> However, I don't understand why 'push %rbp' violates red zone.
> In any case, the interrupt argument is sufficient,
> thank you for explaining.
push rbp itself doesn't break the red zone.
If we do:
movabs .., rax
add %gs.., rax
mov rax, rsp
push rbp
at the time rsp is set, the unwind is broken.
Hence the idea to mov rbp, qword ptr [rax - ...]
before setting rsp.
We have to maintain an unwindable stack at all times.
But if we overwrite rsp it means everything will go into
this new memory.
We'd need another 16k of space in there for everything
that the bpf prog can call, since kernel code will be using
that area from the moment of the switch.
At the end we'd have to restore the original stack somehow.
Instead of a single 'leave' insn, the sequence has to preserve
unwinding. It all looks very tricky and fragile.
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43 ` Yonghong Song
@ 2024-07-22 20:57 ` Andrii Nakryiko
2024-07-23 1:05 ` Alexei Starovoitov
1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2024-07-22 20:57 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> > The main motivation for private stack comes from nested
> > scheduler in sched-ext from Tejun. The basic idea is that
> > - each cgroup will its own associated bpf program,
> > - bpf program with parent cgroup will call bpf programs
> > in immediate child cgroups.
> >
> > Let us say we have the following cgroup hierarchy:
> > root_cg (prog0):
> > cg1 (prog1):
> > cg11 (prog11):
> > cg111 (prog111)
> > cg112 (prog112)
> > cg12 (prog12):
> > cg121 (prog121)
> > cg122 (prog122)
> > cg2 (prog2):
> > cg21 (prog21)
> > cg22 (prog22)
> > cg23 (prog23)
> >
> > In the above example, prog0 will call a kfunc which will
> > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > then the information is summarized and sent back to prog0.
> > Similarly, prog11 and prog12 will be invoked in the kfunc
> > and the result will be summarized and sent back to prog1, etc.
> >
> > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > The each bpf program (including its subprograms) has maximum
> > 512B stack size to avoid potential stack overflow.
> > And nested bpf programs increase the risk of stack overflow.
> > To avoid potential stack overflow caused by bpf programs,
> > this patch implemented a private stack so bpf program stack
> > space is allocated dynamically when the program is jited.
> > Such private stack is applied to tracing programs like
> > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > tracing.
> >
> > But more than one instance of the same bpf program may
> > run in the system. To make things simple, percpu private
> > stack is allocated for each program, so if the same program
> > is running on different cpus concurrently, we won't have
> > any issue. Note that the kernel already have logic to prevent
> > the recursion for the same bpf program on the same cpu
> > (kprobe, fentry, etc.).
> >
> > The patch implemented a percpu private stack based approach
> > for x86 arch.
> > - The stack size will be 0 and any stack access is from
> > jit-time allocated percpu storage.
> > - In the beginning of jit, r9 is used to save percpu
> > private stack pointer.
> > - Each rbp in the bpf asm insn is replaced by r9.
> > - For each call, push r9 before the call and pop r9
> > after the call to preserve r9 value.
> >
> > Compared to previous RFC patch [1], this patch added
> > some conditions to enable private stack, e.g., verifier
> > calculated stack size, prog type, etc. The new patch
> > also added a performance test to compare private stack
> > vs. no private stack.
> >
> > The following are some code example to illustrate the idea
> > for selftest cgroup_skb_sk_lookup:
> >
> > the existing code the private-stack approach code
> > endbr64 endbr64
> > nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0]
> > xchg ax,ax xchg ax,ax
> > push rbp push rbp
> > mov rbp,rsp mov rbp,rsp
> > endbr64 endbr64
> > sub rsp,0x68
> > push rbx push rbx
> > ... ...
> > ... mov r9d,0x8c1c860
> > ... add r9,QWORD PTR gs:0x21a00
> > ... ...
> > mov rdx,rbp mov rdx, r9
> > add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4
> > ... ...
> > mov ecx,0x28 mov ecx,0x28
> > push r9
> > call 0xffffffffe305e474 call 0xffffffffe305e524
> > pop r9
> > mov rdi,rax mov rdi,rax
> > ... ...
> > movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46]
> > ... ...
> >
>
> Eduard nerd-sniped me today with this a bit... :)
>
> I have a few questions and suggestions.
>
> So it seems like each *subprogram* (not the entire BPF program) gets
> its own per-CPU private stack allocation. Is that intentional? That
> seems a bit unnecessary. It also prevents any sort of actual
> recursion. Not sure if it's possible to write recursive BPF subprogram
> today, verifier seems to reject obvious limited recursion cases, but
> still, eventually we might need/want to support that, and this will be
> just another hurdle to overcome (so it's best to avoid adding it in
> the first place).
>
> I'm sure Eduard is going to try something like below and it will
> probably break badly (I haven't tried, sorry):
>
> int entry(void *ctx);
>
> struct {
> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> __uint(max_entries, 1);
> __uint(key_size, sizeof(__u32));
> __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
> .values = {
> [0] = (void *)&entry,
> },
> };
>
> static __noinline int subprog1(void)
> {
> <some state on the stack>
>
> /* here entry will replace subprog1, and so we'll have
> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> */
> bpf_tail_call(ctx, &prog_array_init, 0);
>
> return 0;
> }
>
>
> SEC("raw_tp/sys_enter")
> int entry(void *ctx)
> {
> <some state on the stack>
>
> subprog1();
> }
>
> And we effectively have limited recursion where the entry's stack
> state is clobbered, no?
>
> So it seems like we need to support recursion.
>
How come everyone just completely ignored the main point of my entire
email and a real problem that has to be solved?...
Anyway, I did write the program below:
$ cat minimal.bpf.c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
int my_pid = 0;
int handle_tp(void *ctx);
struct {
__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
__uint(max_entries, 1);
__uint(key_size, sizeof(__u32));
__array(values, int (void *));
} prog_array_init SEC(".maps") = {
.values = {
[0] = (void *)&handle_tp,
},
};
static __noinline int subprog(void *ctx)
{
static int cnt;
cnt++;
bpf_printk("SUBPROG - BEFORE %d", cnt);
bpf_tail_call(ctx, &prog_array_init, 0);
bpf_printk("SUBPROG - AFTER %d", cnt);
return 0;
}
SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
static int cnt;
int pid = bpf_get_current_pid_tgid() >> 32;
if (pid != my_pid)
return 0;
cnt++;
bpf_printk("ENTRY - BEFORE %d", cnt);
subprog(ctx);
bpf_printk("ENTRY - AFTER %d", cnt);
return 0;
}
And triggered one write syscall, getting the log below. You can see
that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
to the tail call limit). And we do indeed get lots of entry program
recursion.
We *need to support recursion* is my main point. rbp is a distraction, sorry.
$ sudo cat /sys/kernel/tracing/trace_pipe
minimal-1219321 [025] ....1 8119446.322300: bpf_trace_printk:
ENTRY - BEFORE 1
minimal-1219321 [025] ....1 8119446.322303: bpf_trace_printk:
SUBPROG - BEFORE 1
minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
ENTRY - BEFORE 2
minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
SUBPROG - BEFORE 2
minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
ENTRY - BEFORE 3
minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
SUBPROG - BEFORE 3
minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
ENTRY - BEFORE 4
minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
SUBPROG - BEFORE 4
minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
ENTRY - BEFORE 5
minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
SUBPROG - BEFORE 5
minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
ENTRY - BEFORE 6
minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
SUBPROG - BEFORE 6
minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
ENTRY - BEFORE 7
minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
SUBPROG - BEFORE 7
minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
ENTRY - BEFORE 8
minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
SUBPROG - BEFORE 8
minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
ENTRY - BEFORE 9
minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
SUBPROG - BEFORE 9
minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
ENTRY - BEFORE 10
minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
SUBPROG - BEFORE 10
minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
ENTRY - BEFORE 11
minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
SUBPROG - BEFORE 11
minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
ENTRY - BEFORE 12
minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
SUBPROG - BEFORE 12
minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
ENTRY - BEFORE 13
minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
SUBPROG - BEFORE 13
minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
ENTRY - BEFORE 14
minimal-1219321 [025] ....1 8119446.322312: bpf_trace_printk:
SUBPROG - BEFORE 14
minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
ENTRY - BEFORE 15
minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
SUBPROG - BEFORE 15
minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
ENTRY - BEFORE 16
minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
SUBPROG - BEFORE 16
minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
ENTRY - BEFORE 17
minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
SUBPROG - BEFORE 17
minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
ENTRY - BEFORE 18
minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
SUBPROG - BEFORE 18
minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
ENTRY - BEFORE 19
minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
SUBPROG - BEFORE 19
minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
ENTRY - BEFORE 20
minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
SUBPROG - BEFORE 20
minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 21
minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
SUBPROG - BEFORE 21
minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 22
minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
SUBPROG - BEFORE 22
minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 23
minimal-1219321 [025] ....1 8119446.322317: bpf_trace_printk:
SUBPROG - BEFORE 23
minimal-1219321 [025] ....1 8119446.322318: bpf_trace_printk:
ENTRY - BEFORE 24
minimal-1219321 [025] ....1 8119446.322318: bpf_trace_printk:
SUBPROG - BEFORE 24
minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
ENTRY - BEFORE 25
minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
SUBPROG - BEFORE 25
minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
ENTRY - BEFORE 26
minimal-1219321 [025] ....1 8119446.322320: bpf_trace_printk:
SUBPROG - BEFORE 26
minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
ENTRY - BEFORE 27
minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
SUBPROG - BEFORE 27
minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
ENTRY - BEFORE 28
minimal-1219321 [025] ....1 8119446.322322: bpf_trace_printk:
SUBPROG - BEFORE 28
minimal-1219321 [025] ....1 8119446.322322: bpf_trace_printk:
ENTRY - BEFORE 29
minimal-1219321 [025] ....1 8119446.322323: bpf_trace_printk:
SUBPROG - BEFORE 29
minimal-1219321 [025] ....1 8119446.322323: bpf_trace_printk:
ENTRY - BEFORE 30
minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
SUBPROG - BEFORE 30
minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
ENTRY - BEFORE 31
minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
SUBPROG - BEFORE 31
minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
ENTRY - BEFORE 32
minimal-1219321 [025] ....1 8119446.322325: bpf_trace_printk:
SUBPROG - BEFORE 32
minimal-1219321 [025] ....1 8119446.322325: bpf_trace_printk:
ENTRY - BEFORE 33
minimal-1219321 [025] ....1 8119446.322326: bpf_trace_printk:
SUBPROG - BEFORE 33
minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
ENTRY - BEFORE 34
minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
SUBPROG - BEFORE 34
minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
SUBPROG - AFTER 34
minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322335: bpf_trace_printk:
ENTRY - AFTER 34
minimal-1219321 [025] ....1 8119446.322335: bpf_trace_printk:
ENTRY - AFTER 34
>
> So, the question I have is. Why not do the following:
> a) only setup r9 *once* in entry program's prologue (before tail call
> jump target)
> b) before each call we can adjust r9 with current prog/subprog's
> maximum *own* stack, something like:
>
> push r9;
> r9 += 128; // 128 is subprog's stack usage
> call <some-subprog>
> pop r9;
>
> The idea being that on tail call or in subprog call we assume r9 is
> already pointing to the right place. We can probably also figure out
> how to avoid push/pop r9 if we make sure that subprogram always
> restores r9 (taking tail calls into account and all that, of course)?
>
> Is this feasible?
>
> Another question I have is whether it would be possible to just plain
> set rbp to private stack and keep using rbp in such a way that stack
> traces are preserved? I.e., save return address on private stack to
> unwinders can correctly jump back to kernel's stack?
>
> How stupid is what I propose above?
>
>
> > So the number of insns is increased by 1 + num_of_calls * 2.
> > Here the number of calls are those calls in the final jited binary.
> > Comparing function call itself, the push/pop overhead should be
> > minimum in most common cases.
> >
> > Our original use case is for sched-ext nested scheduler. This will be done
> > in the future.
> >
> > [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
> >
> > Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> > ---
> > arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
> > include/linux/bpf.h | 2 ++
> > kernel/bpf/core.c | 20 ++++++++++++
> > kernel/bpf/syscall.c | 1 +
> > 4 files changed, 82 insertions(+), 4 deletions(-)
> >
>
> [...]
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 20:57 ` Andrii Nakryiko
@ 2024-07-23 1:05 ` Alexei Starovoitov
2024-07-23 3:26 ` Andrii Nakryiko
2024-07-23 5:30 ` Yonghong Song
0 siblings, 2 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-23 1:05 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >
> > > The main motivation for private stack comes from nested
> > > scheduler in sched-ext from Tejun. The basic idea is that
> > > - each cgroup will have its own associated bpf program,
> > > - the bpf program of a parent cgroup will call bpf programs
> > > in immediate child cgroups.
> > >
> > > Let us say we have the following cgroup hierarchy:
> > > root_cg (prog0):
> > > cg1 (prog1):
> > > cg11 (prog11):
> > > cg111 (prog111)
> > > cg112 (prog112)
> > > cg12 (prog12):
> > > cg121 (prog121)
> > > cg122 (prog122)
> > > cg2 (prog2):
> > > cg21 (prog21)
> > > cg22 (prog22)
> > > cg23 (prog23)
> > >
> > > In the above example, prog0 will call a kfunc which will
> > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > then the information is summarized and sent back to prog0.
> > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > and the result will be summarized and sent back to prog1, etc.
> > >
> > > Currently, for each thread, the x86 kernel allocates an 8KB stack.
> > > Each bpf program (including its subprograms) has a maximum
> > > 512B stack size to avoid potential stack overflow.
> > > And nested bpf programs increase the risk of stack overflow.
> > > To avoid potential stack overflow caused by bpf programs,
> > > this patch implemented a private stack so bpf program stack
> > > space is allocated dynamically when the program is jited.
> > > Such private stack is applied to tracing programs like
> > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > tracing.
> > >
> > > But more than one instance of the same bpf program may
> > > run in the system. To make things simple, percpu private
> > > stack is allocated for each program, so if the same program
> > > is running on different cpus concurrently, we won't have
> > > any issue. Note that the kernel already has logic to prevent
> > > the recursion for the same bpf program on the same cpu
> > > (kprobe, fentry, etc.).
> > >
> > > The patch implemented a percpu private stack based approach
> > > for x86 arch.
> > > - The stack size will be 0 and any stack access is from
> > > jit-time allocated percpu storage.
> > > - In the beginning of jit, r9 is used to save percpu
> > > private stack pointer.
> > > - Each rbp in the bpf asm insn is replaced by r9.
> > > - For each call, push r9 before the call and pop r9
> > > after the call to preserve r9 value.
> > >
> > > Compared to previous RFC patch [1], this patch added
> > > some conditions to enable private stack, e.g., verifier
> > > calculated stack size, prog type, etc. The new patch
> > > also added a performance test to compare private stack
> > > vs. no private stack.
> > >
> > > The following is a code example to illustrate the idea
> > > for the selftest cgroup_skb_sk_lookup:
> > >
> > > the existing code                      the private-stack approach code
> > > endbr64                                endbr64
> > > nop DWORD PTR [rax+rax*1+0x0]          nop DWORD PTR [rax+rax*1+0x0]
> > > xchg ax,ax                             xchg ax,ax
> > > push rbp                               push rbp
> > > mov rbp,rsp                            mov rbp,rsp
> > > endbr64                                endbr64
> > > sub rsp,0x68
> > > push rbx                               push rbx
> > > ...                                    ...
> > > ...                                    mov r9d,0x8c1c860
> > > ...                                    add r9,QWORD PTR gs:0x21a00
> > > ...                                    ...
> > > mov rdx,rbp                            mov rdx,r9
> > > add rdx,0xffffffffffffffb4             add rdx,0xffffffffffffffb4
> > > ...                                    ...
> > > mov ecx,0x28                           mov ecx,0x28
> > >                                        push r9
> > > call 0xffffffffe305e474                call 0xffffffffe305e524
> > >                                        pop r9
> > > mov rdi,rax                            mov rdi,rax
> > > ...                                    ...
> > > movzx rdi,BYTE PTR [rbp-0x46]          movzx rdi,BYTE PTR [r9-0x46]
> > > ...                                    ...
> > >
> >
> > Eduard nerd-sniped me today with this a bit... :)
> >
> > I have a few questions and suggestions.
> >
> > So it seems like each *subprogram* (not the entire BPF program) gets
> > its own per-CPU private stack allocation. Is that intentional? That
> > seems a bit unnecessary. It also prevents any sort of actual
> > recursion. Not sure if it's possible to write recursive BPF subprogram
> > today, verifier seems to reject obvious limited recursion cases, but
> > still, eventually we might need/want to support that, and this will be
> > just another hurdle to overcome (so it's best to avoid adding it in
> > the first place).
> >
> > I'm sure Eduard is going to try something like below and it will
> > probably break badly (I haven't tried, sorry):
> >
> > int entry(void *ctx);
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> >         __uint(max_entries, 1);
> >         __uint(key_size, sizeof(__u32));
> >         __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> >         .values = {
> >                 [0] = (void *)&entry,
> >         },
> > };
> >
> > static __noinline int subprog1(void)
> > {
> >         <some state on the stack>
> >
> >         /* here entry will replace subprog1, and so we'll have
> >          * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> >          */
> >         bpf_tail_call(ctx, &prog_array_init, 0);
> >
> >         return 0;
> > }
> >
> >
> > SEC("raw_tp/sys_enter")
> > int entry(void *ctx)
> > {
> >         <some state on the stack>
> >
> >         subprog1();
> > }
> >
> > And we effectively have limited recursion where the entry's stack
> > state is clobbered, no?
> >
> > So it seems like we need to support recursion.
> >
>
> How come everyone just completely ignored the main point of my entire
> email and a real problem that has to be solved?...
>
> Anyways, I did write the below program:
>
> $ cat minimal.bpf.c
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /* Copyright (c) 2020 Facebook */
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
>
> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>
> int my_pid = 0;
>
> int handle_tp(void *ctx);
>
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
>         .values = {
>                 [0] = (void *)&handle_tp,
>         },
> };
>
> static __noinline int subprog(void *ctx)
> {
>         static int cnt;
>
>         cnt++;
>
>         bpf_printk("SUBPROG - BEFORE %d", cnt);
>
>         bpf_tail_call(ctx, &prog_array_init, 0);
>
>         bpf_printk("SUBPROG - AFTER %d", cnt);
>
>         return 0;
> }
>
> SEC("tp/syscalls/sys_enter_write")
> int handle_tp(void *ctx)
> {
>         static int cnt;
>         int pid = bpf_get_current_pid_tgid() >> 32;
>
>         if (pid != my_pid)
>                 return 0;
>
>         cnt++;
>
>         bpf_printk("ENTRY - BEFORE %d", cnt);
>
>         subprog(ctx);
>
>         bpf_printk("ENTRY - AFTER %d", cnt);
>
>         return 0;
> }
>
>
> And triggered one write syscall, getting the log above. You can see
> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> to the tail call limit). And we do indeed get lots of entry program
> recurrence.
>
> We *need to support recursion* is my main point.
Not quite.
It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
static int cnt counts stuff because it's static.
So we don't need to support recursion with private stack,
but tail_calls and private stack are buggy indeed.
emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
stack is used.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-23 1:05 ` Alexei Starovoitov
@ 2024-07-23 3:26 ` Andrii Nakryiko
2024-07-24 3:17 ` Alexei Starovoitov
2024-07-23 5:30 ` Yonghong Song
1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2024-07-23 3:26 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Mon, Jul 22, 2024 at 6:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > > >
> > > > The main motivation for private stack comes from nested
> > > > scheduler in sched-ext from Tejun. The basic idea is that
> > > > - each cgroup will have its own associated bpf program,
> > > > - the bpf program of a parent cgroup will call bpf programs
> > > > in immediate child cgroups.
> > > >
> > > > Let us say we have the following cgroup hierarchy:
> > > > root_cg (prog0):
> > > > cg1 (prog1):
> > > > cg11 (prog11):
> > > > cg111 (prog111)
> > > > cg112 (prog112)
> > > > cg12 (prog12):
> > > > cg121 (prog121)
> > > > cg122 (prog122)
> > > > cg2 (prog2):
> > > > cg21 (prog21)
> > > > cg22 (prog22)
> > > > cg23 (prog23)
> > > >
> > > > In the above example, prog0 will call a kfunc which will
> > > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > > then the information is summarized and sent back to prog0.
> > > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > > and the result will be summarized and sent back to prog1, etc.
> > > >
> > > > Currently, for each thread, the x86 kernel allocates an 8KB stack.
> > > > Each bpf program (including its subprograms) has a maximum
> > > > 512B stack size to avoid potential stack overflow.
> > > > And nested bpf programs increase the risk of stack overflow.
> > > > To avoid potential stack overflow caused by bpf programs,
> > > > this patch implemented a private stack so bpf program stack
> > > > space is allocated dynamically when the program is jited.
> > > > Such private stack is applied to tracing programs like
> > > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > > tracing.
> > > >
> > > > But more than one instance of the same bpf program may
> > > > run in the system. To make things simple, percpu private
> > > > stack is allocated for each program, so if the same program
> > > > is running on different cpus concurrently, we won't have
> > > > any issue. Note that the kernel already has logic to prevent
> > > > the recursion for the same bpf program on the same cpu
> > > > (kprobe, fentry, etc.).
> > > >
> > > > The patch implemented a percpu private stack based approach
> > > > for x86 arch.
> > > > - The stack size will be 0 and any stack access is from
> > > > jit-time allocated percpu storage.
> > > > - In the beginning of jit, r9 is used to save percpu
> > > > private stack pointer.
> > > > - Each rbp in the bpf asm insn is replaced by r9.
> > > > - For each call, push r9 before the call and pop r9
> > > > after the call to preserve r9 value.
> > > >
> > > > Compared to previous RFC patch [1], this patch added
> > > > some conditions to enable private stack, e.g., verifier
> > > > calculated stack size, prog type, etc. The new patch
> > > > also added a performance test to compare private stack
> > > > vs. no private stack.
> > > >
> > > > The following is a code example to illustrate the idea
> > > > for the selftest cgroup_skb_sk_lookup:
> > > >
> > > > the existing code                      the private-stack approach code
> > > > endbr64                                endbr64
> > > > nop DWORD PTR [rax+rax*1+0x0]          nop DWORD PTR [rax+rax*1+0x0]
> > > > xchg ax,ax                             xchg ax,ax
> > > > push rbp                               push rbp
> > > > mov rbp,rsp                            mov rbp,rsp
> > > > endbr64                                endbr64
> > > > sub rsp,0x68
> > > > push rbx                               push rbx
> > > > ...                                    ...
> > > > ...                                    mov r9d,0x8c1c860
> > > > ...                                    add r9,QWORD PTR gs:0x21a00
> > > > ...                                    ...
> > > > mov rdx,rbp                            mov rdx,r9
> > > > add rdx,0xffffffffffffffb4             add rdx,0xffffffffffffffb4
> > > > ...                                    ...
> > > > mov ecx,0x28                           mov ecx,0x28
> > > >                                        push r9
> > > > call 0xffffffffe305e474                call 0xffffffffe305e524
> > > >                                        pop r9
> > > > mov rdi,rax                            mov rdi,rax
> > > > ...                                    ...
> > > > movzx rdi,BYTE PTR [rbp-0x46]          movzx rdi,BYTE PTR [r9-0x46]
> > > > ...                                    ...
> > > >
> > >
> > > Eduard nerd-sniped me today with this a bit... :)
> > >
> > > I have a few questions and suggestions.
> > >
> > > So it seems like each *subprogram* (not the entire BPF program) gets
> > > its own per-CPU private stack allocation. Is that intentional? That
> > > seems a bit unnecessary. It also prevents any sort of actual
> > > recursion. Not sure if it's possible to write recursive BPF subprogram
> > > today, verifier seems to reject obvious limited recursion cases, but
> > > still, eventually we might need/want to support that, and this will be
> > > just another hurdle to overcome (so it's best to avoid adding it in
> > > the first place).
> > >
> > > I'm sure Eduard is going to try something like below and it will
> > > probably break badly (I haven't tried, sorry):
> > >
> > > int entry(void *ctx);
> > >
> > > struct {
> > > __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> > > __uint(max_entries, 1);
> > > __uint(key_size, sizeof(__u32));
> > > __array(values, int (void *));
> > > } prog_array_init SEC(".maps") = {
> > > .values = {
> > > [0] = (void *)&entry,
> > > },
> > > };
> > >
> > > static __noinline int subprog1(void)
> > > {
> > > <some state on the stack>
> > >
> > > /* here entry will replace subprog1, and so we'll have
> > > * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> > > */
> > > bpf_tail_call(ctx, &prog_array_init, 0);
> > >
> > > return 0;
> > > }
> > >
> > >
> > > SEC("raw_tp/sys_enter")
> > > int entry(void *ctx)
> > > {
> > > <some state on the stack>
> > >
> > > subprog1();
> > > }
> > >
> > > And we effectively have limited recursion where the entry's stack
> > > state is clobbered, no?
> > >
> > > So it seems like we need to support recursion.
> > >
> >
> > How come everyone just completely ignored the main point of my entire
> > email and a real problem that has to be solved?...
> >
> > Anyways, I did write the below program:
> >
> > $ cat minimal.bpf.c
> > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> > /* Copyright (c) 2020 Facebook */
> > #include <linux/bpf.h>
> > #include <bpf/bpf_helpers.h>
> >
> > char LICENSE[] SEC("license") = "Dual BSD/GPL";
> >
> > int my_pid = 0;
> >
> > int handle_tp(void *ctx);
> >
> > struct {
> > __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> > __uint(max_entries, 1);
> > __uint(key_size, sizeof(__u32));
> > __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> > .values = {
> > [0] = (void *)&handle_tp,
> > },
> > };
> >
> > static __noinline int subprog(void *ctx)
> > {
> > static int cnt;
> >
> > cnt++;
> >
> > bpf_printk("SUBPROG - BEFORE %d", cnt);
> >
> > bpf_tail_call(ctx, &prog_array_init, 0);
> >
> > bpf_printk("SUBPROG - AFTER %d", cnt);
> >
> > return 0;
> > }
> >
> > SEC("tp/syscalls/sys_enter_write")
> > int handle_tp(void *ctx)
> > {
> > static int cnt;
> > int pid = bpf_get_current_pid_tgid() >> 32;
> >
> > if (pid != my_pid)
> > return 0;
> >
> > cnt++;
> >
> > bpf_printk("ENTRY - BEFORE %d", cnt);
> >
> > subprog(ctx);
> >
> > bpf_printk("ENTRY - AFTER %d", cnt);
> >
> > return 0;
> > }
> >
> >
> > And triggered one write syscall, getting the log above. You can see
> > that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> > to the tail call limit). And we do indeed get lots of entry program
> > recurrence.
> >
> > We *need to support recursion* is my main point.
>
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
messages. We do return to all the nested handle_tp() calls and
continue just fine.
I put the log into [0] for a bit easier visual inspection.
[0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-23 1:05 ` Alexei Starovoitov
2024-07-23 3:26 ` Andrii Nakryiko
@ 2024-07-23 5:30 ` Yonghong Song
2024-07-23 7:02 ` Yonghong Song
1 sibling, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2024-07-23 5:30 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> The main motivation for private stack comes from nested
>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>> - each cgroup will have its own associated bpf program,
>>>> - the bpf program of a parent cgroup will call bpf programs
>>>> in immediate child cgroups.
>>>>
>>>> Let us say we have the following cgroup hierarchy:
>>>> root_cg (prog0):
>>>> cg1 (prog1):
>>>> cg11 (prog11):
>>>> cg111 (prog111)
>>>> cg112 (prog112)
>>>> cg12 (prog12):
>>>> cg121 (prog121)
>>>> cg122 (prog122)
>>>> cg2 (prog2):
>>>> cg21 (prog21)
>>>> cg22 (prog22)
>>>> cg23 (prog23)
>>>>
>>>> In the above example, prog0 will call a kfunc which will
>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>> then the information is summarized and sent back to prog0.
>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>> and the result will be summarized and sent back to prog1, etc.
>>>>
>>>> Currently, for each thread, the x86 kernel allocates an 8KB stack.
>>>> Each bpf program (including its subprograms) has a maximum
>>>> 512B stack size to avoid potential stack overflow.
>>>> And nested bpf programs increase the risk of stack overflow.
>>>> To avoid potential stack overflow caused by bpf programs,
>>>> this patch implemented a private stack so bpf program stack
>>>> space is allocated dynamically when the program is jited.
>>>> Such private stack is applied to tracing programs like
>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>> tracing.
>>>>
>>>> But more than one instance of the same bpf program may
>>>> run in the system. To make things simple, percpu private
>>>> stack is allocated for each program, so if the same program
>>>> is running on different cpus concurrently, we won't have
>>>> any issue. Note that the kernel already has logic to prevent
>>>> the recursion for the same bpf program on the same cpu
>>>> (kprobe, fentry, etc.).
>>>>
>>>> The patch implemented a percpu private stack based approach
>>>> for x86 arch.
>>>> - The stack size will be 0 and any stack access is from
>>>> jit-time allocated percpu storage.
>>>> - In the beginning of jit, r9 is used to save percpu
>>>> private stack pointer.
>>>> - Each rbp in the bpf asm insn is replaced by r9.
>>>> - For each call, push r9 before the call and pop r9
>>>> after the call to preserve r9 value.
>>>>
>>>> Compared to previous RFC patch [1], this patch added
>>>> some conditions to enable private stack, e.g., verifier
>>>> calculated stack size, prog type, etc. The new patch
>>>> also added a performance test to compare private stack
>>>> vs. no private stack.
>>>>
>>>> The following is a code example to illustrate the idea
>>>> for the selftest cgroup_skb_sk_lookup:
>>>>
>>>> the existing code                      the private-stack approach code
>>>> endbr64                                endbr64
>>>> nop DWORD PTR [rax+rax*1+0x0]          nop DWORD PTR [rax+rax*1+0x0]
>>>> xchg ax,ax                             xchg ax,ax
>>>> push rbp                               push rbp
>>>> mov rbp,rsp                            mov rbp,rsp
>>>> endbr64                                endbr64
>>>> sub rsp,0x68
>>>> push rbx                               push rbx
>>>> ...                                    ...
>>>> ...                                    mov r9d,0x8c1c860
>>>> ...                                    add r9,QWORD PTR gs:0x21a00
>>>> ...                                    ...
>>>> mov rdx,rbp                            mov rdx,r9
>>>> add rdx,0xffffffffffffffb4             add rdx,0xffffffffffffffb4
>>>> ...                                    ...
>>>> mov ecx,0x28                           mov ecx,0x28
>>>>                                        push r9
>>>> call 0xffffffffe305e474                call 0xffffffffe305e524
>>>>                                        pop r9
>>>> mov rdi,rax                            mov rdi,rax
>>>> ...                                    ...
>>>> movzx rdi,BYTE PTR [rbp-0x46]          movzx rdi,BYTE PTR [r9-0x46]
>>>> ...                                    ...
>>>>
>>> Eduard nerd-sniped me today with this a bit... :)
>>>
>>> I have a few questions and suggestions.
>>>
>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>> its own per-CPU private stack allocation. Is that intentional? That
>>> seems a bit unnecessary. It also prevents any sort of actual
>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>> today, verifier seems to reject obvious limited recursion cases, but
>>> still, eventually we might need/want to support that, and this will be
>>> just another hurdle to overcome (so it's best to avoid adding it in
>>> the first place).
>>>
>>> I'm sure Eduard is going to try something like below and it will
>>> probably break badly (I haven't tried, sorry):
>>>
>>> int entry(void *ctx);
>>>
>>> struct {
>>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>> __uint(max_entries, 1);
>>> __uint(key_size, sizeof(__u32));
>>> __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>> .values = {
>>> [0] = (void *)&entry,
>>> },
>>> };
>>>
>>> static __noinline int subprog1(void)
>>> {
>>> <some state on the stack>
>>>
>>> /* here entry will replace subprog1, and so we'll have
>>> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>> */
>>> bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>> return 0;
>>> }
>>>
>>>
>>> SEC("raw_tp/sys_enter")
>>> int entry(void *ctx)
>>> {
>>> <some state on the stack>
>>>
>>> subprog1();
>>> }
>>>
>>> And we effectively have limited recursion where the entry's stack
>>> state is clobbered, no?
>>>
>>> So it seems like we need to support recursion.
>>>
>> How come everyone just completely ignored the main point of my entire
>> email and a real problem that has to be solved?...
>>
>> Anyways, I did write the below program:
>>
>> $ cat minimal.bpf.c
>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /* Copyright (c) 2020 Facebook */
>> #include <linux/bpf.h>
>> #include <bpf/bpf_helpers.h>
>>
>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>
>> int my_pid = 0;
>>
>> int handle_tp(void *ctx);
>>
>> struct {
>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>> __uint(max_entries, 1);
>> __uint(key_size, sizeof(__u32));
>> __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>> .values = {
>> [0] = (void *)&handle_tp,
>> },
>> };
>>
>> static __noinline int subprog(void *ctx)
>> {
>> static int cnt;
>>
>> cnt++;
>>
>> bpf_printk("SUBPROG - BEFORE %d", cnt);
>>
>> bpf_tail_call(ctx, &prog_array_init, 0);
>>
>> bpf_printk("SUBPROG - AFTER %d", cnt);
>>
>> return 0;
>> }
>>
>> SEC("tp/syscalls/sys_enter_write")
>> int handle_tp(void *ctx)
>> {
>> static int cnt;
>> int pid = bpf_get_current_pid_tgid() >> 32;
>>
>> if (pid != my_pid)
>> return 0;
>>
>> cnt++;
>>
>> bpf_printk("ENTRY - BEFORE %d", cnt);
>>
>> subprog(ctx);
>>
>> bpf_printk("ENTRY - AFTER %d", cnt);
>>
>> return 0;
>> }
>>
>>
>> And triggered one write syscall, getting the log above. You can see
>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>> to the tail call limit). And we do indeed get lots of entry program
>> recurrence.
>>
>> We *need to support recursion* is my main point.
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.
Right, the stack_depth argument in emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
should be 0 if the private stack is used. Will fix in the next revision.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-23 5:30 ` Yonghong Song
@ 2024-07-23 7:02 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-23 7:02 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/22/24 10:30 PM, Yonghong Song wrote:
>
> On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
>> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song
>>>> <yonghong.song@linux.dev> wrote:
>>>>> The main motivation for private stack comes from nested
>>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>>> - each cgroup will have its own associated bpf program,
>>>>> - the bpf program of a parent cgroup will call bpf programs
>>>>> in immediate child cgroups.
>>>>>
>>>>> Let us say we have the following cgroup hierarchy:
>>>>> root_cg (prog0):
>>>>> cg1 (prog1):
>>>>> cg11 (prog11):
>>>>> cg111 (prog111)
>>>>> cg112 (prog112)
>>>>> cg12 (prog12):
>>>>> cg121 (prog121)
>>>>> cg122 (prog122)
>>>>> cg2 (prog2):
>>>>> cg21 (prog21)
>>>>> cg22 (prog22)
>>>>> cg23 (prog23)
>>>>>
>>>>> In the above example, prog0 will call a kfunc which will
>>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>>> then the information is summarized and sent back to prog0.
>>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>>> and the result will be summarized and sent back to prog1, etc.
>>>>>
>>>>> Currently, for each thread, the x86 kernel allocates an 8KB stack.
>>>>> Each bpf program (including its subprograms) has a maximum
>>>>> 512B stack size to avoid potential stack overflow.
>>>>> And nested bpf programs increase the risk of stack overflow.
>>>>> To avoid potential stack overflow caused by bpf programs,
>>>>> this patch implemented a private stack so bpf program stack
>>>>> space is allocated dynamically when the program is jited.
>>>>> Such private stack is applied to tracing programs like
>>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>>> tracing.
>>>>>
>>>>> But more than one instance of the same bpf program may
>>>>> run in the system. To make things simple, percpu private
>>>>> stack is allocated for each program, so if the same program
>>>>> is running on different cpus concurrently, we won't have
>>>>> any issue. Note that the kernel already has logic to prevent
>>>>> the recursion for the same bpf program on the same cpu
>>>>> (kprobe, fentry, etc.).
>>>>>
>>>>> The patch implemented a percpu private stack based approach
>>>>> for x86 arch.
>>>>> - The stack size will be 0 and any stack access is from
>>>>> jit-time allocated percpu storage.
>>>>> - In the beginning of jit, r9 is used to save percpu
>>>>> private stack pointer.
>>>>> - Each rbp in the bpf asm insn is replaced by r9.
>>>>> - For each call, push r9 before the call and pop r9
>>>>> after the call to preserve r9 value.
>>>>>
>>>>> Compared to previous RFC patch [1], this patch added
>>>>> some conditions to enable private stack, e.g., verifier
>>>>> calculated stack size, prog type, etc. The new patch
>>>>> also added a performance test to compare private stack
>>>>> vs. no private stack.
>>>>>
>>>>> The following is a code example to illustrate the idea
>>>>> for the selftest cgroup_skb_sk_lookup:
>>>>>
>>>>> the existing code                      the private-stack approach code
>>>>> endbr64                                endbr64
>>>>> nop DWORD PTR [rax+rax*1+0x0]          nop DWORD PTR [rax+rax*1+0x0]
>>>>> xchg ax,ax                             xchg ax,ax
>>>>> push rbp                               push rbp
>>>>> mov rbp,rsp                            mov rbp,rsp
>>>>> endbr64                                endbr64
>>>>> sub rsp,0x68
>>>>> push rbx                               push rbx
>>>>> ...                                    ...
>>>>> ...                                    mov r9d,0x8c1c860
>>>>> ...                                    add r9,QWORD PTR gs:0x21a00
>>>>> ...                                    ...
>>>>> mov rdx,rbp                            mov rdx,r9
>>>>> add rdx,0xffffffffffffffb4             add rdx,0xffffffffffffffb4
>>>>> ...                                    ...
>>>>> mov ecx,0x28                           mov ecx,0x28
>>>>>                                        push r9
>>>>> call 0xffffffffe305e474                call 0xffffffffe305e524
>>>>>                                        pop r9
>>>>> mov rdi,rax                            mov rdi,rax
>>>>> ...                                    ...
>>>>> movzx rdi,BYTE PTR [rbp-0x46]          movzx rdi,BYTE PTR [r9-0x46]
>>>>> ...                                    ...
>>>>>
>>>> Eduard nerd-sniped me today with this a bit... :)
>>>>
>>>> I have a few questions and suggestions.
>>>>
>>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>>> its own per-CPU private stack allocation. Is that intentional? That
>>>> seems a bit unnecessary. It also prevents any sort of actual
>>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>>> today, verifier seems to reject obvious limited recursion cases, but
>>>> still, eventually we might need/want to support that, and this will be
>>>> just another hurdle to overcome (so it's best to avoid adding it in
>>>> the first place).
>>>>
>>>> I'm sure Eduard is going to try something like below and it will
>>>> probably break badly (I haven't tried, sorry):
>>>>
>>>> int entry(void *ctx);
>>>>
>>>> struct {
>>>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>> __uint(max_entries, 1);
>>>> __uint(key_size, sizeof(__u32));
>>>> __array(values, int (void *));
>>>> } prog_array_init SEC(".maps") = {
>>>> .values = {
>>>> [0] = (void *)&entry,
>>>> },
>>>> };
>>>>
>>>> static __noinline int subprog1(void)
>>>> {
>>>> <some state on the stack>
>>>>
>>>> /* here entry will replace subprog1, and so we'll have
>>>> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>>> */
>>>> bpf_tail_call(ctx, &prog_array_init, 0);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>>
>>>> SEC("raw_tp/sys_enter")
>>>> int entry(void *ctx)
>>>> {
>>>> <some state on the stack>
>>>>
>>>> subprog1();
>>>> }
>>>>
>>>> And we effectively have limited recursion where the entry's stack
>>>> state is clobbered, no?
>>>>
>>>> So it seems like we need to support recursion.
>>>>
>>> How come everyone just completely ignored the main point of my entire
>>> email and a real problem that has to be solved?...
>>>
>>> Anyways, I did write the below program:
>>>
>>> $ cat minimal.bpf.c
>>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>> /* Copyright (c) 2020 Facebook */
>>> #include <linux/bpf.h>
>>> #include <bpf/bpf_helpers.h>
>>>
>>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>>
>>> int my_pid = 0;
>>>
>>> int handle_tp(void *ctx);
>>>
>>> struct {
>>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>> __uint(max_entries, 1);
>>> __uint(key_size, sizeof(__u32));
>>> __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>> .values = {
>>> [0] = (void *)&handle_tp,
>>> },
>>> };
>>>
>>> static __noinline int subprog(void *ctx)
>>> {
>>> static int cnt;
>>>
>>> cnt++;
>>>
>>> bpf_printk("SUBPROG - BEFORE %d", cnt);
>>>
>>> bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>> bpf_printk("SUBPROG - AFTER %d", cnt);
>>>
>>> return 0;
>>> }
>>>
>>> SEC("tp/syscalls/sys_enter_write")
>>> int handle_tp(void *ctx)
>>> {
>>> static int cnt;
>>> int pid = bpf_get_current_pid_tgid() >> 32;
>>>
>>> if (pid != my_pid)
>>> return 0;
>>>
>>> cnt++;
>>>
>>> bpf_printk("ENTRY - BEFORE %d", cnt);
>>>
>>> subprog(ctx);
>>>
>>> bpf_printk("ENTRY - AFTER %d", cnt);
>>>
>>> return 0;
>>> }
>>>
>>>
>>> And triggered one write syscall, getting the log above. You can see
>>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>>> to the tail call limit). And we do indeed get lots of entry program
>>> recurrence.
>>>
>>> We *need to support recursion* is my main point.
>> Not quite.
>> It's not a recursion. The stack collapsed/gone/wiped out before
>> tail_call.
>> static int cnt counts stuff because it's static.
>>
>> So we don't need to support recursion with private stack,
>> but tail_calls and private stack are buggy indeed.
>>
>> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
>> stack is used.
>
> Right, the stack_depth argument in
> emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
> should be 0 if the private stack is used. Will fix in the next revision.
Actually, the current implementation is correct. We already set stack_depth to 0
if the private stack is used. So we should be fine here.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-23 3:26 ` Andrii Nakryiko
@ 2024-07-24 3:17 ` Alexei Starovoitov
2024-07-24 4:06 ` Andrii Nakryiko
2024-07-24 4:32 ` Yonghong Song
0 siblings, 2 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-24 3:17 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> > > We *need to support recursion* is my main point.
> >
> > Not quite.
> > It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>
> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
> messages. We do return to all the nested handle_tp() calls and
> continue just fine.
>
> I put the log into [0] for a bit easier visual inspection.
>
> [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
Argh. So a pathological prog can consume 512*33 bytes of stack.
We have to reject it somehow in the verifier or tailor the private stack
to support it. Then the private stack will be a feature and a fix for this issue.
But then it would need to preallocate 512*33 bytes per cpu per program,
which is too much.
Maybe we can preallocate an _aligned_ 512 or 1k per cpu per prog,
then adjust r9 before each call or tail_call, and if r9 is about to cross
the alignment boundary before a tail_call, fail the tail call (as if the
tail call cnt was over the limit).
Hopefully there are better ideas, since it's all quite messy.
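Roughly, in C rather than emitted asm (just a sketch; PRIV_STACK_SZ and the
function name are made up, and it assumes the allocation is aligned to its
own size and the frame pointer moves upward):

#include <stdbool.h>
#include <stdint.h>

#define PRIV_STACK_SZ 1024      /* assumed per-cpu allocation, aligned to its size */

/* If the private stack is aligned to its own size, overflow can be
 * detected without tracking the base: when advancing r9 by the callee's
 * worst-case stack crosses an alignment boundary, the frame would leave
 * the preallocated region, so the tail call should fail.
 */
static bool crosses_priv_stack_boundary(uintptr_t r9, uint32_t callee_stack)
{
        uintptr_t mask = ~(uintptr_t)(PRIV_STACK_SZ - 1);

        return ((r9 + callee_stack) & mask) != (r9 & mask);
}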
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 3:17 ` Alexei Starovoitov
@ 2024-07-24 4:06 ` Andrii Nakryiko
2024-07-24 4:46 ` Yonghong Song
2024-07-24 4:32 ` Yonghong Song
1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2024-07-24 4:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Tue, Jul 23, 2024 at 8:17 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > > > We *need to support recursion* is my main point.
> > >
> > > Not quite.
> > > It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> >
> > Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
> > messages. We do return to all the nested handle_tp() calls and
> > continue just fine.
> >
> > I put the log into [0] for a bit easier visual inspection.
> >
> > [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
>
> Argh. So a pathological prog can consume 512*33 bytes of stack.
> We have to reject it somehow in the verifier or tailor the private stack
> to support it. Then the private stack will be a feature and a fix for this issue.
> But then it would need to preallocate 512*33 bytes per cpu per program,
> which is too much.
> Maybe we can preallocate an _aligned_ 512 or 1k per cpu per prog,
> then adjust r9 before each call or tail_call, and if r9 is about to cross
> the alignment boundary before a tail_call, fail the tail call (as if the
> tail call cnt was over the limit).
This is close to what I proposed to Yonghong offline. One approach I
had in mind was as follows. If we know that a BPF program can do a
tail call, then allocate a somewhat larger private stack (1KB, 4KB, 8KB,
don't know) compared to what the BPF program itself would need. Then in
the bpf_tail_call() helper's inlining itself check whether R9 +
<max_prog_stack_size> is larger than the private stack's size. And if
yes, then don't do the tail call (as if we reached the max number of tail
calls). The tail call interface allows for that.

This way we don't slow down typical non-tail call cases and don't pay
an unnecessary memory price, but we still make tail calls work just fine
in most cases, except some pathological ones like my example. I think
the expected situation for a tail call is to replace the main program with
another main program, so the typical case will work perfectly fine.
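In C-ish pseudocode, and for contrast with the alignment trick above, the
inlined check could look like this minimal sketch (the size, the base
tracking, and all names are made-up assumptions, not the actual inlined
code):

#include <stdbool.h>
#include <stdint.h>

#define PRIV_STACK_SZ 4096      /* assumed preallocated per-cpu size */

/* r9 is the current private-stack pointer and base the start of the
 * per-cpu allocation; both would be known to the JIT. Behave as if the
 * tail call count limit was hit when the target program's worst-case
 * frame would run past the allocation.
 */
static bool tail_call_allowed(uintptr_t r9, uintptr_t base,
                              uint32_t max_prog_stack)
{
        return r9 + max_prog_stack <= base + PRIV_STACK_SZ;
}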
> Hopefully there are better ideas, since it's all quite messy.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 3:17 ` Alexei Starovoitov
2024-07-24 4:06 ` Andrii Nakryiko
@ 2024-07-24 4:32 ` Yonghong Song
1 sibling, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-24 4:32 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/23/24 8:17 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>>> We *need to support recursion* is my main point.
>>> Not quite.
>>> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
>> messages. We do return to all the nested handle_tp() calls and
>> continue just fine.
>>
>> I put the log into [0] for a bit easier visual inspection.
>>
>> [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
> Argh. So a pathological prog can consume 512*33 bytes of stack.
We have a check in the verifier like below:

        if (idx && subprog[idx].has_tail_call && depth >= 256) {
                verbose(env,
                        "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
                        depth);
                return -EACCES;
        }

So the maximum stack size could be around 256 * 33, which is a little bit more than 8KB.
> We have to reject it somehow in the verifier or tailor the private stack
> to support it. Then the private stack will be a feature and a fix for this issue.
> But then it would need to preallocate 512*33 bytes per cpu per program,
> which is too much.
> Maybe we can preallocate an _aligned_ 512 or 1k per cpu per prog,
> then adjust r9 before each call or tail_call, and if r9 is about to cross
> the alignment boundary before a tail_call, fail the tail call (as if the
> tail call cnt was over the limit).
> Hopefully there are better ideas, since it's all quite messy.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 4:06 ` Andrii Nakryiko
@ 2024-07-24 4:46 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-24 4:46 UTC (permalink / raw)
To: Andrii Nakryiko, Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 7/23/24 9:06 PM, Andrii Nakryiko wrote:
> On Tue, Jul 23, 2024 at 8:17 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>>>> We *need to support recursion* is my main point.
>>>> Not quite.
>>>> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>>> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
>>> messages. We do return to all the nested handle_tp() calls and
>>> continue just fine.
>>>
>>> I put the log into [0] for a bit easier visual inspection.
>>>
>>> [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
>> Argh. So a pathological prog can consume 512*33 bytes of stack.
>> We have to reject it somehow in the verifier or tailor the private stack
>> to support it. Then the private stack will be a feature and a fix for this issue.
>> But then it would need to preallocate 512*33 bytes per cpu per program,
>> which is too much.
>> Maybe we can preallocate an _aligned_ 512 or 1k per cpu per prog,
>> then adjust r9 before each call or tail_call, and if r9 is about to cross
>> the alignment boundary before a tail_call, fail the tail call (as if the
>> tail call cnt was over the limit).
> This is close to what I proposed to Yonghong offline. One approach I
> had in mind was as follows. If we know that a BPF program can do a
> tail call, then allocate a somewhat larger private stack (1KB, 4KB, 8KB,
> don't know) compared to what the BPF program itself would need. Then in
> the bpf_tail_call() helper's inlining itself check whether R9 +
> <max_prog_stack_size> is larger than the private stack's size. And if
> yes, then don't do the tail call (as if we reached the max number of tail
> calls). The tail call interface allows for that.
>
> This way we don't slow down typical non-tail call cases and don't pay
> an unnecessary memory price, but we still make tail calls work just fine
> in most cases, except some pathological ones like my example. I think
> the expected situation for a tail call is to replace the main program with
> another main program, so the typical case will work perfectly fine.
Indeed, this is an approach we could use. Based on the recursion 'cnt',
we could calculate the frame pointer properly based on the originally
allocated frame pointer. Currently, the private stack only supports
tracing programs. It should be extremely rare for a tracing program
to self-recurse with tail calls. So we could allocate a smaller amount
of memory, e.g. 1KB or 2KB, and add runtime checking against the private
stack size to prevent stack overflow.
>
>> Hopefully there are better ideas, since it's all quite messy.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 16:43 ` Yonghong Song
@ 2024-07-24 5:08 ` Yonghong Song
2024-07-24 16:54 ` Alexei Starovoitov
0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2024-07-24 5:08 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau
On 7/22/24 9:43 AM, Yonghong Song wrote:
>
> On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song
>> <yonghong.song@linux.dev> wrote:
>>> The main motivation for private stack comes from nested
>>> scheduler in sched-ext from Tejun. The basic idea is that
>>> - each cgroup will have its own associated bpf program,
>>> - the bpf program of a parent cgroup will call bpf programs
>>> in immediate child cgroups.
>>>
>>> Let us say we have the following cgroup hierarchy:
>>> root_cg (prog0):
>>> cg1 (prog1):
>>> cg11 (prog11):
>>> cg111 (prog111)
>>> cg112 (prog112)
>>> cg12 (prog12):
>>> cg121 (prog121)
>>> cg122 (prog122)
>>> cg2 (prog2):
>>> cg21 (prog21)
>>> cg22 (prog22)
>>> cg23 (prog23)
>>>
>>> In the above example, prog0 will call a kfunc which will
>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>> then the information is summarized and sent back to prog0.
>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>> and the result will be summarized and sent back to prog1, etc.
>>>
>>> Currently, for each thread, the x86 kernel allocates an 8KB stack.
>>> Each bpf program (including its subprograms) has a maximum
>>> 512B stack size to avoid potential stack overflow.
>>> And nested bpf programs increase the risk of stack overflow.
>>> To avoid potential stack overflow caused by bpf programs,
>>> this patch implemented a private stack so bpf program stack
>>> space is allocated dynamically when the program is jited.
>>> Such private stack is applied to tracing programs like
>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>> tracing.
>>>
>>> But more than one instance of the same bpf program may
>>> run in the system. To make things simple, percpu private
>>> stack is allocated for each program, so if the same program
>>> is running on different cpus concurrently, we won't have
>>> any issue. Note that the kernel already has logic to prevent
>>> the recursion for the same bpf program on the same cpu
>>> (kprobe, fentry, etc.).
>>>
>>> The patch implemented a percpu private stack based approach
>>> for x86 arch.
>>> - The stack size will be 0 and any stack access is from
>>> jit-time allocated percpu storage.
>>> - In the beginning of jit, r9 is used to save percpu
>>> private stack pointer.
>>> - Each rbp in the bpf asm insn is replaced by r9.
>>> - For each call, push r9 before the call and pop r9
>>> after the call to preserve r9 value.
>>>
>>> Compared to previous RFC patch [1], this patch added
>>> some conditions to enable private stack, e.g., verifier
>>> calculated stack size, prog type, etc. The new patch
>>> also added a performance test to compare private stack
>>> vs. no private stack.
>>>
>>> The following is a code example to illustrate the idea
>>> for the selftest cgroup_skb_sk_lookup:
>>>
>>> the existing code                      the private-stack approach code
>>> endbr64                                endbr64
>>> nop DWORD PTR [rax+rax*1+0x0]          nop DWORD PTR [rax+rax*1+0x0]
>>> xchg ax,ax                             xchg ax,ax
>>> push rbp                               push rbp
>>> mov rbp,rsp                            mov rbp,rsp
>>> endbr64                                endbr64
>>> sub rsp,0x68
>>> push rbx                               push rbx
>>> ...                                    ...
>>> ...                                    mov r9d,0x8c1c860
>>> ...                                    add r9,QWORD PTR gs:0x21a00
>>> ...                                    ...
>>> mov rdx,rbp                            mov rdx,r9
>>> add rdx,0xffffffffffffffb4             add rdx,0xffffffffffffffb4
>>> ...                                    ...
>>> mov ecx,0x28                           mov ecx,0x28
>>>                                        push r9
>>> call 0xffffffffe305e474                call 0xffffffffe305e524
>>>                                        pop r9
>>> mov rdi,rax                            mov rdi,rax
>>> ...                                    ...
>>> movzx rdi,BYTE PTR [rbp-0x46]          movzx rdi,BYTE PTR [r9-0x46]
>>> ...                                    ...
>>>
>> Eduard nerd-sniped me today with this a bit... :)
>>
>> I have a few questions and suggestions.
>>
>> So it seems like each *subprogram* (not the entire BPF program) gets
>> its own per-CPU private stack allocation. Is that intentional? That
>
> Currently yes. The reason is the same prog could be run on different
> cpus at the same time.
>
>> seems a bit unnecessary. It also prevents any sort of actual
>> recursion. Not sure if it's possible to write recursive BPF subprogram
>> today, verifier seems to reject obvious limited recursion cases, but
>> still, eventually we might need/want to support that, and this will be
>> just another hurdle to overcome (so it's best to avoid adding it in
>> the first place).
>>
>> I'm sure Eduard is going to try something like below and it will
>> probably break badly (I haven't tried, sorry):
>>
>> int entry(void *ctx);
>>
>> struct {
>> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>> __uint(max_entries, 1);
>> __uint(key_size, sizeof(__u32));
>> __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>> .values = {
>> [0] = (void *)&entry,
>> },
>> };
>>
>> static __noinline int subprog1(void)
>> {
>> <some state on the stack>
>>
>> /* here entry will replace subprog1, and so we'll have
>> * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>> */
>> bpf_tail_call(ctx, &prog_array_init, 0);
>>
>> return 0;
>> }
>>
>>
>> SEC("raw_tp/sys_enter")
>> int entry(void *ctx)
>> {
>> <some state on the stack>
>>
>> subprog1();
>> }
>>
>> And we effectively have limited recursion where the entry's stack
>> state is clobbered, no?
>>
>> So it seems like we need to support recursion.
>>
>>
>> So, the question I have is: why not do the following:
>> a) only set up r9 *once* in the entry program's prologue (before the tail
>> call jump target)
>> b) before each call we can adjust r9 with the current prog/subprog's
>> maximum *own* stack, something like:
>>
>> push r9;
>> r9 += 128; // 128 is subprog's stack usage
>> call <some-subprog>
>> pop r9;
>>
>> The idea being that on tail call or in subprog call we assume r9 is
>> already pointing to the right place. We can probably also figure out
>> how to avoid push/pop r9 if we make sure that subprogram always
>> restores r9 (taking tail calls into account and all that, of course)?
>>
>> Is this feasible?
>
> This is possible. I actually hacked such an idea easily. The basic
> idea is to pass the frame pointer as an additional argument to the bpf
> static sub-prog. This is a little bit complicated. It will probably
> save some stack space, but I am not sure how much.
Discussed with Andrii. I think the following approach should work.
For each non-static prog, the private stack is allocated including
that non-static prog and the called static progs. For example,
main_prog
static_prog_1
static_prog_11
global_prog
static_prog_12
static_prog_2
So in the verifier we calculate the stack size for
main_prog
static_prog_1
static_prog_11
static_prog_2
and
global_prog
static_prog_12
Let us say the stack size for main_prog is as below for each (sub)prog:
main_prog // stack size 100
static_prog_1 // stack size 100
static_prog_11 // stack size 100
static_prog_2 // stack size 100
so the worst-case call chain (main_prog -> static_prog_1 -> static_prog_11)
uses 300 bytes, so the private stack size will be 300.
So R9 is calculated like below:
main_prog
    R9 = ...        // for tailcall reachable, R9 may be original R9 + offset
                    // for non-tailcall reachable, R9 equals the original R9
                    // (based on jit-time allocation)
    ... R9 ...
    R9 += 100
    static_prog_1
        ... R9 ...
        R9 += 100
        static_prog_11
            ... R9 ...
        R9 -= 100
    R9 -= 100
    ... R9 ...
    R9 += 100
    static_prog_2
        ... R9 ...
    R9 -= 100
Similarly, we can calculate the R9 offsets for
global_prog
static_prog_12
as well.
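A small self-contained sketch of this accounting (the struct layout and
all names are hypothetical, not the verifier's actual data structures):

#define MAX_CALLEES 8

struct subprog {
        int stack_depth;                /* own stack usage in bytes */
        int n_callees;
        int callees[MAX_CALLEES];       /* indices of static callees */
        int r9_off;                     /* filled in by the walk below */
};

/* Walk the static call tree from entry idx: each (sub)prog's frame starts
 * r9_off bytes into the private stack, where r9_off is the sum of the
 * stack depths of all callers on the chain. The return value is the
 * worst-case chain, i.e. the private stack size to allocate.
 */
static int assign_r9_offsets(struct subprog *subs, int idx, int off)
{
        int i, max = off + subs[idx].stack_depth;

        subs[idx].r9_off = off;
        for (i = 0; i < subs[idx].n_callees; i++) {
                int m = assign_r9_offsets(subs, subs[idx].callees[i],
                                          off + subs[idx].stack_depth);
                if (m > max)
                        max = m;
        }
        return max;
}

With the example depths above (100 bytes each), the walk returns 300 for
main_prog, matching the worst-case chain main_prog -> static_prog_1 ->
static_prog_11.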
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 5:08 ` Yonghong Song
@ 2024-07-24 16:54 ` Alexei Starovoitov
2024-07-24 17:56 ` Yonghong Song
0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-24 16:54 UTC (permalink / raw)
To: Yonghong Song
Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Tue, Jul 23, 2024 at 10:09 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> Discussed with Andrii. I think the following approach should work.
> For each non-static prog, the private stack is allocated including
> that non-static prog and the called static progs. For example,
> main_prog
> static_prog_1
> static_prog_11
> global_prog
> static_prog_12
> static_prog_2
>
> So in the verifier we calculate the stack size for
> main_prog
> static_prog_1
> static_prog_11
> static_prog_2
> and
> global_prog
> static_prog_12
>
> Let us say the stack size for main_prog is as below for each (sub)prog:
> main_prog // stack size 100
> static_prog_1 // stack size 100
> static_prog_11 // stack size 100
> static_prog_2 // stack size 100
> so the worst-case call chain (main_prog -> static_prog_1 -> static_prog_11)
> uses 300 bytes, so the private stack size will be 300.
> So R9 is calculated like below:
> main_prog
>     R9 = ...        // for tailcall reachable, R9 may be original R9 + offset
>                     // for non-tailcall reachable, R9 equals the original R9
>                     // (based on jit-time allocation)
>     ... R9 ...
>     R9 += 100
>     static_prog_1
>         ... R9 ...
>         R9 += 100
>         static_prog_11
>             ... R9 ...
>         R9 -= 100
>     R9 -= 100
>     ... R9 ...
>     R9 += 100
>     static_prog_2
>         ... R9 ...
>     R9 -= 100
>
> Similarly, we can calculate the R9 offsets for
> global_prog
> static_prog_12
> as well.
I don't understand why we differentiate static and global subprogs.
But, mainly, the += and -= around the call are suboptimal.
Can we do it the way a normal stack does?
Each prog knows how much stack it needs,
so it can do:
r9 += stack_depth in the prologue
and all accesses are done as r9 - off.
Then to do a call nothing extra is needed.
The callee will do r9 += its own stack depth.
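As a toy C model of that convention (purely illustrative; in reality the
JIT emits this and r9 is a register, not a variable, and the epilogue
restore is my assumption so r9 behaves as callee-saved):

static unsigned long r9;        /* stands in for the private stack pointer */

static void subprog(void)
{
        r9 += 128;              /* prologue: own stack depth */
        /* ... locals accessed at r9 - 8, r9 - 16, ... */
        r9 -= 128;              /* epilogue restores r9 */
}

static void main_prog(void)
{
        r9 += 64;               /* prologue */
        subprog();              /* nothing extra around the call */
        r9 -= 64;               /* epilogue */
}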
Whether the private stack grows up or down is tbd.
The challenge is how to supply the proper r9 on entry
into the main prog. Potentially a task for the bpf trampoline,
and kprobe/tp/etc will need a special hack in bpf_dispatcher_nop_func.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 16:54 ` Alexei Starovoitov
@ 2024-07-24 17:56 ` Yonghong Song
0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2024-07-24 17:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On 7/24/24 9:54 AM, Alexei Starovoitov wrote:
> On Tue, Jul 23, 2024 at 10:09 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> Discussed with Andrii. I think the following approach should work.
>> For each non-static prog, the private stack is allocated including
>> that non-static prog and the called static progs. For example,
>> main_prog
>> static_prog_1
>> static_prog_11
>> global_prog
>> static_prog_12
>> static_prog_2
>>
>> So in the verifier we calculate the stack size for
>> main_prog
>> static_prog_1
>> static_prog_11
>> static_prog_2
>> and
>> global_prog
>> static_prog_12
>>
>> Let us say the stack size for main_prog is as below for each (sub)prog:
>> main_prog // stack size 100
>> static_prog_1 // stack size 100
>> static_prog_11 // stack size 100
>> static_prog_2 // stack size 100
>> so the worst-case call chain (main_prog -> static_prog_1 -> static_prog_11)
>> uses 300 bytes, so the private stack size will be 300.
>> So R9 is calculated like below:
>> main_prog
>>     R9 = ...        // for tailcall reachable, R9 may be original R9 + offset
>>                     // for non-tailcall reachable, R9 equals the original R9
>>                     // (based on jit-time allocation)
>>     ... R9 ...
>>     R9 += 100
>>     static_prog_1
>>         ... R9 ...
>>         R9 += 100
>>         static_prog_11
>>             ... R9 ...
>>         R9 -= 100
>>     R9 -= 100
>>     ... R9 ...
>>     R9 += 100
>>     static_prog_2
>>         ... R9 ...
>>     R9 -= 100
>>
>> Similarly, we can calculate the R9 offsets for
>> global_prog
>> static_prog_12
>> as well.
> I don't understand why we differentiate static and global subprogs.
Specially handling global subprogs is for potential BPF_PROG_TYPE_EXT
progs, which may replace a global subprog.
Therefore, for the private stack, a global subprog will terminate
stack accounting to minimize stack usage. If we treated
static/global subprogs the same, and if freplace does happen,
we might allocate a larger-than-necessary private stack.
freplace is probably not a common use case. If it does happen,
the original global subprog may be a stub func which does
not have any stack usage, and the freplace prog is the one
implementing the business logic. So from that perspective,
we do not need to differentiate static and global subprogs.
>
> But, mainly, the += and -= around the call are suboptimal.
> Can we do it the way a normal stack does?
> Each prog knows how much stack it needs,
> so it can do:
> r9 += stack_depth in the prologue
> and all accesses are done as r9 - off.
> Then to do a call nothing extra is needed.
> The callee will do r9 += its own stack depth.
I thought the += and -= at the call site were easier to understand.
But certainly, doing r9 += stack_depth and
r9 -= stack_depth inside the subprog works too.
> Whether the private stack grows up or down is tbd.
My current approach is that the private stack grows down,
similar to the normal stack. But we have the flexibility
to grow up at the frame level.
>
> The challenge is how to supply the proper r9 on entry
> into the main prog. Potentially a task for the bpf trampoline,
> and kprobe/tp/etc will need a special hack in bpf_dispatcher_nop_func.
I have an early hack for the bpf trampoline and
bpf_dispatcher_nop_func to pass the private stack pointer
as the third argument to the bpf program.
In this particular case, we can just pass the private
stack pointer in R9. I will pick this up.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-22 3:33 ` Eduard Zingerman
2024-07-22 16:54 ` Yonghong Song
2024-07-22 17:51 ` Alexei Starovoitov
@ 2024-07-24 21:28 ` Yonghong Song
2024-07-25 4:55 ` Alexei Starovoitov
2 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2024-07-24 21:28 UTC (permalink / raw)
To: Eduard Zingerman, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
On 7/21/24 8:33 PM, Eduard Zingerman wrote:
> Hi Yonghong,
>
> In general I think that the changes in this patch are logical and make sense.
> I have a suggestion regarding testing JIT related changes.
>
> We currently lack a convenient way to verify jit behaviour modulo
> runtime tests. I think we should have a capability to write tests like below:
>
> SEC("tp")
> __jited_x86("f: endbr64")
> __jited_x86("13: movabs $0x.*,%r9")
> __jited_x86("1d: add %gs:0x.*,%r9")
> __jited_x86("26: mov $0x1,%edi")
> __jited_x86("2b: mov %rdi,-0x8(%r9)")
> __jited_x86("2f: mov -0x8(%r9),%rdi")
> __jited_x86("33: xor %eax,%eax")
> __jited_x86("35: lock xchg %rax,-0x8(%r9)")
> __jited_x86("3a: lock xadd %rax,-0x8(%r9)")
> __naked void stack_access_insns(void)
> {
>     asm volatile (
>         "r1 = 1;"
>         "*(u64 *)(r10 - 8) = r1;"
>         "r1 = *(u64 *)(r10 - 8);"
>         "r0 = 0;"
>         "r0 = xchg_64(r10 - 8, r0);"
>         "r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
>         "exit;"
>     ::: __clobber_all);
> }
>
> In the following branch I explored a way to add such capability:
> https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
>
> Besides testing the exact translation, such tests also provide a good
> starting point for people trying to figure out how some jit features work.
>
> The below two commits are the gist of the feature:
> 8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
> 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
>
> For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> an alternative would be to:
> a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
> test_progs executable;
> b. call libbfd (binutils dis-assembler) directly from the bpftool.
>
> Currently bpftool can use two disassemblers: libbfd and the llvm library,
> depending on the build environment. For CI builds libbfd is used.
> I don't know if llvm and libbfd always produce the same output for
> identical binary code. Imo, if people are Ok with adding the libbfd
I tried a simple example like below.
$ cat test.c
#include <stdint.h>

typedef struct {
        unsigned char x[8];
} buf_t;

void f(buf_t *buf, uint64_t y, uint64_t z) {
        if (z > 8) z = 8;
        unsigned char *y_bytes = (unsigned char *)&y;
        for (int i = 0; i < z; ++i) {
                buf->x[i] = y_bytes[i];
        }
}
$ clang -c test.c
$ objdump -d test.o <==== GNU binutils-based objdump
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 89 7d f8 mov %rdi,-0x8(%rbp)
8: 48 89 75 f0 mov %rsi,-0x10(%rbp)
c: 48 89 55 e8 mov %rdx,-0x18(%rbp)
10: 48 83 7d e8 08 cmpq $0x8,-0x18(%rbp)
15: 76 08 jbe 1f <f+0x1f>
17: 48 c7 45 e8 08 00 00 movq $0x8,-0x18(%rbp)
1e: 00
1f: 48 8d 45 f0 lea -0x10(%rbp),%rax
23: 48 89 45 e0 mov %rax,-0x20(%rbp)
27: c7 45 dc 00 00 00 00 movl $0x0,-0x24(%rbp)
2e: 48 63 45 dc movslq -0x24(%rbp),%rax
32: 48 3b 45 e8 cmp -0x18(%rbp),%rax
36: 73 21 jae 59 <f+0x59>
38: 48 8b 45 e0 mov -0x20(%rbp),%rax
3c: 48 63 4d dc movslq -0x24(%rbp),%rcx
40: 8a 14 08 mov (%rax,%rcx,1),%dl
43: 48 8b 45 f8 mov -0x8(%rbp),%rax
47: 48 63 4d dc movslq -0x24(%rbp),%rcx
4b: 88 14 08 mov %dl,(%rax,%rcx,1)
4e: 8b 45 dc mov -0x24(%rbp),%eax
51: 83 c0 01 add $0x1,%eax
54: 89 45 dc mov %eax,-0x24(%rbp)
57: eb d5 jmp 2e <f+0x2e>
59: 5d pop %rbp
5a: c3 ret
$ llvm-objdump -d test.o <== LLVM-based objdump
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 48 89 7d f8 movq %rdi, -0x8(%rbp)
8: 48 89 75 f0 movq %rsi, -0x10(%rbp)
c: 48 89 55 e8 movq %rdx, -0x18(%rbp)
10: 48 83 7d e8 08 cmpq $0x8, -0x18(%rbp)
15: 76 08 jbe 0x1f <f+0x1f>
17: 48 c7 45 e8 08 00 00 00 movq $0x8, -0x18(%rbp)
1f: 48 8d 45 f0 leaq -0x10(%rbp), %rax
23: 48 89 45 e0 movq %rax, -0x20(%rbp)
27: c7 45 dc 00 00 00 00 movl $0x0, -0x24(%rbp)
2e: 48 63 45 dc movslq -0x24(%rbp), %rax
32: 48 3b 45 e8 cmpq -0x18(%rbp), %rax
36: 73 21 jae 0x59 <f+0x59>
38: 48 8b 45 e0 movq -0x20(%rbp), %rax
3c: 48 63 4d dc movslq -0x24(%rbp), %rcx
40: 8a 14 08 movb (%rax,%rcx), %dl
43: 48 8b 45 f8 movq -0x8(%rbp), %rax
47: 48 63 4d dc movslq -0x24(%rbp), %rcx
4b: 88 14 08 movb %dl, (%rax,%rcx)
4e: 8b 45 dc movl -0x24(%rbp), %eax
51: 83 c0 01 addl $0x1, %eax
54: 89 45 dc movl %eax, -0x24(%rbp)
57: eb d5 jmp 0x2e <f+0x2e>
59: 5d popq %rbp
5a: c3 retq
There are some differences in constant representation, e.g., for jump
target hex numbers, binutils objdump does not emit the '0x' prefix while
llvm-objdump does. The insn at 4b is also rendered differently.
But overall the differences are small.
> dependency to test_progs, option (b) is the best. If folks on the
> mailing list agree with this, I can work on updating the patches.
>
> -------------
>
> Aside from testing, I agree with Andrii regarding rbp usage;
> it seems like it should be possible to do the following in the prologue:
>
> movabs $0x...,%rsp
> add %gs:0x...,%rsp
> push %rbp
>
> and there would be no need to modify the translation for instructions
> accessing r10; plus, debugger stack unwinding logic should still work.
> Or am I mistaken?
>
> Thanks,
> Eduard
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-24 21:28 ` Yonghong Song
@ 2024-07-25 4:55 ` Alexei Starovoitov
2024-07-25 17:20 ` Eduard Zingerman
0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2024-07-25 4:55 UTC (permalink / raw)
To: Yonghong Song
Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Wed, Jul 24, 2024 at 2:28 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> > In the following branch I explored a way to add such capability:
> > https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
...
> > 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
...
> There are some differences like constant representation, e.g. jump offset hex number,
> gcc does not have '0x' prefix while clang has. Insn at 4b is also difference.
> But overall the difference is smaller.
There is certainly a difference between the disasm libs that will cause
mismatches with the expected text.
Also, not everyone has disasm enabled in bpftool.
In my setup:
$ bpftool prog dump jited id 1
Error: No JIT disassembly support
So we could enable such a feature in selftests,
but it would have to skip the tests if bpftool is not built
with the right disasm library, hence the value of such
tests would be small.
It's probably better to make test_progs use
LLVMDisasm* directly and converge on that disasm style,
assuming distros make this lib easily available.
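For reference, a minimal sketch of what that could look like, driving
the LLVM-C disassembler API directly (assuming the llvm-c headers and
libraries are installed; build with something like
cc disasm.c $(llvm-config --cflags --libs)):

#include <stdio.h>
#include <stdint.h>
#include <llvm-c/Disassembler.h>
#include <llvm-c/Target.h>

/* Disassemble a raw x86-64 buffer, printing one insn per line. */
static void disasm_buf(uint8_t *buf, size_t len, uint64_t pc)
{
	LLVMDisasmContextRef ctx;
	char insn[128];
	size_t off = 0, cnt;

	LLVMInitializeAllTargetInfos();
	LLVMInitializeAllTargetMCs();
	LLVMInitializeAllDisassemblers();

	ctx = LLVMCreateDisasm("x86_64-unknown-linux-gnu",
			       NULL, 0, NULL, NULL);
	if (!ctx)
		return;
	while (off < len) {
		cnt = LLVMDisasmInstruction(ctx, buf + off, len - off,
					    pc + off, insn, sizeof(insn));
		if (!cnt)       /* undecodable byte, bail out */
			break;
		printf("%4llx:%s\n", (unsigned long long)(pc + off), insn);
		off += cnt;
	}
	LLVMDisasmDispose(ctx);
}

int main(void)
{
	/* push %rbp; mov %rsp,%rbp; pop %rbp; ret */
	uint8_t code[] = { 0x55, 0x48, 0x89, 0xe5, 0x5d, 0xc3 };

	disasm_buf(code, sizeof(code), 0);
	return 0;
}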
* Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs
2024-07-25 4:55 ` Alexei Starovoitov
@ 2024-07-25 17:20 ` Eduard Zingerman
0 siblings, 0 replies; 33+ messages in thread
From: Eduard Zingerman @ 2024-07-25 17:20 UTC (permalink / raw)
To: Alexei Starovoitov, Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On Wed, 2024-07-24 at 21:55 -0700, Alexei Starovoitov wrote:
[...]
> So we can enable such feature in selftests,
> but it would have to skip the tests if bpftool is not built
> with the right disasm library, hence the value of such
> tests will be small.
>
> It's probably better to make test_progs use
> LLVMDisasm* directly and converge on that disasm style
> assuming distros have this lib easily available.
I agree that the differences in the disassembly are too big.
As Yonghong suggested, I checked why bpftool has two disassemblers;
this is explained in commit [0]:
> ... To disassemble instructions for JIT-ed programs, bpftool has
> relied on the libbfd library. This has been problematic in the past:
> libbfd's interface is not meant to be stable and has changed several
> times ...
I'll update the disassembly patch to use the LLVM library
(or skip the test if the library is not available).
[0] eb9d1acf634b ("bpftool: Add LLVM as default library for disassembling JIT-ed programs")
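The skip behaviour could follow the usual test_progs pattern; a
standalone sketch (test__skip() is stubbed here so it compiles on its
own, and HAVE_LLVM_DISASM is an assumed build-time probe, not an
existing macro):

#include <stdio.h>

/* Stub for the existing test_progs skip mechanism. */
static void test__skip(void) { printf("test skipped\n"); }

void test_jited_disasm(void)
{
#ifndef HAVE_LLVM_DISASM        /* assumed build-time probe for llvm-c */
	test__skip();
	return;
#endif
	/* ... load prog, disassemble jited code, match __jited_x86 ... */
}

int main(void)
{
	test_jited_disasm();
	return 0;
}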