* [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs
@ 2024-06-10 5:18 Yonghong Song
2024-06-14 0:30 ` Alexei Starovoitov
0 siblings, 1 reply; 5+ messages in thread
From: Yonghong Song @ 2024-06-10 5:18 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau
The main motivation for shadow stack comes from nested
scheduler in sched-ext from Tejun. The basic idea is that
- each cgroup will its own associated bpf program,
- bpf program with parent cgroup will call bpf programs
in immediate child cgroups.
Let us say we have the following cgroup hierarchy:
root_cg (prog0):
cg1 (prog1):
cg11 (prog11):
cg111 (prog111)
cg112 (prog112)
cg12 (prog12):
cg121 (prog121)
cg122 (prog122)
cg2 (prog2):
cg21 (prog21)
cg22 (prog22)
cg23 (prog23)
In the above example, prog0 will call a kfunc which will
call prog1 and prog2 to get sched info for cg1 and cg2 and
then the information is summarized and sent back to prog0.
Similarly, prog11 and prog12 will be invoked in the kfunc
and the result will be summarized and sent back to prog1, etc.
Currently, for each thread, the x86 kernel allocate 8KB stack.
The each bpf program (including its subprograms) has maximum
512B stack size to avoid potential stack overflow.
And nested bpf programs increase the risk of stack overflow.
To avoid potential stack overflow caused by bpf programs,
this patch implemented a shadow stack so bpf program stack
space is allocated dynamically when the program is jited.
But more than one instance of the same bpf program may
run in the system. To make things simple, percpu shadow
stack is allocated for each program, so if the same program
is running on different cpus concurrently, we won't have
any issue. Note that the kernel already have logic to prevent
the recursion for the same bpf program on the same cpu
(kprobe, fentry, etc.).
The patch implemented a percpu shadow stack based approach
for x86 arch.
- The stack size will be 0.
- In the beginning of jit, r9 is used to save percpu
shadow stack pointer.
- Each rbp in the bpf asm insn is replaced by r9.
- For each call, push r9 before the call and pop r9
after the call to preserve r9 value.
The following are some code example to illustrate the idea
for selftest cgroup_skb_sk_lookup:
the existing code the shadow-stack approach code
endbr64 endbr64
nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0]
xchg ax,ax xchg ax,ax
push rbp push rbp
mov rbp,rsp mov rbp,rsp
endbr64 endbr64
sub rsp,0x68
push rbx push rbx
... ...
... mov r9d,0x8c1c860
... add r9,QWORD PTR gs:0x21a00
... ...
mov rdx,rbp mov rdx, r9
add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4
... ...
mov ecx,0x28 mov ecx,0x28
push r9
call 0xffffffffe305e474 call 0xffffffffe305e524
pop r9
mov rdi,rax mov rdi,rax
... ...
movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46]
... ...
So the number of insns is increased by 1 + num_of_calls * 2.
There are an alternative approach is to do
mov r9d,0x8c1c860
add r9,QWORD PTR gs:0x21a00
right before each rbp usage where the rbp is replaced with r9.
The number of insns is increased by num_of_rbp_usage * 2.
The current implementation is preferred since for a bpf prog
using a lot of stack space, the number of calls is mostly likely
much smaller than stack access.
This simple implementation passed all selftests. I marked it as
RFC since several issues need to be resolved.
- The tradeoff between simplicity/more_memory/reasonable_performance vs.
complex/less_memory/worse_performance.
The percpu shadow stack makes implementation much simpler.
My previous approach (as discussed in lsfmmbpf2024) used
a more complex way trying to avoid excess shadow stack memory by using
a few percpu pages for non-sleepable programs and runtime allocation
for sleepable programs. This can save some memory but will make jit
more complex and also degrade performance quite a bit esp. for sleepable
programs.
- Should we introduce a flag during program load to indicate whether
the program should use shadow stack or not? There are a couple of cases
here:
- for xdp prog, performance is hugely critical. the current
implement may still degreate performance slightly.
- for a system with huge number of cpus, e.g., 256, 1024,
and only one instance of the bpf program is running. It could
be quite some memory waste, esp. if there are quite some bpf
progs are running on the system.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 61 ++++++++++++++++++++++++++++++++++---
include/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
3 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5159c7a22922..1af62ade0ceb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -19,6 +19,8 @@
#include <asm/unwind.h>
#include <asm/cfi.h>
+static const int global_enable_shadow_stack = 1;
+
static bool all_callee_regs_used[4] = {true, true, true, true};
static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
@@ -1311,6 +1313,21 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
+{
+ u8 *prog = *pprog;
+
+ /* movabs r9, shadow_frame_ptr */
+ emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
+
+ /* add <r9>, gs:[<off>] */
+ EMIT2(0x65, 0x4c);
+ EMIT3(0x03, 0x0c, 0x25);
+ EMIT((u32)(unsigned long)&this_cpu_off, 4);
+
+ *pprog = prog;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
@@ -1326,13 +1343,31 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int insn_cnt = bpf_prog->len;
bool tail_call_seen = false;
bool seen_exit = false;
+ void *percpu_shadow_stack_ptr, *shadow_frame_ptr = NULL;
+ bool enable_shadow_stack = global_enable_shadow_stack;
u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ u32 stack_depth = bpf_prog->aux->stack_depth;
u64 arena_vm_start, user_vm_start;
int i, excnt = 0;
int ilen, proglen = 0;
u8 *prog = temp;
int err;
+ if (stack_depth && enable_shadow_stack) {
+ if (bpf_prog->percpu_shadow_stack_ptr) {
+ percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
+ } else {
+ percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
+ if (!percpu_shadow_stack_ptr)
+ return -ENOMEM;
+ bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
+ }
+ shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
+ stack_depth = 0;
+ } else {
+ enable_shadow_stack = 0;
+ }
+
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
@@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
/* tail call's presence in current prog implies it is reachable */
tail_call_reachable |= tail_call_seen;
- emit_prologue(&prog, bpf_prog->aux->stack_depth,
+ emit_prologue(&prog, stack_depth,
bpf_prog_was_classic(bpf_prog), tail_call_reachable,
bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
/* Exception callback will clobber callee regs for its own use, and
@@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
emit_mov_imm64(&prog, X86_REG_R12,
arena_vm_start >> 32, (u32) arena_vm_start);
+ if (enable_shadow_stack)
+ emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
+
ilen = prog - temp;
if (rw_image)
memcpy(rw_image + proglen, temp, ilen);
@@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 *func;
int nops;
+ if (enable_shadow_stack) {
+ if (src_reg == BPF_REG_FP)
+ src_reg = X86_REG_R9;
+
+ if (dst_reg == BPF_REG_FP)
+ dst_reg = X86_REG_R9;
+ }
+
switch (insn->code) {
/* ALU */
case BPF_ALU | BPF_ADD | BPF_X:
@@ -2014,6 +2060,7 @@ st: if (is_imm8(insn->off))
emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
/* Restore R0 after clobbering RAX */
emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
+
break;
}
@@ -2038,14 +2085,20 @@ st: if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
if (tail_call_reachable) {
- RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
+ RESTORE_TAIL_CALL_CNT(stack_depth);
ip += 7;
}
if (!imm32)
return -EINVAL;
+ if (enable_shadow_stack) {
+ EMIT2(0x41, 0x51);
+ ip += 2;
+ }
ip += x86_call_depth_emit_accounting(&prog, func, ip);
if (emit_call(&prog, func, ip))
return -EINVAL;
+ if (enable_shadow_stack)
+ EMIT2(0x41, 0x59);
break;
}
@@ -2055,13 +2108,13 @@ st: if (is_imm8(insn->off))
&bpf_prog->aux->poke_tab[imm32 - 1],
&prog, image + addrs[i - 1],
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
ctx);
else
emit_bpf_tail_call_indirect(bpf_prog,
&prog,
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
image + addrs[i - 1],
ctx);
break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a834f4b761bc..08014b4954f0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1563,6 +1563,7 @@ struct bpf_prog {
const struct bpf_insn *insn);
struct bpf_prog_aux *aux; /* Auxiliary fields */
struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ void *percpu_shadow_stack_ptr;
/* Instructions for interpreter */
union {
DECLARE_FLEX_ARRAY(struct sock_filter, insns);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5070fa20d05c..7b9093cc3671 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2244,6 +2244,7 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
kvfree(aux->func_info);
kfree(aux->func_info_aux);
+ free_percpu(aux->prog->percpu_shadow_stack_ptr);
free_uid(aux->user);
security_bpf_prog_free(aux->prog);
bpf_prog_free(aux->prog);
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs
2024-06-10 5:18 [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs Yonghong Song
@ 2024-06-14 0:30 ` Alexei Starovoitov
2024-06-16 5:52 ` Yonghong Song
0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2024-06-14 0:30 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On Sun, Jun 9, 2024 at 10:18 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
I think "shadow stack" already has at least two different meanings
in the kernel.
Let's avoid adding 3rd.
How about "divided stack" ?
> +static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
> +{
> + u8 *prog = *pprog;
> +
> + /* movabs r9, shadow_frame_ptr */
> + emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
> +
> + /* add <r9>, gs:[<off>] */
> + EMIT2(0x65, 0x4c);
> + EMIT3(0x03, 0x0c, 0x25);
> + EMIT((u32)(unsigned long)&this_cpu_off, 4);
I think this can be one insn:
lea r9, gs:[(u32)shadow_frame_ptr]
> + if (stack_depth && enable_shadow_stack) {
I think enabling it for progs with small stack usage
is unnecessary.
The definition of "small" is complicated.
I feel stack_depth <= 64 can stay as-is and
all networking progs don't have to use it either,
since they're called from known places.
While tracing progs can be anywhere, so I'd enable
divided stack for
stack_depth > 64 && prog_type == kprobe, tp, raw_tp, tracing, perf_event.
> + if (bpf_prog->percpu_shadow_stack_ptr) {
> + percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
> + } else {
> + percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
> + if (!percpu_shadow_stack_ptr)
> + return -ENOMEM;
> + bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
> + }
> + shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
> + stack_depth = 0;
> + } else {
> + enable_shadow_stack = 0;
> + }
> +
> arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
> user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
>
> @@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> /* tail call's presence in current prog implies it is reachable */
> tail_call_reachable |= tail_call_seen;
>
> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
> + emit_prologue(&prog, stack_depth,
> bpf_prog_was_classic(bpf_prog), tail_call_reachable,
> bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
> /* Exception callback will clobber callee regs for its own use, and
> @@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> emit_mov_imm64(&prog, X86_REG_R12,
> arena_vm_start >> 32, (u32) arena_vm_start);
>
> + if (enable_shadow_stack)
> + emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
> +
> ilen = prog - temp;
> if (rw_image)
> memcpy(rw_image + proglen, temp, ilen);
> @@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> u8 *func;
> int nops;
>
> + if (enable_shadow_stack) {
> + if (src_reg == BPF_REG_FP)
> + src_reg = X86_REG_R9;
> +
> + if (dst_reg == BPF_REG_FP)
> + dst_reg = X86_REG_R9;
the verifier will reject a prog that attempts to write into R10.
So the above shouldn't be necessary.
> + }
> +
> switch (insn->code) {
> /* ALU */
> case BPF_ALU | BPF_ADD | BPF_X:
> @@ -2014,6 +2060,7 @@ st: if (is_imm8(insn->off))
> emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
> /* Restore R0 after clobbering RAX */
> emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
> +
> break;
> }
>
> @@ -2038,14 +2085,20 @@ st: if (is_imm8(insn->off))
>
> func = (u8 *) __bpf_call_base + imm32;
> if (tail_call_reachable) {
> - RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
> + RESTORE_TAIL_CALL_CNT(stack_depth);
> ip += 7;
> }
> if (!imm32)
> return -EINVAL;
> + if (enable_shadow_stack) {
> + EMIT2(0x41, 0x51);
> + ip += 2;
> + }
> ip += x86_call_depth_emit_accounting(&prog, func, ip);
> if (emit_call(&prog, func, ip))
> return -EINVAL;
> + if (enable_shadow_stack)
> + EMIT2(0x41, 0x59);
push/pop around calls are load/store plus math on %rsp.
I think it's cheaper to reload r9 after the call with
a single insn.
The reload of r9 is effectively gs+const.
There is no memory access. So it should be faster.
Technically we can replace all uses of R10==rbp with
'gs:' based instructions.
Like:
r1 = r10
can be jitted into
lea rdi, gs + (u32)shadow_frame_ptr
and r0 = *(u32 *)(r10 - 64)
can be jitted into:
mov rax, dword ptr gs:[(u32)shadow_frame_ptr - 64]
but that is probably a bunch of jit changes.
So I'd start with a simple reload of r9 after each call.
We need to micro-benchmark it to make sure there is no perf overhead.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs
2024-06-14 0:30 ` Alexei Starovoitov
@ 2024-06-16 5:52 ` Yonghong Song
2024-06-17 23:19 ` Alexei Starovoitov
0 siblings, 1 reply; 5+ messages in thread
From: Yonghong Song @ 2024-06-16 5:52 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 6/13/24 5:30 PM, Alexei Starovoitov wrote:
> On Sun, Jun 9, 2024 at 10:18 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> I think "shadow stack" already has at least two different meanings
> in the kernel.
> Let's avoid adding 3rd.
> How about "divided stack" ?
Naming is hard. Maybe "private stack" which suggests the stack is private
to that program?
>
>> +static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
>> +{
>> + u8 *prog = *pprog;
>> +
>> + /* movabs r9, shadow_frame_ptr */
>> + emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
>> +
>> + /* add <r9>, gs:[<off>] */
>> + EMIT2(0x65, 0x4c);
>> + EMIT3(0x03, 0x0c, 0x25);
>> + EMIT((u32)(unsigned long)&this_cpu_off, 4);
> I think this can be one insn:
> lea r9, gs:[(u32)shadow_frame_ptr]
Apparently, __alloc_percpu_gfp() may return a pointer which is beyond 32bit. That is why my
RFC patch failed CI. I later tried to use
+ /* movabs r9, shadow_frame_ptr */
+ emit_mov_imm64(&prog, X86_REG_R9, (long) shadow_frame_ptr >> 32,
+ (u32) (long) shadow_frame_ptr);
and CI is successful. I did some on-demand test (https://github.com/kernel-patches/bpf/pull/7179)
and it succeeded with CI.
If __alloc_percpu_gfp() returns a pointer beyond 32bit, I am not sure
whether we could get r9 with a single insn.
>
>> + if (stack_depth && enable_shadow_stack) {
> I think enabling it for progs with small stack usage
> is unnecessary.
> The definition of "small" is complicated.
> I feel stack_depth <= 64 can stay as-is and
> all networking progs don't have to use it either,
> since they're called from known places.
> While tracing progs can be anywhere, so I'd enable
> divided stack for
> stack_depth > 64 && prog_type == kprobe, tp, raw_tp, tracing, perf_event.
This does make sense. It partially aligns what I think for prog type
side. We only need to enable 'divided stack' for certain prog types.
>
>> + if (bpf_prog->percpu_shadow_stack_ptr) {
>> + percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
>> + } else {
>> + percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
>> + if (!percpu_shadow_stack_ptr)
>> + return -ENOMEM;
>> + bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
>> + }
>> + shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
>> + stack_depth = 0;
>> + } else {
>> + enable_shadow_stack = 0;
>> + }
>> +
>> arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
>> user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
>>
>> @@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>> /* tail call's presence in current prog implies it is reachable */
>> tail_call_reachable |= tail_call_seen;
>>
>> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
>> + emit_prologue(&prog, stack_depth,
>> bpf_prog_was_classic(bpf_prog), tail_call_reachable,
>> bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
>> /* Exception callback will clobber callee regs for its own use, and
>> @@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>> emit_mov_imm64(&prog, X86_REG_R12,
>> arena_vm_start >> 32, (u32) arena_vm_start);
>>
>> + if (enable_shadow_stack)
>> + emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
>> +
>> ilen = prog - temp;
>> if (rw_image)
>> memcpy(rw_image + proglen, temp, ilen);
>> @@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>> u8 *func;
>> int nops;
>>
>> + if (enable_shadow_stack) {
>> + if (src_reg == BPF_REG_FP)
>> + src_reg = X86_REG_R9;
>> +
>> + if (dst_reg == BPF_REG_FP)
>> + dst_reg = X86_REG_R9;
> the verifier will reject a prog that attempts to write into R10.
> So the above shouldn't be necessary.
Actually there is at least one exception, e.g.,
if r10 > r5 goto +5
where dst is r10 and src r5.
For some insn where dst is intended to write with r10
like r10 = 10, and verifier will reject the program before
jit, as you mentioned in the above.
>
>> + }
>> +
>> switch (insn->code) {
>> /* ALU */
>> case BPF_ALU | BPF_ADD | BPF_X:
>> @@ -2014,6 +2060,7 @@ st: if (is_imm8(insn->off))
>> emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
>> /* Restore R0 after clobbering RAX */
>> emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
>> +
>> break;
>> }
>>
>> @@ -2038,14 +2085,20 @@ st: if (is_imm8(insn->off))
>>
>> func = (u8 *) __bpf_call_base + imm32;
>> if (tail_call_reachable) {
>> - RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
>> + RESTORE_TAIL_CALL_CNT(stack_depth);
>> ip += 7;
>> }
>> if (!imm32)
>> return -EINVAL;
>> + if (enable_shadow_stack) {
>> + EMIT2(0x41, 0x51);
>> + ip += 2;
>> + }
>> ip += x86_call_depth_emit_accounting(&prog, func, ip);
>> if (emit_call(&prog, func, ip))
>> return -EINVAL;
>> + if (enable_shadow_stack)
>> + EMIT2(0x41, 0x59);
> push/pop around calls are load/store plus math on %rsp.
> I think it's cheaper to reload r9 after the call with
> a single insn.
> The reload of r9 is effectively gs+const.
> There is no memory access. So it should be faster.
Two insn may be necessary since __alloc_percpu_gfp()
may return a pointer beyond 32 bits.
>
> Technically we can replace all uses of R10==rbp with
> 'gs:' based instructions.
> Like:
> r1 = r10
> can be jitted into
> lea rdi, gs + (u32)shadow_frame_ptr
>
> and r0 = *(u32 *)(r10 - 64)
> can be jitted into:
> mov rax, dword ptr gs:[(u32)shadow_frame_ptr - 64]
>
> but that is probably a bunch of jit changes.
> So I'd start with a simple reload of r9 after each call.
This is a good idea. We might need this so we only have
one extra insn per call.
>
> We need to micro-benchmark it to make sure there is no perf overhead.
Sure. Will do!
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs
2024-06-16 5:52 ` Yonghong Song
@ 2024-06-17 23:19 ` Alexei Starovoitov
2024-06-18 15:20 ` Yonghong Song
0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2024-06-17 23:19 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On Sat, Jun 15, 2024 at 10:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 6/13/24 5:30 PM, Alexei Starovoitov wrote:
> > On Sun, Jun 9, 2024 at 10:18 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > I think "shadow stack" already has at least two different meanings
> > in the kernel.
> > Let's avoid adding 3rd.
> > How about "divided stack" ?
>
> Naming is hard. Maybe "private stack" which suggests the stack is private
> to that program?
I like it. "private stack" fits the best.
> >
> >> +static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
> >> +{
> >> + u8 *prog = *pprog;
> >> +
> >> + /* movabs r9, shadow_frame_ptr */
> >> + emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
> >> +
> >> + /* add <r9>, gs:[<off>] */
> >> + EMIT2(0x65, 0x4c);
> >> + EMIT3(0x03, 0x0c, 0x25);
> >> + EMIT((u32)(unsigned long)&this_cpu_off, 4);
> > I think this can be one insn:
> > lea r9, gs:[(u32)shadow_frame_ptr]
>
> Apparently, __alloc_percpu_gfp() may return a pointer which is beyond 32bit. That is why my
> RFC patch failed CI. I later tried to use
>
> + /* movabs r9, shadow_frame_ptr */
> + emit_mov_imm64(&prog, X86_REG_R9, (long) shadow_frame_ptr >> 32,
> + (u32) (long) shadow_frame_ptr);
>
> and CI is successful. I did some on-demand test (https://github.com/kernel-patches/bpf/pull/7179)
> and it succeeded with CI.
>
> If __alloc_percpu_gfp() returns a pointer beyond 32bit, I am not sure
> whether we could get r9 with a single insn.
I see. Ok. Let's keep two insns sequence.
>
> >
> >> + if (stack_depth && enable_shadow_stack) {
> > I think enabling it for progs with small stack usage
> > is unnecessary.
> > The definition of "small" is complicated.
> > I feel stack_depth <= 64 can stay as-is and
> > all networking progs don't have to use it either,
> > since they're called from known places.
> > While tracing progs can be anywhere, so I'd enable
> > divided stack for
> > stack_depth > 64 && prog_type == kprobe, tp, raw_tp, tracing, perf_event.
>
> This does make sense. It partially aligns what I think for prog type
> side. We only need to enable 'divided stack' for certain prog types.
>
> >
> >> + if (bpf_prog->percpu_shadow_stack_ptr) {
> >> + percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
> >> + } else {
> >> + percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
> >> + if (!percpu_shadow_stack_ptr)
> >> + return -ENOMEM;
> >> + bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
> >> + }
> >> + shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
> >> + stack_depth = 0;
> >> + } else {
> >> + enable_shadow_stack = 0;
> >> + }
> >> +
> >> arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
> >> user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
> >>
> >> @@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >> /* tail call's presence in current prog implies it is reachable */
> >> tail_call_reachable |= tail_call_seen;
> >>
> >> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
> >> + emit_prologue(&prog, stack_depth,
> >> bpf_prog_was_classic(bpf_prog), tail_call_reachable,
> >> bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
> >> /* Exception callback will clobber callee regs for its own use, and
> >> @@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >> emit_mov_imm64(&prog, X86_REG_R12,
> >> arena_vm_start >> 32, (u32) arena_vm_start);
> >>
> >> + if (enable_shadow_stack)
> >> + emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
> >> +
> >> ilen = prog - temp;
> >> if (rw_image)
> >> memcpy(rw_image + proglen, temp, ilen);
> >> @@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >> u8 *func;
> >> int nops;
> >>
> >> + if (enable_shadow_stack) {
> >> + if (src_reg == BPF_REG_FP)
> >> + src_reg = X86_REG_R9;
> >> +
> >> + if (dst_reg == BPF_REG_FP)
> >> + dst_reg = X86_REG_R9;
> > the verifier will reject a prog that attempts to write into R10.
> > So the above shouldn't be necessary.
>
> Actually there is at least one exception, e.g.,
> if r10 > r5 goto +5
> where dst is r10 and src r5.
Good point. We even have such a selftest to make sure it's rejected in unpriv.
SEC("socket")
__description("unpriv: cmp of frame pointer")
__success __failure_unpriv __msg_unpriv("R10 pointer comparison")
__retval(0)
__naked void unpriv_cmp_of_frame_pointer(void)
{
asm volatile (" \
if r10 == 0 goto l0_%=; \
> >
> >> + }
> >> +
> >> switch (insn->code) {
> >> /* ALU */
> >> case BPF_ALU | BPF_ADD | BPF_X:
> >> @@ -2014,6 +2060,7 @@ st: if (is_imm8(insn->off))
> >> emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
> >> /* Restore R0 after clobbering RAX */
> >> emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
> >> +
> >> break;
> >> }
> >>
> >> @@ -2038,14 +2085,20 @@ st: if (is_imm8(insn->off))
> >>
> >> func = (u8 *) __bpf_call_base + imm32;
> >> if (tail_call_reachable) {
> >> - RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
> >> + RESTORE_TAIL_CALL_CNT(stack_depth);
> >> ip += 7;
> >> }
> >> if (!imm32)
> >> return -EINVAL;
> >> + if (enable_shadow_stack) {
> >> + EMIT2(0x41, 0x51);
> >> + ip += 2;
> >> + }
> >> ip += x86_call_depth_emit_accounting(&prog, func, ip);
> >> if (emit_call(&prog, func, ip))
> >> return -EINVAL;
> >> + if (enable_shadow_stack)
> >> + EMIT2(0x41, 0x59);
> > push/pop around calls are load/store plus math on %rsp.
> > I think it's cheaper to reload r9 after the call with
> > a single insn.
> > The reload of r9 is effectively gs+const.
> > There is no memory access. So it should be faster.
>
> Two insn may be necessary since __alloc_percpu_gfp()
> may return a pointer beyond 32 bits.
>
> >
> > Technically we can replace all uses of R10==rbp with
> > 'gs:' based instructions.
> > Like:
> > r1 = r10
> > can be jitted into
> > lea rdi, gs + (u32)shadow_frame_ptr
> >
> > and r0 = *(u32 *)(r10 - 64)
> > can be jitted into:
> > mov rax, dword ptr gs:[(u32)shadow_frame_ptr - 64]
> >
> > but that is probably a bunch of jit changes.
> > So I'd start with a simple reload of r9 after each call.
>
> This is a good idea. We might need this so we only have
> one extra insn per call.
Since reload of r9 is a two insn sequence of 64-bit mov immediate,
and load from gs:this_cpu_off, I suspect, push/pop r9
might be faster. So I'd stick to what you have already.
Interesting though that static per-cpu vars have 32-bit pointers,
but dynamic per-cpu alloc returns full 64-bit? hmm.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs
2024-06-17 23:19 ` Alexei Starovoitov
@ 2024-06-18 15:20 ` Yonghong Song
0 siblings, 0 replies; 5+ messages in thread
From: Yonghong Song @ 2024-06-18 15:20 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau
On 6/17/24 4:19 PM, Alexei Starovoitov wrote:
> On Sat, Jun 15, 2024 at 10:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 6/13/24 5:30 PM, Alexei Starovoitov wrote:
>>> On Sun, Jun 9, 2024 at 10:18 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> I think "shadow stack" already has at least two different meanings
>>> in the kernel.
>>> Let's avoid adding 3rd.
>>> How about "divided stack" ?
>> Naming is hard. Maybe "private stack" which suggests the stack is private
>> to that program?
> I like it. "private stack" fits the best.
>
>>>> +static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
>>>> +{
>>>> + u8 *prog = *pprog;
>>>> +
>>>> + /* movabs r9, shadow_frame_ptr */
>>>> + emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
>>>> +
>>>> + /* add <r9>, gs:[<off>] */
>>>> + EMIT2(0x65, 0x4c);
>>>> + EMIT3(0x03, 0x0c, 0x25);
>>>> + EMIT((u32)(unsigned long)&this_cpu_off, 4);
>>> I think this can be one insn:
>>> lea r9, gs:[(u32)shadow_frame_ptr]
>> Apparently, __alloc_percpu_gfp() may return a pointer which is beyond 32bit. That is why my
>> RFC patch failed CI. I later tried to use
>>
>> + /* movabs r9, shadow_frame_ptr */
>> + emit_mov_imm64(&prog, X86_REG_R9, (long) shadow_frame_ptr >> 32,
>> + (u32) (long) shadow_frame_ptr);
>>
>> and CI is successful. I did some on-demand test (https://github.com/kernel-patches/bpf/pull/7179)
>> and it succeeded with CI.
>>
>> If __alloc_percpu_gfp() returns a pointer beyond 32bit, I am not sure
>> whether we could get r9 with a single insn.
> I see. Ok. Let's keep two insns sequence.
>
>>>> + if (stack_depth && enable_shadow_stack) {
>>> I think enabling it for progs with small stack usage
>>> is unnecessary.
>>> The definition of "small" is complicated.
>>> I feel stack_depth <= 64 can stay as-is and
>>> all networking progs don't have to use it either,
>>> since they're called from known places.
>>> While tracing progs can be anywhere, so I'd enable
>>> divided stack for
>>> stack_depth > 64 && prog_type == kprobe, tp, raw_tp, tracing, perf_event.
>> This does make sense. It partially aligns what I think for prog type
>> side. We only need to enable 'divided stack' for certain prog types.
>>
>>>> + if (bpf_prog->percpu_shadow_stack_ptr) {
>>>> + percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
>>>> + } else {
>>>> + percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
>>>> + if (!percpu_shadow_stack_ptr)
>>>> + return -ENOMEM;
>>>> + bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
>>>> + }
>>>> + shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
>>>> + stack_depth = 0;
>>>> + } else {
>>>> + enable_shadow_stack = 0;
>>>> + }
>>>> +
>>>> arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
>>>> user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
>>>>
>>>> @@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>>> /* tail call's presence in current prog implies it is reachable */
>>>> tail_call_reachable |= tail_call_seen;
>>>>
>>>> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
>>>> + emit_prologue(&prog, stack_depth,
>>>> bpf_prog_was_classic(bpf_prog), tail_call_reachable,
>>>> bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
>>>> /* Exception callback will clobber callee regs for its own use, and
>>>> @@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>>> emit_mov_imm64(&prog, X86_REG_R12,
>>>> arena_vm_start >> 32, (u32) arena_vm_start);
>>>>
>>>> + if (enable_shadow_stack)
>>>> + emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
>>>> +
>>>> ilen = prog - temp;
>>>> if (rw_image)
>>>> memcpy(rw_image + proglen, temp, ilen);
>>>> @@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>>> u8 *func;
>>>> int nops;
>>>>
>>>> + if (enable_shadow_stack) {
>>>> + if (src_reg == BPF_REG_FP)
>>>> + src_reg = X86_REG_R9;
>>>> +
>>>> + if (dst_reg == BPF_REG_FP)
>>>> + dst_reg = X86_REG_R9;
>>> the verifier will reject a prog that attempts to write into R10.
>>> So the above shouldn't be necessary.
>> Actually there is at least one exception, e.g.,
>> if r10 > r5 goto +5
>> where dst is r10 and src r5.
> Good point. We even have such a selftest to make sure it's rejected in unpriv.
>
> SEC("socket")
> __description("unpriv: cmp of frame pointer")
> __success __failure_unpriv __msg_unpriv("R10 pointer comparison")
> __retval(0)
> __naked void unpriv_cmp_of_frame_pointer(void)
> {
> asm volatile (" \
> if r10 == 0 goto l0_%=; \
>
>>>> + }
>>>> +
>>>> switch (insn->code) {
>>>> /* ALU */
>>>> case BPF_ALU | BPF_ADD | BPF_X:
>>>> @@ -2014,6 +2060,7 @@ st: if (is_imm8(insn->off))
>>>> emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
>>>> /* Restore R0 after clobbering RAX */
>>>> emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
>>>> +
>>>> break;
>>>> }
>>>>
>>>> @@ -2038,14 +2085,20 @@ st: if (is_imm8(insn->off))
>>>>
>>>> func = (u8 *) __bpf_call_base + imm32;
>>>> if (tail_call_reachable) {
>>>> - RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
>>>> + RESTORE_TAIL_CALL_CNT(stack_depth);
>>>> ip += 7;
>>>> }
>>>> if (!imm32)
>>>> return -EINVAL;
>>>> + if (enable_shadow_stack) {
>>>> + EMIT2(0x41, 0x51);
>>>> + ip += 2;
>>>> + }
>>>> ip += x86_call_depth_emit_accounting(&prog, func, ip);
>>>> if (emit_call(&prog, func, ip))
>>>> return -EINVAL;
>>>> + if (enable_shadow_stack)
>>>> + EMIT2(0x41, 0x59);
>>> push/pop around calls are load/store plus math on %rsp.
>>> I think it's cheaper to reload r9 after the call with
>>> a single insn.
>>> The reload of r9 is effectively gs+const.
>>> There is no memory access. So it should be faster.
>> Two insn may be necessary since __alloc_percpu_gfp()
>> may return a pointer beyond 32 bits.
>>
>>> Technically we can replace all uses of R10==rbp with
>>> 'gs:' based instructions.
>>> Like:
>>> r1 = r10
>>> can be jitted into
>>> lea rdi, gs + (u32)shadow_frame_ptr
>>>
>>> and r0 = *(u32 *)(r10 - 64)
>>> can be jitted into:
>>> mov rax, dword ptr gs:[(u32)shadow_frame_ptr - 64]
>>>
>>> but that is probably a bunch of jit changes.
>>> So I'd start with a simple reload of r9 after each call.
>> This is a good idea. We might need this so we only have
>> one extra insn per call.
> Since reload of r9 is a two insn sequence of 64-bit mov immediate,
> and load from gs:this_cpu_off, I suspect, push/pop r9
> might be faster. So I'd stick to what you have already.
>
> Interesting though that static per-cpu vars have 32-bit pointers,
> but dynamic per-cpu alloc returns full 64-bit? hmm.
Not always. This RFC works in my local qemu run as dynamic per-cpu
allocation returns 32-bit. But CI failed since in CI 64-bit ptr val
is returned. Later on, with different code based, my local qemu
can also return 64-bit per-cpu ptr. In the next revision, I will
use 64-bit value to hold shadow_frame_ptr (to be named private_frame_ptr).
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-06-18 15:20 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-10 5:18 [RFC PATCH bpf-next] bpf: Support shadow stack for bpf progs Yonghong Song
2024-06-14 0:30 ` Alexei Starovoitov
2024-06-16 5:52 ` Yonghong Song
2024-06-17 23:19 ` Alexei Starovoitov
2024-06-18 15:20 ` Yonghong Song
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox