From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-180.mta0.migadu.com (out-180.mta0.migadu.com [91.218.175.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0519916EB71 for ; Mon, 22 Jul 2024 16:43:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1721666593; cv=none; b=JxkN6vU4t7ENBMzPhKdbJpIvVH8OPyOTuovjlmJxAG5hS1EbFtKoekSehK2gFfqSuUL5yA1uW9BIGM5PMvkZrhqcsGzyb18E4/XfW8UK9ERJsmwWG/x923kjLfQnJcW3e74ads7D+jgB2CrqvKmu2ALbv2nx0dt8IKQjve928jM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1721666593; c=relaxed/simple; bh=lyXhLKXPe/KSQMjLrI3m4hH6Gbn/gikbORszc7633T8=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=GdM6gRM9t4m0otV9qaT8rrqQltfEREdBpLtiTImhY6cVRNGHXu9sxKdzEeIdoWczia13/Gwmur5r+eQBZ601qUZ2BmlRrhusK6QuNJwU4spaJ4YsxToBCDODiIC2NwdoSd4LIxvKTVSkKktD6kSNTVQGIzH9/VJxQzjZbF7l57Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=gueXNHhD; arc=none smtp.client-ip=91.218.175.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="gueXNHhD" X-Envelope-To: andrii.nakryiko@gmail.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1721666588; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=82Ch8UPv3ZF915xu78U4Fur+9krC3N6cdVmrXlEOmJY=; b=gueXNHhDpMRRLh20tcmlifnuIixplJ0ejB+p0tHxKKYa4nY7n7bZ5o7XeJEIU7JfbrOVMM xdBiPcNOSPnjjuZn2IseqOt0N7JqIy2xuWoZvAAcZzVmvVDBpJVEaWz8cWd7hQCv5LFc50 3vDaCDXuhjbpvG55vPN7ORpHlNQZ6KU= X-Envelope-To: bpf@vger.kernel.org X-Envelope-To: ast@kernel.org X-Envelope-To: andrii@kernel.org X-Envelope-To: daniel@iogearbox.net X-Envelope-To: kernel-team@fb.com X-Envelope-To: martin.lau@kernel.org Message-ID: <036e4320-1e22-4066-bfa5-42b1fa290a39@linux.dev> Date: Mon, 22 Jul 2024 09:43:04 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Content-Language: en-GB To: Andrii Nakryiko Cc: bpf@vger.kernel.org, Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau References: <20240718205158.3651529-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 7/19/24 8:28 PM, Andrii Nakryiko wrote: > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song wrote: >> The main motivation for private stack comes from nested >> scheduler in sched-ext from Tejun. The basic idea is that >> - each cgroup will its own associated bpf program, >> - bpf program with parent cgroup will call bpf programs >> in immediate child cgroups. >> >> Let us say we have the following cgroup hierarchy: >> root_cg (prog0): >> cg1 (prog1): >> cg11 (prog11): >> cg111 (prog111) >> cg112 (prog112) >> cg12 (prog12): >> cg121 (prog121) >> cg122 (prog122) >> cg2 (prog2): >> cg21 (prog21) >> cg22 (prog22) >> cg23 (prog23) >> >> In the above example, prog0 will call a kfunc which will >> call prog1 and prog2 to get sched info for cg1 and cg2 and >> then the information is summarized and sent back to prog0. >> Similarly, prog11 and prog12 will be invoked in the kfunc >> and the result will be summarized and sent back to prog1, etc. >> >> Currently, for each thread, the x86 kernel allocate 8KB stack. >> The each bpf program (including its subprograms) has maximum >> 512B stack size to avoid potential stack overflow. >> And nested bpf programs increase the risk of stack overflow. >> To avoid potential stack overflow caused by bpf programs, >> this patch implemented a private stack so bpf program stack >> space is allocated dynamically when the program is jited. >> Such private stack is applied to tracing programs like >> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and >> tracing. >> >> But more than one instance of the same bpf program may >> run in the system. To make things simple, percpu private >> stack is allocated for each program, so if the same program >> is running on different cpus concurrently, we won't have >> any issue. Note that the kernel already have logic to prevent >> the recursion for the same bpf program on the same cpu >> (kprobe, fentry, etc.). >> >> The patch implemented a percpu private stack based approach >> for x86 arch. >> - The stack size will be 0 and any stack access is from >> jit-time allocated percpu storage. >> - In the beginning of jit, r9 is used to save percpu >> private stack pointer. >> - Each rbp in the bpf asm insn is replaced by r9. >> - For each call, push r9 before the call and pop r9 >> after the call to preserve r9 value. >> >> Compared to previous RFC patch [1], this patch added >> some conditions to enable private stack, e.g., verifier >> calculated stack size, prog type, etc. The new patch >> also added a performance test to compare private stack >> vs. no private stack. >> >> The following are some code example to illustrate the idea >> for selftest cgroup_skb_sk_lookup: >> >> the existing code the private-stack approach code >> endbr64 endbr64 >> nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0] >> xchg ax,ax xchg ax,ax >> push rbp push rbp >> mov rbp,rsp mov rbp,rsp >> endbr64 endbr64 >> sub rsp,0x68 >> push rbx push rbx >> ... ... >> ... mov r9d,0x8c1c860 >> ... add r9,QWORD PTR gs:0x21a00 >> ... ... >> mov rdx,rbp mov rdx, r9 >> add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4 >> ... ... >> mov ecx,0x28 mov ecx,0x28 >> push r9 >> call 0xffffffffe305e474 call 0xffffffffe305e524 >> pop r9 >> mov rdi,rax mov rdi,rax >> ... ... >> movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46] >> ... ... >> > Eduard nerd-sniped me today with this a bit... :) > > I have a few questions and suggestions. > > So it seems like each *subprogram* (not the entire BPF program) gets > its own per-CPU private stack allocation. Is that intentional? That Currently yes. The reason is the same prog could be run on different cpus at the same time. > seems a bit unnecessary. It also prevents any sort of actual > recursion. Not sure if it's possible to write recursive BPF subprogram > today, verifier seems to reject obvious limited recursion cases, but > still, eventually we might need/want to support that, and this will be > just another hurdle to overcome (so it's best to avoid adding it in > the first place). > > I'm sure Eduard is going to try something like below and it will > probably break badly (I haven't tried, sorry): > > int entry(void *ctx); > > struct { > __uint(type, BPF_MAP_TYPE_PROG_ARRAY); > __uint(max_entries, 1); > __uint(key_size, sizeof(__u32)); > __array(values, int (void *)); > } prog_array_init SEC(".maps") = { > .values = { > [0] = (void *)&entry, > }, > }; > > static __noinline int subprog1(void) > { > > > /* here entry will replace subprog1, and so we'll have > * entry -> entry -> entry -> ..... -> subprog1 > */ > bpf_tail_call(ctx, &prog_array_init, 0); > > return 0; > } > > > SEC("raw_tp/sys_enter") > int entry(void *ctx) > { > > > subprog1(); > } > > And we effectively have limited recursion where the entry's stack > state is clobbered, no? > > So it seems like we need to support recursion. > > > So, the question I have is. Why not do the following: > a) only setup r9 *once* in entry program's prologue (before tail call > jump target) > b) before each call we can adjust r9 with current prog/subprog's > maximum *own* stack, something like: > > push r9; > r9 += 128; // 128 is subprog's stack usage > call > pop r9; > > The idea being that on tail call or in subprog call we assume r9 is > already pointing to the right place. We can probably also figure out > how to avoid push/pop r9 if we make sure that subprogram always > restores r9 (taking tail calls into account and all that, of course)? > > Is this feasible? This is possible. I actually hacked such an idea easily. The basic idea is push frame pointer as an additional argument to the bpf static sub-prog. This is a little bit complicated. It will probably save some stack size but I am not sure how much it is. > > Another question I have is whether it would be possible to just plain > set rbp to private stack and keep using rbp in such a way that stack > traces are preserved? I.e., save return address on private stack to > unwinders can correctly jump back to kernel's stack? I also tried this approach earlier. But it is very trickly we need to modify rbp/rsp and additional jit code will be added If interrupts happens, we will not be able to get reliable stack trace. > > How stupid is what I propose above? > > >> So the number of insns is increased by 1 + num_of_calls * 2. >> Here the number of calls are those calls in the final jited binary. >> Comparing function call itself, the push/pop overhead should be >> minimum in most common cases. >> >> Our original use case is for sched-ext nested scheduler. This will be done >> in the future. >> >> [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/ >> >> Signed-off-by: Yonghong Song >> --- >> arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++--- >> include/linux/bpf.h | 2 ++ >> kernel/bpf/core.c | 20 ++++++++++++ >> kernel/bpf/syscall.c | 1 + >> 4 files changed, 82 insertions(+), 4 deletions(-) >> > [...]