From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <72f47124-1cab-4406-a6c1-3bed0c3579e8@linux.dev>
Date: Thu, 2 Apr 2026 21:13:15 -0700
X-Mailing-List: bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 07/10] bpf,x86: Implement JIT support for stack arguments
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
 "Jose E . Marchesi", Kernel Team, Martin KaFai Lau
References: <20260402012727.3916819-1-yonghong.song@linux.dev>
 <20260402012803.3920450-1-yonghong.song@linux.dev>
From: Yonghong Song

On 4/2/26 4:51 PM, Alexei Starovoitov wrote:
> On Wed, Apr 1, 2026 at 6:28 PM Yonghong Song wrote:
>> Add x86_64 JIT support for BPF functions and kfuncs with more than
>> 5 arguments. The extra arguments are passed through a stack area
>> addressed by register r12 (BPF_REG_STACK_ARG_BASE) in BPF bytecode,
>> which the JIT translates to RBP-relative accesses in native code.
>>
>> There are two possible approaches to allocate the stack arg area:
>>
>> Option 1: Allocate a single combined region (incoming + max_outgoing)
>> below the program stack in the function prologue. All r12-relative
>> accesses become [rbp - prog_stack_depth - offset] where the 'offset'
>> is the offset value in (incoming + max_outgoing) region. This is
>> simple because the area is always at a fixed offset from RBP.
>> The tradeoff is slightly higher stack usage when multiple callees
>> have different stack arg counts — the area is sized to the maximum.
>>
>> Option 2: Allocate each outgoing area individually at the call
>> site, sized exactly to the callee's needs. This minimizes
>> stack usage but significantly complicates the JIT: each call
>> site must dynamically adjust RSP, and addresses of stack args
>> would shift depending on context, making the offset
>> calculations harder.
>>
>> This patch uses Option 1 for simplicity.
>>
>> The native x86_64 stack layout for a function with incoming and
>> outgoing stack args:
>>
>>   high address
>>   ┌─────────────────────────┐
>>   │ incoming stack arg N    │ [rbp + 16 + (N - 1) * 8] (pushed by caller)
>>   │ ...                     │
>>   │ incoming stack arg 1    │ [rbp + 16]
>>   ├─────────────────────────┤
>>   │ return address          │ [rbp + 8]
>>   │ saved rbp               │ [rbp]
>>   ├─────────────────────────┤
>>   │ callee-saved regs       │
>>   │ BPF program stack       │ (stack_depth bytes)
>>   ├─────────────────────────┤
>>   │ incoming stack arg 1    │ [rbp - prog_stack_depth - 8]
>>   │ ... (copied from        │ (copied in prologue)
>>   │     caller's push)      │
>>   │ incoming stack arg N    │ [rbp - prog_stack_depth - N * 8]
>>   ├─────────────────────────┤
>>   │ outgoing stack arg 1    │ (written via r12-relative STX/ST,
>>   │ ...                     │  JIT translates to RBP-relative)
>>   │ outgoing stack arg M    │
>>   └─────────────────────────┘
>>     ... Other stack usage
>>   ┌─────────────────────────┐
>>   │ incoming stack arg M    │ (copy from outgoing stack arg to
>>   │ ...                     │  incoming stack arg)
>>   │ incoming stack arg 1    │
>>   ├─────────────────────────┤
>>   │ return address          │
>>   │ saved rbp               │
>>   ├─────────────────────────┤
>>   │ ...                     │
>>   └─────────────────────────┘
>>   low address
>>
>> In prologue, the caller's incoming stack arguments are copied to callee's
>> incoming stack arguments, which will be fetched by later load insns.
>> The outgoing stack arguments are written by JIT RBP-relative STX or ST.
>>
>> For each bpf-to-bpf call, push outgoing stack args onto the native
>> stack before CALL, pop them after return. So the same 'outgoing stack arg'
>> area is used by all bpf-to-bpf functions.
>>
>> For kfunc calls, push stack args (arg 7+) onto the native stack
>> and load arg 6 into R9 per the x86_64 calling convention,
>> then clean up RSP after return.
>>
>> Signed-off-by: Yonghong Song
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 145 ++++++++++++++++++++++++++++++++++--
>>  1 file changed, 138 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 32864dbc2c4e..807493f109e5 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -367,6 +367,27 @@ static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
>>  	*pprog = prog;
>>  }
>>
>> +static int push_stack_args(u8 **pprog, s32 base_off, int from, int to)
>> +{
>> +	u8 *prog = *pprog;
>> +	int j, off, cnt = 0;
>> +
>> +	for (j = from; j >= to; j--) {
>> +		off = base_off - j * 8;
>> +
>> +		/* push qword [rbp + off] */
>> +		if (is_imm8(off)) {
>> +			EMIT3(0xFF, 0x75, off);
>> +			cnt += 3;
>> +		} else {
>> +			EMIT2_off32(0xFF, 0xB5, off);
>> +			cnt += 6;
>> +		}
>> +	}
>> +	*pprog = prog;
>> +	return cnt;
>> +}
>> +
>>  static void pop_r12(u8 **pprog)
>>  {
>>  	u8 *prog = *pprog;
>> @@ -1664,19 +1685,35 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>  	int i, excnt = 0;
>>  	int ilen, proglen = 0;
>>  	u8 *prog = temp;
>> -	u32 stack_depth;
>> +	u16 stack_arg_depth, incoming_stack_arg_depth;
>> +	u32 prog_stack_depth, stack_depth;
>> +	bool has_stack_args;
>>  	int err;
>>
>>  	stack_depth = bpf_prog->aux->stack_depth;
>> +	stack_arg_depth = bpf_prog->aux->stack_arg_depth;
>> +	incoming_stack_arg_depth = bpf_prog->aux->incoming_stack_arg_depth;
>>  	priv_stack_ptr = bpf_prog->aux->priv_stack_ptr;
>>  	if (priv_stack_ptr) {
>>  		priv_frame_ptr = priv_stack_ptr + PRIV_STACK_GUARD_SZ + round_up(stack_depth, 8);
>>  		stack_depth = 0;
>>  	}
>>
>> +	/*
>> +	 * Save program stack depth before adding stack arg space.
>> +	 * Each function allocates its own stack arg space
>> +	 * (incoming + outgoing) below its BPF stack.
>> +	 * Stack args are accessed via RBP-based addressing.
>> +	 */
>> +	prog_stack_depth = round_up(stack_depth, 8);
>> +	if (stack_arg_depth)
>> +		stack_depth += stack_arg_depth;
>> +	has_stack_args = stack_arg_depth > 0;
>> +
>>  	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
>>  	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
>>
>> +
>>  	detect_reg_usage(insn, insn_cnt, callee_regs_used);
>>
>>  	emit_prologue(&prog, image, stack_depth,
>> @@ -1704,6 +1741,38 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>  		emit_mov_imm64(&prog, X86_REG_R12,
>>  			       arena_vm_start >> 32, (u32) arena_vm_start);
>>
>> +	if (incoming_stack_arg_depth && bpf_is_subprog(bpf_prog)) {
>> +		int n = incoming_stack_arg_depth / 8;
>> +
>> +		/*
>> +		 * Caller pushed stack args before CALL, so after prologue
>> +		 * (CALL saves ret addr, then PUSH saves old RBP) they sit
>> +		 * above RBP:
>> +		 *
>> +		 *   [rbp + 16 + (n - 1) * 8]  stack_arg n
>> +		 *   ...
>> +		 *   [rbp + 24]                stack_arg 2
>> +		 *   [rbp + 16]                stack_arg 1
>> +		 *   [rbp + 8]                 return address
>> +		 *   [rbp + 0]                 saved rbp
>> +		 *
>> +		 * Copy each into callee's own region below the program stack:
>> +		 *   [rbp - prog_stack_depth - i * 8]
>> +		 */
>> +		for (i = 0; i < n; i++) {
>> +			s32 src = 16 + i * 8;
>> +			s32 dst = -prog_stack_depth - (i + 1) * 8;
>> +
>> +			/* mov rax, [rbp + src] */
>> +			EMIT4(0x48, 0x8B, 0x45, src);
>> +			/* mov [rbp + dst], rax */
>> +			if (is_imm8(dst))
>> +				EMIT4(0x48, 0x89, 0x45, dst);
>> +			else
>> +				EMIT3_off32(0x48, 0x89, 0x85, dst);
>> +		}
> This is really suboptimal.
> bpf calling convention for 6+ args needs to match x86.
> With an exception of 6th arg.
> All bpf insn need to remain as-is when calling another bpf prog
> or kfunc.
> There should be no additional moves.
> JIT should only special case 6th arg and convert bpf's STX [r12-N], src_reg
> into 'mov r9, src_reg', since r9 is used to pass 6th argument on x86.
> The rest of STX needs to be jitted pretty much as-is
> with a twist that bpf's r12 becomes %rbp on x86.
> And similar things in the callee.
> Instead of LDX [r12+N] it will be a 'mov dst_reg, r9' where r9 is x86's r9.
> Other LDX from [r12+M] will remain as-is, but r12->%rbp.
> On arm64 more of the STX/LDX insns become native 'mov'-s
> because arm64 has more registers for arguments.

Good point. I will try to simplify the JIT by following the x86_64
calling convention.

>
> pw-bot: cr
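
For reference, here is a minimal sketch of where each argument lives under
the x86_64 calling convention that the review asks the JIT to follow. It is
a standalone C model, not JIT code and not the planned implementation. The
names x86_64_arg_loc, struct arg_loc and enum arg_kind are hypothetical, and
the RBP offsets assume the usual 'push rbp; mov rbp, rsp' prologue seen from
the callee side. How BPF r12-relative offsets map onto these slots is left
out on purpose, since that is not settled in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of x86_64 integer argument placement (not kernel code). */
enum arg_kind { ARG_IN_REG, ARG_ON_STACK };

struct arg_loc {
	enum arg_kind kind;
	const char *reg;	/* register name when kind == ARG_IN_REG */
	int rbp_off;		/* callee-side offset from RBP when kind == ARG_ON_STACK */
};

/*
 * arg_no is 1-based. Args 1..6 travel in registers, with the 6th in r9.
 * Args 7+ are on the stack: after CALL pushes the return address and the
 * prologue pushes the old RBP, the 7th argument sits at [rbp + 16].
 */
static struct arg_loc x86_64_arg_loc(int arg_no)
{
	static const char * const regs[] = {
		"rdi", "rsi", "rdx", "rcx", "r8", "r9"
	};
	struct arg_loc loc = { ARG_IN_REG, NULL, 0 };

	assert(arg_no >= 1);

	if (arg_no <= 6) {
		loc.reg = regs[arg_no - 1];
		return loc;
	}

	loc.kind = ARG_ON_STACK;
	loc.rbp_off = 16 + (arg_no - 7) * 8;
	return loc;
}
```

In this model only the 6th argument needs a special case (a register move
to or from r9); every later argument is a plain RBP-relative slot, which is
why no extra copies in the prologue would be required.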