From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-176.mta0.migadu.com (out-176.mta0.migadu.com [91.218.175.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 37EEA187FE4 for ; Wed, 2 Oct 2024 06:28:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.176 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727850538; cv=none; b=gkam4ZcxjrllSKYJGdJxuhBUUPipEWFxzilZgyNB/3uf3bw6kHCI//S0gRsdlaHaFw0bmMF0RYhhYsqfxZYjO4qfiS/WnMrC2EqDbIV/YVbuUeKSOeQSfLf/Vx6S5CWmbO2fCm+AiccHF3Rzc40+ANfJg+ib82NCVtPTwAAkYg4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727850538; c=relaxed/simple; bh=dAXk2ePTXn5yKxEi6gTka+FI9O/8tixftQDmO5oBsOM=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=HWGH9gBcIZ1IjSPtWKBuSQwPNgl1POvAc1OavnRmPqvVdX20B1nBL7CFHeohYfJktHYjGpwAQ+NBjOZpvcVpmdLPqMnPpndnJ++X3xz/t3EeaL8WVRwE0YmdmyCQH8IjkNq5LgZ5b2pPv6W5EdfjIHilso/kSxcCwZout/30RqM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=sR6Fi+ng; arc=none smtp.client-ip=91.218.175.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="sR6Fi+ng" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1727850533; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0jTPbLtG2MqeXmZpOwLgUqbyKCXlhVOBPglhdX90A0E=; b=sR6Fi+nghoQqxKy3Cb7pUSPqi0SzLYtABDtJN29k107QMLwkbCclGdF+OEydMwwxWbv/if UnSjalJyQMS0AJxYIZIcMAwCvrrNScfug6ZoDanUOPHM+yGS7K+L1aAzCYDDrCOY0c0Bzi 7k0gzCBS4/NHzNhf/GERHH7UMCTNgeQ= Date: Tue, 1 Oct 2024 23:28:44 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack Content-Language: en-GB To: Kumar Kartikeya Dwivedi , Alexei Starovoitov Cc: bpf , Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Kernel Team , Martin KaFai Lau References: <20240926234506.1769256-1-yonghong.song@linux.dev> <20240926234526.1770736-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 10/1/24 7:16 PM, Kumar Kartikeya Dwivedi wrote: > On Wed, 2 Oct 2024 at 03:26, Alexei Starovoitov > wrote: >> On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi wrote: >>> Makes sense, though will we have cases where hierarchical scheduling >>> attaches the same prog at different points of the hierarchy? >> I'm not sure anyone was asking for such a use case. > I wondered because why would you then need a limit of 4 (say instead > of disallowing it)? > >>> Then the >>> limit of 4 may not be enough (e.g. say with cgroup nested levels > 4). >> Well, 4 was the number from TJ. >> > Ok, then let's assume 4 would be enough. > >> Anyway the proposed pseudo code: >> >> __bpf_prog_enter_recur_limited() >> { >> cnt = this_cpu_inc_return(*(prog->active)); >> if (cnt > 4) { >> inc_miss >> return 0; >> } >> // pass cnt into bpf prog somehow, like %rdx ? >> // or re-read prog->active from prog >> } >> >> >> then in the prologue emit: >> >> push rbp >> mov rbp, rsp >> if %rdx == 1 >> // main prog is called for the first time >> mov rsp, pcpu_priv_stack_top This sounds good in high level. I still need to figure out 'if %rdx == 1' part and how to implement this. >> else >> // 2+nd time main prog is called or 1+ time subprog >> sub rsp, stack_size >> if rsp < pcpu_priv_stack_bottom >> goto exit // stack is too small, exit >> fi > I think we need just the second part for subprogs, right? > Since rdx is R3 (arg into subprog). > I guess that's what you meant in the pseudocode. > But otherwise sounds good. > The benefit with stack probing is we don't exactly limit to 4 cases. > > Another option instead of the branch in main prog is to divide in 4 > slots (as you said before) and choose the slot based on cnt. > But then we're stuck with a max limit of 4. Since we're allocating > stack size of bpf + extra (which I guess is 8K?). rdx can be used to > pass in the priv_stack address of the right slot. > > So I think the probing version seems better. We can probably pass in > rdx = priv_stack and then test and cmov instead for main prog. Yes, we do not need to limit to 4, checking rsp < pcpu_priv_stack_bottom should be okay. > >> Since stack bottom/top are known at JIT time we can >> generate reliable stack overflow checks. >> Much better than guard pages and -fstack-protector. >> The prog can alloc percpu >> (stack size of main prog + subprogs + extra) * 4 > extra will be 8K, I guess (same as kernel stack size)? > Just confirming. > >> and it likely will be enough. >> If not, the stack protection will gently exit the prog >> when the stack is too deep. > I like this stack probing version, since there's no hard limit on the > number of recursions, and it's safe against stack overflow as well. > >> kfunc won't have such a check, so we need a buffer zone. >> Can have a guard page too, but feels like overkill. > I was leaning toward saying yes for a guard page, since we'll atleast > have a hard error instead of random corruption if the kfunc goes > beyond the bottom after probing succeeds. > > But the better way might be doing if rsp < pcpu_priv_stack_bottom + > 8K, so we leave max headroom we reserve for kernel stuff (or say add > 4K instead, which should be good enough), and then skip execution.