From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9CEC433A9 for ; Wed, 2 Oct 2024 06:49:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727851746; cv=none; b=kjJBM5T3NDEk+dDfKkHViuJMDm74NiOizKKKAo2uIGs5ECBZr7RJtHBoIyqGKvUUYB59mS9kNTCDH99O2c/sk9Tz+FULe2aSPyZwjIb26DaB8HHZ7bEmi1n3XfRSg1XzBOTsbpKmyla7QLP9VH2RwIF1ZBlHJgUO0WBk+qg4D4c= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727851746; c=relaxed/simple; bh=bs0V9niQEq/DX4u6e6/Qdtb3UpO/5LkeAMb2zRWAhBM=; h=Message-ID:Date:MIME-Version:Subject:From:To:Cc:References: In-Reply-To:Content-Type; b=GJJPV4Ip4QBWzFY4/kmaduYOVwOc5RqvV+3Knikp0s6uKBsB9IuzFhD0rthkwZ6p5Ppv7hWj901MLvSMaCAsY9VQNhYobC+s00CdfVYA9LWtipYJcnO7N97cl4PXQ2r3A8bM5gY2Mk2IFN7fb/YV7ni87gkApApKIyPPkvkr8Uk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=Si0ettZm; arc=none smtp.client-ip=91.218.175.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="Si0ettZm" Message-ID: <6366e7d3-f81b-4837-b105-bced5217c95f@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1727851739; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EEdraAmQF9/P0QqnWL33aNgCzRj89B2W2T+RbBTzf6U=; b=Si0ettZmjPEgI3XKYPOIAAm3a207yuRxXyPgJmT399wQwD0pe4wAlWfPy6dhEfXdeNoSFM 2v3b2KDqqfdRDmZWhVcY2MtcnLFuBjFX1QLfmh7S0rMpJoxN5/9/mZxDdUdt36sHr1qFUl 9SCyH/7jif38VtSHe61Mm+tn+TuYNes= Date: Tue, 1 Oct 2024 23:48:51 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack Content-Language: en-GB X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song To: Kumar Kartikeya Dwivedi , Alexei Starovoitov Cc: bpf , Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Kernel Team , Martin KaFai Lau References: <20240926234506.1769256-1-yonghong.song@linux.dev> <20240926234526.1770736-1-yonghong.song@linux.dev> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 10/1/24 11:28 PM, Yonghong Song wrote: > > On 10/1/24 7:16 PM, Kumar Kartikeya Dwivedi wrote: >> On Wed, 2 Oct 2024 at 03:26, Alexei Starovoitov >> wrote: >>> On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi >>> wrote: >>>> Makes sense, though will we have cases where hierarchical scheduling >>>> attaches the same prog at different points of the hierarchy? >>> I'm not sure anyone was asking for such a use case. >> I wondered because why would you then need a limit of 4 (say instead >> of disallowing it)? >> >>>> Then the >>>> limit of 4 may not be enough (e.g. say with cgroup nested levels > 4). >>> Well, 4 was the number from TJ. >>> >> Ok, then let's assume 4 would be enough. >> >>> Anyway the proposed pseudo code: >>> >>> __bpf_prog_enter_recur_limited() >>> { >>>    cnt = this_cpu_inc_return(*(prog->active)); >>>    if (cnt > 4) { >>>       inc_miss >>>       return 0; >>>    } >>>   // pass cnt into bpf prog somehow, like %rdx ? >>>   // or re-read prog->active from prog >>> } >>> >>> >>> then in the prologue emit: >>> >>> push rbp >>> mov rbp, rsp >>> if %rdx == 1 >>>     // main prog is called for the first time >>>     mov rsp, pcpu_priv_stack_top > > This sounds good in high level. I still need to figure out > 'if %rdx == 1' part and how to implement this. Okay, looks like trampoline could supply rdx == 1. > >>> else >>>     // 2+nd time main prog is called or 1+ time subprog >>>    sub rsp, stack_size >>>    if rsp < pcpu_priv_stack_bottom >>>      goto exit  // stack is too small, exit >>> fi >> I think we need just the second part for subprogs, right? >> Since rdx is R3 (arg into subprog). >> I guess that's what you meant in the pseudocode. >> But otherwise sounds good. >> The benefit with stack probing is we don't exactly limit to 4 cases. >> >> Another option instead of the branch in main prog is to divide in 4 >> slots (as you said before) and choose the slot based on cnt. >> But then we're stuck with a max limit of 4. Since we're allocating >> stack size of bpf + extra (which I guess is 8K?). rdx can be used to >> pass in the priv_stack address of the right slot. >> >> So I think the probing version seems better. We can probably pass in >> rdx = priv_stack and then test and cmov instead for main prog. > > Yes, we do not need to limit to 4, checking rsp < pcpu_priv_stack_bottom > should be okay. > >> >>> Since stack bottom/top are known at JIT time we can >>> generate reliable stack overflow checks. >>> Much better than guard pages and -fstack-protector. >>> The prog can alloc percpu >>> (stack size of main prog + subprogs + extra) * 4 >> extra will be 8K, I guess (same as kernel stack size)? >> Just confirming. >> >>> and it likely will be enough. >>> If not, the stack protection will gently exit the prog >>> when the stack is too deep. >> I like this stack probing version, since there's no hard limit on the >> number of recursions, and it's safe against stack overflow as well. >> >>> kfunc won't have such a check, so we need a buffer zone. >>> Can have a guard page too, but feels like overkill. >> I was leaning toward saying yes for a guard page, since we'll atleast >> have a hard error instead of random corruption if the kfunc goes >> beyond the bottom after probing succeeds. >> >> But the better way might be doing if rsp < pcpu_priv_stack_bottom + >> 8K, so we leave max headroom we reserve for kernel stuff (or say add >> 4K instead, which should be good enough), and then skip execution. >