Date: Wed, 14 Feb 2024 19:07:14 -0800
From: Yonghong Song
To: Alexei Starovoitov
Cc: bpf, Tejun Heo, Alexei Starovoitov, David Vernet, lsf-pc
Subject: Re: [LSF/MM/BPF TOPIC] Segmented Stacks for BPF Programs

On 2/14/24 6:20 PM, Alexei Starovoitov wrote:
> On Wed, Feb 14, 2024 at 11:53 AM Yonghong Song wrote:
>> For each active kernel thread, the thread stack size is 2*PAGE_SIZE ([1]).
>> Each bpf program has a maximum stack size of 512 bytes to avoid
>> overflowing the thread stack. But nested bpf programs may pose
>> a challenge with respect to stack overflow.
>>
>> For example, we already allow nested bpf programs, especially
>> in tracing, i.e.,
>>   Prog_A
>>     -> Call Helper_B
>>       -> Call Func_C
>>         -> fentry program is called due to Func_C
>>           -> Call Helper_D and then Func_E
>>             -> fentry due to Func_E
>>               -> ...
>> If we have too many bpf programs in the chain and each bpf program
>> has close to 512 bytes of stack, it could overflow the kernel thread
>> stack.
>>
>> Another, more practical, potential use case came from a discussion
>> between Alexei and Tejun. For a complex scheduler like sched-ext,
>> we could have a BPF prog hierarchy like below:
>>   Prog_1 (at system level)
>>     Prog_Numa_1  Prog_Numa_2  ...  Prog_Numa_4
>>       Prog_LLC_1  Prog_LLC_2  ...
>>         Prog_CPU_1  ...
>>
>> Basically, the top bpf program (Prog_1) will call the Prog_Numa_*
>> programs through a kfunc to collect information from programs in each
>> numa node. Each Prog_Numa_* program will call Prog_LLC_* programs to
>> collect information from programs in each LLC domain in that particular
>> numa node, and so on for Prog_LLC_* vs. Prog_CPU_*.
>> Now we have four levels of nested bpf programs.
>>
>> The proposed approach is to allocate the stack from the heap for
>> each bpf program. That way, we do not need to worry about
>> kernel stack overflow. Such an approach is called
>> segmented stacks ([2]) in clang/gcc/go etc.
>>
>> Obviously there are some drawbacks to the segmented stack approach:
>> - some performance degradation, so this approach may not be for everyone.
>> - stack backtracking: kernel changes are necessary.
> I suspect segmented stacks the way compilers do them are not suitable
> for bpf progs, since they break backtraces, and backtrace is a crucial
> feature that must work even when there are kernel bugs.
> How about we keep call/ret and save/restore of callee-saved regs
> on the normal stack, but use a parallel memory (per-cpu or some other)
> for the bpf prog's needs. What the bpf prog thinks of as its stack will
> be in that memory, while the call chain will remain correct.
> From the bpf prog's pov the stack is where bpf_reg_r10 points.
> It doesn't have to be in the kernel stack. Shadow memory will work.

Thanks for the suggestions. Makes sense. This will resolve the backtrace
issue. Let me experiment with this approach.

>
> Let's also call it something else than "segmented stack" to avoid
> confusion.

Indeed, the "segmented stack" terminology is not appropriate any more
since the original stack remains backtraceable. Will think of a
different name.
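
For the shadow memory idea, something along the lines below is what I
plan to experiment with. This is only a sketch to check my understanding
of the direction: all names here (bpf_shadow_stack, bpf_shadow_frame_push,
bpf_shadow_frame_pop, MAX_BPF_NEST_DEPTH) are made up for illustration,
and the real mechanics would live in the trampoline / jited prologue,
which would point r10 at the claimed frame while call/ret and
callee-saved regs stay on the normal kernel stack:

  #include <linux/percpu.h>

  /* Illustrative names only, not actual kernel symbols. */
  #define MAX_BPF_NEST_DEPTH      8
  #define BPF_SHADOW_FRAME_SZ     512     /* today's per-prog stack limit */

  struct bpf_shadow_stack {
          int depth;              /* current nesting level on this cpu */
          char frame[MAX_BPF_NEST_DEPTH][BPF_SHADOW_FRAME_SZ];
  };

  static DEFINE_PER_CPU(struct bpf_shadow_stack, bpf_shadow_stacks);

  /*
   * Claim one shadow frame before entering a bpf prog; the jited
   * prologue would then make r10 point at the returned address
   * instead of at the kernel stack. Per-cpu should be fine since
   * progs run with migration disabled; the depth index handles
   * nesting on the same cpu (e.g. fentry progs).
   */
  static void *bpf_shadow_frame_push(void)
  {
          struct bpf_shadow_stack *ss = this_cpu_ptr(&bpf_shadow_stacks);

          if (ss->depth >= MAX_BPF_NEST_DEPTH)
                  return NULL;    /* nesting too deep: reject or fall back */

          /* bpf stack grows down, so r10 points at the top of the frame. */
          return ss->frame[ss->depth++] + BPF_SHADOW_FRAME_SZ;
  }

  static void bpf_shadow_frame_pop(void)
  {
          this_cpu_ptr(&bpf_shadow_stacks)->depth--;
  }

The depth limit, frame size, per-cpu memory cost and where exactly
push/pop happen (trampoline vs. jited prologue) are all open questions;
the above is just to make sure we are talking about the same scheme.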