Message-ID: <89c98687-2087-46eb-8341-6ae65d70cb9c@linux.dev>
Date: Thu, 3 Oct 2024 13:54:18 -0700
Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack
From: Yonghong Song
To: Kumar Kartikeya Dwivedi
Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, Martin KaFai Lau
References: <20240926234506.1769256-1-yonghong.song@linux.dev> <8b6c1eb1-de43-4ddb-b2b6-48256bdacddb@linux.dev>

On 10/3/24 1:47 PM, Kumar Kartikeya Dwivedi wrote:
> On Thu, 3 Oct 2024 at 22:44, Yonghong Song wrote:
>>
>> On 10/3/24 10:35 AM, Alexei Starovoitov wrote:
>>> On Thu, Oct 3, 2024 at 6:40 AM Kumar Kartikeya Dwivedi wrote:
>>>> On Thu, 3 Oct 2024 at 08:17, Yonghong Song wrote:
>>>>> On 10/1/24 6:26 PM, Alexei Starovoitov wrote:
>>>>>> On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi wrote:
>>>>>>> Makes sense, though will we have cases where hierarchical scheduling
>>>>>>> attaches the same prog at different points of the hierarchy?
>>>>>> I'm not sure anyone was asking for such a use case.
>>>>>>
>>>>>>> Then the limit of 4 may not be enough (e.g. say with cgroup
>>>>>>> nested levels > 4).
>>>>>> Well, 4 was the number from TJ.
>>>>>>
>>>>>> Anyway the proposed pseudo code:
>>>>>>
>>>>>> __bpf_prog_enter_recur_limited()
>>>>>> {
>>>>>>     cnt = this_cpu_inc_return(*(prog->active));
>>>>>>     if (cnt > 4) {
>>>>>>         inc_miss;
>>>>>>         return 0;
>>>>>>     }
>>>>>>     // pass cnt into bpf prog somehow, like %rdx ?
>>>>>>     // or re-read prog->active from prog
>>>>>> }
>>>>>>
>>>>>> then in the prologue emit:
>>>>>>
>>>>>> push rbp
>>>>>> mov rbp, rsp
>>>>>> if %rdx == 1
>>>>>>     // main prog is called for the first time
>>>>>>     mov rsp, pcpu_priv_stack_top
>>>>>> else
>>>>>>     // 2+nd time main prog is called or 1+ time subprog
>>>>>>     sub rsp, stack_size
>>>>>>     if rsp < pcpu_priv_stack_bottom
>>>>>>         goto exit // stack is too small, exit
>>>>>> fi
>>>>> I have tried to implement this (not yet handling recursion)
>>>>> based on the above approach. It works fine with nested bpf
>>>>> subprogs like
>>>>>     main prog    // set rsp = pcpu_priv_stack_top
>>>>>       subprog1   // some stack
>>>>>         subprog2 // some stack
>>>>>
>>>>> The pcpu_priv_stack is allocated like
>>>>>     priv_stack_ptr = __alloc_percpu_gfp(1024 * 16, 8, GFP_KERNEL);
>>>>>
>>>>> But whenever the prog calls an external function, e.g. a helper
>>>>> in this case, I get a double fault. An example:
>>>>>     main prog    // set rsp = pcpu_priv_stack_top
>>>>>       subprog1   // some stack
>>>>>         subprog2 // some stack
>>>>>           call bpf_seq_printf
>>>>> (I modified the bpf_iter_ipv6_route.c bpf prog for the above
>>>>> purpose.)
>>>>> I added some printk statements at the beginning of bpf_seq_printf;
>>>>> nothing was printed, and of course the trap still happens.
>>>>>
>>>>> I tried another example without subprogs, where the main prog
>>>>> calls a helper, and the same double fault happens, shown below.
>>>>>
>>>>> The error log looks like
>>>>>
>>>>> [   54.024955] traps: PANIC: double fault, error_code: 0x0
>>>>> [   54.024969] Oops: double fault: 0000 [#1] PREEMPT SMP KASAN PTI
>>>>> [   54.024977] CPU: 3 UID: 0 PID: 1946 Comm: test_progs Tainted: G OE 6.11.0-10577-gf25c172fd840-dirty #968
>>>>> [   54.024982] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>>>> [   54.024983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>>> [   54.024986] RIP: 0010:error_entry+0x1e/0x140
>>>>> [   54.024996] Code: ff ff 90 90 90 90 90 90 90 90 90 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 <41> 57 56 31 f6 31 d1
>>>>> [   54.024999] RSP: 0018:ffffe8ffff580000 EFLAGS: 00010806
>>>>> [   54.025002] RAX: f3f3f300f1f1f1f1 RBX: fffff91fffeb0044 RCX: ffffffff84201701
>>>>> [   54.025005] RDX: fffff91fffeb0044 RSI: ffffffff8420128d RDI: ffffe8ffff580178
>>>>> [   54.025007] RBP: ffffe8ffff580140 R08: 0000000000000000 R09: 0000000000000000
>>>>> [   54.025009] R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
>>>>> [   54.025010] R13: 1ffffd1fffeb0014 R14: 0000000000000003 R15: ffffe8ffff580178
>>>>> [   54.025012] FS:  00007fd076525d00(0000) GS:ffff8881f7180000(0000) knlGS:0000000000000000
>>>>> [   54.025015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [   54.025017] CR2: ffffe8ffff57fff8 CR3: 000000010cd80002 CR4: 0000000000370ef0
>>>>> [   54.025021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> [   54.025022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> [   54.025024] Call Trace:
>>>>> [   54.025026]  <#DF>
>>>>> [   54.025028]  ? __die_body+0xaf/0xc0
>>>>> [   54.025032]  ? die+0x2f/0x50
>>>>> [   54.025036]  ? exc_double_fault+0x73/0x80
>>>>> [   54.025040]  ? asm_exc_double_fault+0x23/0x30
>>>>> [   54.025044]  ? common_interrupt_return+0xb1/0xcc
>>>>> [   54.025048]  ? asm_exc_page_fault+0xd/0x30
>>>>> [   54.025051]  ? error_entry+0x1e/0x140
>>>>> [   54.025055]  </#DF>
>>>>> [   54.025056] Modules linked in: bpf_testmod(OE)
>>>>> [   54.025061] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>> Maybe somebody could give a hint why I get a double fault
>>>>> when calling external functions (outside of bpf programs)
>>>>> with the allocated stack?
>>>>>
>>>> I will help in debugging. Can you share the patch you applied locally
>>>> so I can reproduce?
>>> Looks like the idea needs more thought.
>>>
>>> in_task_stack() won't recognize the private stack,
>>> so it will look like a stack overflow and double fault.
>>>
>>> do you have CONFIG_VMAP_STACK ?
>> Yes, my above test runs fine with CONFIG_VMAP_STACK. Let me guard
>> private stack support with CONFIG_VMAP_STACK for now. Not sure
>> whether distributions enable CONFIG_VMAP_STACK or not.
>>
> I think it is the default on most distributions (Debian, Ubuntu, Fedora, etc.).

Thanks for the confirmation! Great that CONFIG_VMAP_STACK is on by default for most distros.