From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-175.mta0.migadu.com (out-175.mta0.migadu.com [91.218.175.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CED9C2564 for ; Thu, 3 Oct 2024 06:17:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.175 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727936281; cv=none; b=T+Z0mSd3KgVzDgwpic2c/lWS2hX/JNyaPpK0VZ1kEiUCv8tsK8aKESOrDRm6fAvv1YpwR6HqxaFWHNOaiteGFU5xotHMJ2ZUk//4K52ucrcafFSInkLuFaW6xPxggzdlvBUzBjYcMTN10f1h26HEIVLF4UU9ajNyCm0qf0FDVFo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727936281; c=relaxed/simple; bh=fIIFQBsw7bb8sbXQLz60tMfAUWKmJJDxvjZuycx22co=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=TZN77Ey3rKW3fuIMRMRh542qKE+9a5vCyWsFcLPwDINh8fWhcj/VIWyHF9arb+vawATkCxuCg/cFZ4/Do+wOkn9fXUX6bpLxtafDinjbZN6O12psSWL5Ocos0OEFlEaUAbdXklotqVCgshwpsooFt2JWBIcybtiNOCnmLG7vomo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=gE5k9ZT/; arc=none smtp.client-ip=91.218.175.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="gE5k9ZT/" Message-ID: <8b6c1eb1-de43-4ddb-b2b6-48256bdacddb@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1727936276; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=l1ae8gTA1k7GPD69d8GTFlpuY5yjiGiTxZ/jQFwsw5I=; b=gE5k9ZT/2KmSj6SI07P9WtWOZBeZP5u5KhbrqLxooONjKtxAZoGXnCNi6c3QJb1L7icgJx aNUwJm8SMOHrdQhctSNGy9ZRji8mewabPMACXBdIbGZIrYeDqrMZZV5NV+OHaw8Zq5Eywx xbbjDDftwvEiPX2j/c8W4X4IiJeg51g= Date: Wed, 2 Oct 2024 23:17:50 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack Content-Language: en-GB To: Alexei Starovoitov , Kumar Kartikeya Dwivedi Cc: bpf , Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Kernel Team , Martin KaFai Lau References: <20240926234506.1769256-1-yonghong.song@linux.dev> <20240926234526.1770736-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 10/1/24 6:26 PM, Alexei Starovoitov wrote: > On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi wrote: >> Makes sense, though will we have cases where hierarchical scheduling >> attaches the same prog at different points of the hierarchy? > I'm not sure anyone was asking for such a use case. > >> Then the >> limit of 4 may not be enough (e.g. say with cgroup nested levels > 4). > Well, 4 was the number from TJ. > > Anyway the proposed pseudo code: > > __bpf_prog_enter_recur_limited() > { > cnt = this_cpu_inc_return(*(prog->active)); > if (cnt > 4) { > inc_miss > return 0; > } > // pass cnt into bpf prog somehow, like %rdx ? > // or re-read prog->active from prog > } > > > then in the prologue emit: > > push rbp > mov rbp, rsp > if %rdx == 1 > // main prog is called for the first time > mov rsp, pcpu_priv_stack_top > else > // 2+nd time main prog is called or 1+ time subprog > sub rsp, stack_size > if rsp < pcpu_priv_stack_bottom > goto exit // stack is too small, exit > fi I have tried to implement this approach (not handling recursion yet) based on the above approach. It works okay with nested bpf subprogs like main prog // set rsp = pcpu_priv_stack_top subprog1 // some stack subprog2 // some stack The pcpu_priv_stack is allocated like priv_stack_ptr = __alloc_percpu_gfp(1024 * 16, 8, GFP_KERNEL); But whenever the prog called an external function, e.g. a helper in this case, I will get a double fault. An example could be main prog // set rsp = pcpu_priv_stack_top subprog1 // some stack subprog2 // some stack call bpf_seq_printf (I modified bpf_iter_ipv6_route.c bpf prog for the above purpose.) I added some printk statements from the beginning of bpf_seq_printf and nothing printed out either and of course traps still happens. I tried another example without subprog and the mainprog calls a helper and the same double traps happens below too. The error log looks like [ 54.024955] traps: PANIC: double fault, error_code: 0x0 [ 54.024969] Oops: double fault: 0000 [#1] PREEMPT SMP KASAN PTI [ 54.024977] CPU: 3 UID: 0 PID: 1946 Comm: test_progs Tainted: G OE 6.11.0-10577-gf25c172fd840-dirty #968 [ 54.024982] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 54.024983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 54.024986] RIP: 0010:error_entry+0x1e/0x140 [ 54.024996] Code: ff ff 90 90 90 90 90 90 90 90 90 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 <41> 57 56 31 f6 31 d1 [ 54.024999] RSP: 0018:ffffe8ffff580000 EFLAGS: 00010806 [ 54.025002] RAX: f3f3f300f1f1f1f1 RBX: fffff91fffeb0044 RCX: ffffffff84201701 [ 54.025005] RDX: fffff91fffeb0044 RSI: ffffffff8420128d RDI: ffffe8ffff580178 [ 54.025007] RBP: ffffe8ffff580140 R08: 0000000000000000 R09: 0000000000000000 [ 54.025009] R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 [ 54.025010] R13: 1ffffd1fffeb0014 R14: 0000000000000003 R15: ffffe8ffff580178 [ 54.025012] FS: 00007fd076525d00(0000) GS:ffff8881f7180000(0000) knlGS:0000000000000000 [ 54.025015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.025017] CR2: ffffe8ffff57fff8 CR3: 000000010cd80002 CR4: 0000000000370ef0 [ 54.025021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 54.025022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 54.025024] Call Trace: [ 54.025026] <#DF> [ 54.025028] ? __die_body+0xaf/0xc0 [ 54.025032] ? die+0x2f/0x50 [ 54.025036] ? exc_double_fault+0x73/0x80 [ 54.025040] ? asm_exc_double_fault+0x23/0x30 [ 54.025044] ? common_interrupt_return+0xb1/0xcc [ 54.025048] ? asm_exc_page_fault+0xd/0x30 [ 54.025051] ? error_entry+0x1e/0x140 [ 54.025055] [ 54.025056] Modules linked in: bpf_testmod(OE) [ 54.025061] ---[ end trace 0000000000000000 ]--- Maybe somebody could give a hint why I got a double fault when calling external functions (outside of bpf programs) with allocated stack? > > Since stack bottom/top are known at JIT time we can > generate reliable stack overflow checks. > Much better than guard pages and -fstack-protector. > The prog can alloc percpu > (stack size of main prog + subprogs + extra) * 4 > and it likely will be enough. > If not, the stack protection will gently exit the prog > when the stack is too deep. > kfunc won't have such a check, so we need a buffer zone. > Can have a guard page too, but feels like overkill.