From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-181.mta1.migadu.com (out-181.mta1.migadu.com [95.215.58.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2ED42B67E for ; Wed, 12 Feb 2025 05:35:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1739338554; cv=none; b=STZy+KHzEGDHchkHhKMJUboTiw2D8JHZ0ZrHdtVcCen74b/cQtznjCtXRbBuZzQoE7pjcexrcfUki5Y+bZ6dtbkFI+FXvEqcdmRkcD48Vtz+W5iEu+BL94z5AOBMxKAGCMr93/TeIBzThZnJXSlOblCSUf+blSvOKwzc9EwqMD0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1739338554; c=relaxed/simple; bh=7E+7Z3NsHGX6cghxmofGKJOzGrsUllPpziXjt9wybiI=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=izzu3yT3qciqbCxfRIef8dSf3hhfkxRmQpTpBMd4HQSvWBk2OELhmXus4S1XMZBAN69QfnjXcZ1Bx1r56PwNZKkhQ81BRJZAY4Iz1wK5DpVMQ8fM+jruU8iwMhv8kKfiyiWE+SWictN+wApH8Qrgt75u1WGj7y/5BvOY2GSM84w= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=dexJAmPc; arc=none smtp.client-ip=95.215.58.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="dexJAmPc" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1739338549; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=KDcNXpwqtHt70y6qnvtr2z2oLnLuYPSTSkl0LmcHIZI=; b=dexJAmPc6D+wzx0aPybn7sbOFe6JSpkQOkTFBydLJ+D61lQrNF0GSnNHJvkzcMEMtj+40E iHwPhmcHHx0hObU2DLgMNAglEL+8LiZbHi7M7AucO7fSXz2HM9hUpIeFZUnTwS5WAaQJoz SHi5zJG15jPf3xQqqbTNio+B/PQlmw8= Date: Tue, 11 Feb 2025 21:35:33 -0800 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering Content-Language: en-GB To: Andrii Nakryiko Cc: bpf@vger.kernel.org, Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau References: <20250206225956.3740809-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 2/11/25 2:57 PM, Andrii Nakryiko wrote: > On Thu, Feb 6, 2025 at 3:00 PM Yonghong Song wrote: >> Currently for bpf progs in a cgroup hierarchy, the effective prog array >> is computed from bottom cgroup to upper cgroups. For example, the following >> cgroup hierarchy >> root cgroup: p1, p2 >> subcgroup: p3, p4 >> have BPF_F_ALLOW_MULTI for both cgroup levels. >> The effective cgroup array ordering looks like >> p3 p4 p1 p2 >> and at run time, the progs will execute based on that order. >> >> But in some cases, it is desirable to have root prog executes earlier than >> children progs. For example, >> - prog p1 intends to collect original pkt dest addresses. >> - prog p3 will modify original pkt dest addresses to a proxy address for >> security reason. >> The end result is that prog p1 gets proxy address which is not what it >> wants. Also, putting p1 to every child cgroup is not desirable either as it >> will duplicate itself in many child cgroups. And this is exactly a use case >> we are encountering in Meta. >> >> To fix this issue, let us introduce a flag BPF_F_PRIO_TOPDOWN. If the flag >> is specified at attachment time, the prog has higher priority and the >> ordering with that flag will be from top to bottom. For example, in the >> above example, >> root cgroup: p1, p2 >> subcgroup: p3, p4 >> Let us say p1, p2 and p4 are marked with BPF_F_PRIO_TOPDOWN. The final > I'm not a big fan of PRIO_TOPDOWN naming, and this example just > provides further argument for why. Between p3 and p4 programs in > subcgroup, there is no notion of TOPDOWN, they are at the same level > of the hierarchy. > > In graphs, for DFS, PRIO_TOPDOWN semantics corresponds to pre-order vs > (current and default) post-order. So why not something like > BPF_F_PREORDER or some variation on that? BPF_F_PREORDER sounds good to me. > > Also, for your example if would be nicer if p1 and p3 were the default > post-order attachment, while p2 and p4 were pre-order. Then you'd have > p2, p4, p3, p1, where everything is swapped relative to original > ordering ;) Yes, will adjust the test for such scenario! > >> effective array ordering will be >> p1 p2 p4 p3 >> >> Suggested-by: Andrii Nakryiko >> Signed-off-by: Yonghong Song >> --- >> include/linux/bpf-cgroup.h | 1 + >> include/uapi/linux/bpf.h | 1 + >> kernel/bpf/cgroup.c | 37 +++++++++++++++++++++++++++++++--- >> kernel/bpf/syscall.c | 3 ++- >> tools/include/uapi/linux/bpf.h | 1 + >> 5 files changed, 39 insertions(+), 4 deletions(-) >> >> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h >> index 7fc69083e745..3d4f221df9ef 100644 >> --- a/include/linux/bpf-cgroup.h >> +++ b/include/linux/bpf-cgroup.h >> @@ -111,6 +111,7 @@ struct bpf_prog_list { >> struct bpf_prog *prog; >> struct bpf_cgroup_link *link; >> struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE]; >> + bool is_prio_topdown; > let's go with `int flags`, we increase the size of struct > bpf_prog_list by 8 bytes anyways, so let's make this a bit more > generic? Ack. > >> }; >> >> int cgroup_bpf_inherit(struct cgroup *cgrp); >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index fff6cdb8d11a..7ae8e8751e78 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -1207,6 +1207,7 @@ enum bpf_perf_event_type { >> #define BPF_F_BEFORE (1U << 3) >> #define BPF_F_AFTER (1U << 4) >> #define BPF_F_ID (1U << 5) >> +#define BPF_F_PRIO_TOPDOWN (1U << 6) >> #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */ >> >> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index 46e5db65dbc8..f31250c6025b 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -382,6 +382,21 @@ static u32 prog_list_length(struct hlist_head *head) >> return cnt; >> } >> >> +static u32 prog_list_length_with_topdown_cnt(struct hlist_head *head, int *topdown_cnt) > instead of duplicating prog_list_length(), let's add this `int > *topdown_cnt` counter as an optional argument, which prog_list_length > will fill out only if it's provided, i.e., you'll just have: > > if (topdown_cnt && pl->is_prio_topdown) > (*topdown_cnt) += 1; > > as one extra condition inside the loop? I thought about this as well. I tried to create a new function since prog_list_length() is used in several different places. This is not critical path, so yes, I can just add one addititional parameter for prog_list_length() as you suggested. > >> +{ >> + struct bpf_prog_list *pl; >> + u32 cnt = 0; >> + >> + hlist_for_each_entry(pl, head, node) { >> + if (!prog_list_prog(pl)) >> + continue; >> + cnt++; >> + if (pl->is_prio_topdown) >> + (*topdown_cnt) += 1; >> + } >> + return cnt; >> +} >> +