From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B59663EA63 for ; Sat, 17 May 2025 15:46:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.170 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747496801; cv=none; b=LfGIKcsHA8q8iwupOma3WWqrx1xZWK/iBFt7DqZSd2b0JQyCZD1qFMIqgj8y/WuKILKoS87fXhfjWXeBk79rZrVqBTSQITjYqa/3Du+vI8fGVZ2c0N/ZWWfqTg6ztDMVawm8Pk+zKp6dsrrb839EoDe6MDfENd7wsIlcjhKgWqg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747496801; c=relaxed/simple; bh=SsrPLfM8wthYD43S65xheBryIlvUKfHdOR+brJMbJKg=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=NFkYcjXOCsRAFra/jfVL2+pT1jxBEEE8erugnVpVz/3vRhdC5/OijRbT/XG+ab5X6z+Lvd7PuJOw5SFhm+e55rvJci+BvaeLQg56f+Kebw+4ze8UrIEo6GBHJoZvPCKQGT2VLfVfkfYtC9I4Yc9YcuoHsQS+qsH3Ohx0ExucO2o= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=uvmfVcXx; arc=none smtp.client-ip=95.215.58.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="uvmfVcXx" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1747496794; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=WI/79AbWQx3gG/Dmhht4xgAIYbS+IPsz7KSU3cLPE8w=; b=uvmfVcXxpawQ1bH2a98v6lZRoqJaiUfRg7QZOJRSG0jztjdLcTbZC9IoSR5tBnJ6W/2O6h 9rQEzPaC6+UDK4NfiOhogTPTTIlTmytA9o4iwINsjXai5/cBeMn89RBhcG+qbtoX20eArV tRifJh7tf1vxI+q/CG0RpXsWeLxEV0w= Date: Sat, 17 May 2025 08:46:23 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH bpf-next v2 2/4] bpf: Implement mprog API on top of existing cgroup progs Content-Language: en-GB To: Andrii Nakryiko Cc: bpf@vger.kernel.org, Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau References: <20250508223524.487875-1-yonghong.song@linux.dev> <20250508223534.488607-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 5/15/25 4:38 AM, Andrii Nakryiko wrote: > On Thu, May 8, 2025 at 3:35 PM Yonghong Song wrote: >> Current cgroup prog ordering is appending at attachment time. This is not >> ideal. In some cases, users want specific ordering at a particular cgroup >> level. To address this, the existing mprog API seems an ideal solution with >> supporting BPF_F_BEFORE and BPF_F_AFTER flags. >> >> But there are a few obstacles to directly use kernel mprog interface. >> Currently cgroup bpf progs already support prog attach/detach/replace >> and link-based attach/detach/replace. For example, in struct >> bpf_prog_array_item, the cgroup_storage field needs to be together >> with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog >> as the member, which makes it difficult to use kernel mprog interface. >> >> In another case, the current cgroup prog detach tries to use the >> same flag as in attach. This is different from mprog kernel interface >> which uses flags passed from user space. >> >> So to avoid modifying existing behavior, I made the following changes to >> support mprog API for cgroup progs: >> - The support is for prog list at cgroup level. Cross-level prog list >> (a.k.a. effective prog list) is not supported. >> - Previously, BPF_F_PREORDER is supported only for prog attach, now >> BPF_F_PREORDER is also supported by link-based attach. >> - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID is supported similar to >> kernel mprog but with different implementation. >> - For detach and replace, use the existing implementation. >> - For attach, detach and replace, the revision for a particular prog >> list, associated with a particular attach type, will be updated >> by increasing count by 1. >> >> Signed-off-by: Yonghong Song >> --- >> include/uapi/linux/bpf.h | 7 ++ >> kernel/bpf/cgroup.c | 144 ++++++++++++++++++++++++++++----- >> kernel/bpf/syscall.c | 44 ++++++---- >> tools/include/uapi/linux/bpf.h | 7 ++ >> 4 files changed, 165 insertions(+), 37 deletions(-) >> > [...] > >> + if (!anchor_prog) { >> + hlist_for_each_entry(pltmp, progs, node) { >> + if ((flags & BPF_F_BEFORE) && *ppltmp) >> + break; >> + *ppltmp = pltmp; > This is be correct, but it's less obvious why because of all the > loops, breaks, and NULL anchor prog. The idea here is to find the very > first pl for BPF_F_BEFORE or the very last for BPF_F_AFTER, right? So > wouldn't this be more obviously correct: > > hlist_for_each_entry(pltmp, progs, node) { > if (flags & BPF_F_BEFORE) { > *ppltmp = pltmp; > return NULL; > } > *ppltmp = pltmp; > } > return NULL; > > > I.e., once you know the result, just return as early as possible and > don't require tracing through the rest of code just to eventually > return all the same (but now somewhat disguised) values. > > > Though see my point about anchor_prog below, which will simplify this > to just `return pltmp;` Indeed, returning pltmp sounds a better idea. I will do that. > > > I'd also add a comment that if there is no anchor_prog, then > BPF_F_PREORDER doesn't matter because we either prepend or append to a > combined list of progs and end up with correct result Okay, will add such comments. > >> + } >> + } else { >> + hlist_for_each_entry(pltmp, progs, node) { >> + pltmp_prog = pltmp->link ? pltmp->link->link.prog : pltmp->prog; >> + if (pltmp_prog != anchor_prog) >> + continue; >> + if (!!(pltmp->flags & BPF_F_PREORDER) != preorder) >> + goto out; >> + *ppltmp = pltmp; >> + break; >> + } >> + if (!*ppltmp) { >> + ret = -ENOENT; >> + goto out; >> + } >> + } >> + >> + return anchor_prog; >> + >> +out: >> + bpf_prog_put(anchor_prog); >> + return ERR_PTR(ret); >> +} >> + >> +static int insert_pl_to_hlist(struct bpf_prog_list *pl, struct hlist_head *progs, >> + struct bpf_prog *prog, u32 flags, u32 id_or_fd) >> +{ >> + struct bpf_prog_list *pltmp = NULL; >> + struct bpf_prog *anchor_prog; >> + >> + /* flags cannot have both BPF_F_BEFORE and BPF_F_AFTER */ >> + if ((flags & BPF_F_BEFORE) && (flags & BPF_F_AFTER)) >> + return -EINVAL; > I think this should be handled by get_anchor_prog(), both BPF_F_AFTER > and BPF_F_BEFORE will just result in no valid anchor program and we'll > error out below Yes. I will move the above flags checking to get_anchor_prog(). > >> + >> + anchor_prog = get_anchor_prog(progs, prog, flags, id_or_fd, &pltmp); >> + if (IS_ERR(anchor_prog)) >> + return PTR_ERR(anchor_prog); > it's confusing that we return anchor_prog but actually never use it, > no? wouldn't it make more sense to just return struct bpf_prog_list * > for an anchor then? Totally agree. Will return 'struct bpf_prog_list *'. > >> + >> + if (hlist_empty(progs)) >> + hlist_add_head(&pl->node, progs); >> + else if (flags & BPF_F_BEFORE) >> + hlist_add_before(&pl->node, &pltmp->node); >> + else >> + hlist_add_behind(&pl->node, &pltmp->node); >> + >> + return 0; >> +} >> + >> /** >> * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and >> * propagate the change to descendants >> @@ -633,6 +710,8 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs, >> * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set >> * @type: Type of attach operation >> * @flags: Option flags >> + * @id_or_fd: Relative prog id or fd >> + * @revision: bpf_prog_list revision >> * >> * Exactly one of @prog or @link can be non-null. >> * Must be called with cgroup_mutex held. >> @@ -640,7 +719,8 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs, >> static int __cgroup_bpf_attach(struct cgroup *cgrp, >> struct bpf_prog *prog, struct bpf_prog *replace_prog, >> struct bpf_cgroup_link *link, >> - enum bpf_attach_type type, u32 flags) >> + enum bpf_attach_type type, u32 flags, u32 id_or_fd, >> + u64 revision) >> { >> u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)); >> struct bpf_prog *old_prog = NULL; >> @@ -656,6 +736,9 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp, >> ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI))) >> /* invalid combination */ >> return -EINVAL; >> + if ((flags & BPF_F_REPLACE) && (flags & (BPF_F_BEFORE | BPF_F_AFTER))) >> + /* only either replace or insertion with before/after */ >> + return -EINVAL; >> if (link && (prog || replace_prog)) >> /* only either link or prog/replace_prog can be specified */ >> return -EINVAL; >> @@ -663,9 +746,12 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp, >> /* replace_prog implies BPF_F_REPLACE, and vice versa */ >> return -EINVAL; >> >> + > nit: unnecessary empty line? Ack > >> atype = bpf_cgroup_atype_find(type, new_prog->aux->attach_btf_id); >> if (atype < 0) >> return -EINVAL; >> + if (revision && revision != cgrp->bpf.revisions[atype]) >> + return -ESTALE; >> >> progs = &cgrp->bpf.progs[atype]; >> > [...] > >> @@ -1312,7 +1409,8 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) >> struct cgroup *cgrp; >> int err; >> >> - if (attr->link_create.flags) >> + if (attr->link_create.flags && >> + (attr->link_create.flags & (~(BPF_F_ID | BPF_F_BEFORE | BPF_F_AFTER | BPF_F_PREORDER)))) > why the `attr->link_create.flags &&` check, seems unnecessary > > > also looking at BPF_F_ATTACH_MASK_MPROG, not allowing BPF_F_REPLACE > makes sense, but BPF_F_LINK makes sense for ordering, no? I didn't add BPF_F_LINK as my current implementation didn't support it. But as you mentioned, mprog API does support it (to find anchor prog). I will add support in the next revision. > >> return -EINVAL; >> >> cgrp = cgroup_get_from_fd(attr->link_create.target_fd); >> @@ -1336,7 +1434,9 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) >> } >> >> err = cgroup_bpf_attach(cgrp, NULL, NULL, link, >> - link->type, BPF_F_ALLOW_MULTI); >> + link->type, BPF_F_ALLOW_MULTI | attr->link_create.flags, >> + attr->link_create.cgroup.relative_fd, >> + attr->link_create.cgroup.expected_revision); >> if (err) { >> bpf_link_cleanup(&link_primer); >> goto out_put_cgroup; >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index df33d19c5c3b..58ea3c38eabb 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -4184,6 +4184,25 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, >> } >> } >> >> +static bool is_cgroup_prog_type(enum bpf_prog_type ptype, enum bpf_attach_type atype, >> + bool check_atype) >> +{ >> + switch (ptype) { >> + case BPF_PROG_TYPE_CGROUP_DEVICE: >> + case BPF_PROG_TYPE_CGROUP_SKB: >> + case BPF_PROG_TYPE_CGROUP_SOCK: >> + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: >> + case BPF_PROG_TYPE_CGROUP_SOCKOPT: >> + case BPF_PROG_TYPE_CGROUP_SYSCTL: >> + case BPF_PROG_TYPE_SOCK_OPS: >> + return true; >> + case BPF_PROG_TYPE_LSM: >> + return check_atype ? atype == BPF_LSM_CGROUP : true; >> + default: >> + return false; >> + } >> +} >> + >> #define BPF_PROG_ATTACH_LAST_FIELD expected_revision >> >> #define BPF_F_ATTACH_MASK_BASE \ >> @@ -4214,6 +4233,9 @@ static int bpf_prog_attach(const union bpf_attr *attr) >> if (bpf_mprog_supported(ptype)) { >> if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG) >> return -EINVAL; >> + } else if (is_cgroup_prog_type(ptype, 0, false)) { >> + if (attr->attach_flags & BPF_F_LINK) >> + return -EINVAL; > Why disable BPF_F_LINK? It's just a matter of using FD/ID for link vs > program to specify the place to attach. It doesn't mean that we need > to attach through BPF link interface. Or am I misremembering? Again, I didn't implement it. Will add support in the next revision. > >> } else { >> if (attr->attach_flags & ~BPF_F_ATTACH_MASK_BASE) >> return -EINVAL; >> @@ -4242,20 +4264,6 @@ static int bpf_prog_attach(const union bpf_attr *attr) >> case BPF_PROG_TYPE_FLOW_DISSECTOR: >> ret = netns_bpf_prog_attach(attr, prog); >> break; >> - case BPF_PROG_TYPE_CGROUP_DEVICE: >> - case BPF_PROG_TYPE_CGROUP_SKB: >> - case BPF_PROG_TYPE_CGROUP_SOCK: >> - case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: >> - case BPF_PROG_TYPE_CGROUP_SOCKOPT: >> - case BPF_PROG_TYPE_CGROUP_SYSCTL: >> - case BPF_PROG_TYPE_SOCK_OPS: >> - case BPF_PROG_TYPE_LSM: >> - if (ptype == BPF_PROG_TYPE_LSM && >> - prog->expected_attach_type != BPF_LSM_CGROUP) >> - ret = -EINVAL; >> - else >> - ret = cgroup_bpf_prog_attach(attr, ptype, prog); >> - break; >> case BPF_PROG_TYPE_SCHED_CLS: >> if (attr->attach_type == BPF_TCX_INGRESS || >> attr->attach_type == BPF_TCX_EGRESS) >> @@ -4264,7 +4272,10 @@ static int bpf_prog_attach(const union bpf_attr *attr) >> ret = netkit_prog_attach(attr, prog); >> break; >> default: >> - ret = -EINVAL; >> + if (!is_cgroup_prog_type(ptype, prog->expected_attach_type, true)) >> + ret = -EINVAL; >> + else >> + ret = cgroup_bpf_prog_attach(attr, ptype, prog); >> } >> >> if (ret) > [...]