From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 09B338C1E for ; Thu, 8 May 2025 04:42:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1746679374; cv=none; b=ANn2KUHh7VfXz/kF0S0xsBAziP3XRtEn878QDPamFjwilV7EkCE9TGiF+YEkiQvt5G0qCjE7Us4/38SNO39laI9OxDZdkEPvRpOkobdDssvXY5CH9QBzOUlZrRmVKnQ0H40AUlvVlOTgDHTyW2yHvD04U7+44ixuVRLRFNucxdw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1746679374; c=relaxed/simple; bh=LdpKIEayUECPg/zpxTyw4/8POzOA+ehpI1roFzTCkzw=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=i6JMGfy89pSndobx2o5Ksp7BL0cebPDiheSIQQrKoJ4kO1psRoZdeYrZOAfkOdsOpbq+Fo8pb4CTjIFl8LoNon2RsPGuWrfjOQg8b6PPoqASNIuCtmPEI7qGk+RnNVr/gIhzKc6Y1PC1yZR2lgmCU1DHyOdwkL9//6UJ/TFP/dQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=VkGVnSQJ; arc=none smtp.client-ip=91.218.175.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="VkGVnSQJ" Message-ID: <01f9b38e-d14d-4e35-93aa-d9722a565474@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1746679369; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=GqbKz26Owt6XjYVd8551lfvho9O5vyigB31ECbUz8lw=; b=VkGVnSQJD+kPG3xvCzzbXKBTFKbyH4csqq+v5SyLae06pVvM+Vsh5HnT7rtIuR0zuVBUzr pho4mmHvZdi/W5M6Ju9lKYDEZZGu9+BQ7Qs/9FaSnQm0AWwbeHMerVL4X+on5HBkoiRiLC +GV9rf8zC5qX4WwYBdqbppwtTVo2fVk= Date: Wed, 7 May 2025 21:42:40 -0700 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [RFC PATCH bpf-next 2/4] bpf: Implement mprog API on top of existing cgroup progs Content-Language: en-GB To: Andrii Nakryiko Cc: bpf@vger.kernel.org, Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau References: <20250411011523.1838771-1-yonghong.song@linux.dev> <20250411011533.1839631-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT On 4/23/25 7:19 AM, Andrii Nakryiko wrote: > On Thu, Apr 10, 2025 at 6:15 PM Yonghong Song wrote: >> Current cgroup prog ordering is appending at attachment time. This is not >> ideal. In some cases, users want specific ordering at a particular cgroup >> level. To address this, the existing mprog API seems an ideal solution with >> supporting BPF_F_BEFORE and BPF_F_AFTER flags. >> >> But there are a few obstacles to directly use kernel mprog interface. >> Currently cgroup bpf progs already support prog attach/detach/replace >> and link-based attach/detach/replace. For example, in struct >> bpf_prog_array_item, the cgroup_storage field needs to be together >> with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog >> as the member, which makes it difficult to use kernel mprog interface. >> >> In another case, the current cgroup prog detach tries to use the >> same flag as in attach. This is different from mprog kernel interface >> which uses flags passed from user space. >> >> So to avoid modifying existing behavior, I made the following changes to >> support mprog API for cgroup progs: >> - The support is for prog list at cgroup level. Cross-level prog list >> (a.k.a. effective prog list) is not supported. >> - Previously, BPF_F_PREORDER is supported only for prog attach, now >> BPF_F_PREORDER is also supported by link-based attach. >> - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID is supported similar to >> kernel mprog but with different implementation. >> - For detach and replace, use the existing implementation. >> - For attach, detach and replace, the revision for a particular prog >> list, associated with a particular attach type, will be updated >> by increasing count by 1. >> >> Signed-off-by: Yonghong Song >> --- >> include/uapi/linux/bpf.h | 7 ++ >> kernel/bpf/cgroup.c | 151 ++++++++++++++++++++++++++++----- >> kernel/bpf/syscall.c | 58 ++++++++----- >> tools/include/uapi/linux/bpf.h | 7 ++ >> 4 files changed, 181 insertions(+), 42 deletions(-) >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 71d5ac83cf5d..a5c7992e8f7c 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -1794,6 +1794,13 @@ union bpf_attr { >> }; >> __u64 expected_revision; >> } netkit; >> + struct { >> + union { >> + __u32 relative_fd; >> + __u32 relative_id; >> + }; >> + __u64 expected_revision; >> + } cgroup; >> }; >> } link_create; >> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index 84f58f3d028a..ffd455051131 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -624,6 +624,90 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs, >> return NULL; >> } >> >> +static struct bpf_prog *get_cmp_prog(struct hlist_head *progs, struct bpf_prog *prog, >> + u32 flags, u32 id_or_fd, struct bpf_prog_list **ppltmp) >> +{ >> + struct bpf_prog *cmp_prog = NULL, *pltmp_prog; >> + bool preorder = !!(flags & BPF_F_PREORDER); > nit: !!() pattern is not necessary when assigning to bool and is just > a visual and cognitive noise Okay, will remove !! operators. > >> + struct bpf_prog_list *pltmp; >> + bool id = flags & BPF_F_ID; >> + bool found; >> + >> + if (id || id_or_fd) { > let's invert the condition and exit early? also I find "id" as a bool > very confusing, I think it's fine to just open-coded it in two places > you actually check for BPF_F_ID flag. Open-codedflags & BPF_F_ID is okay. > But also, isn't this `if (id)` part redundant? Would just `if > (id_or_fd)` be enough? Yes, id_or_fd should be enough. > > >> + /* flags must have BPF_F_BEFORE or BPF_F_AFTER */ >> + if (!(flags & (BPF_F_BEFORE | BPF_F_AFTER))) >> + return ERR_PTR(-EINVAL); >> + >> + if (id) >> + cmp_prog = bpf_prog_by_id(id_or_fd); >> + else >> + cmp_prog = bpf_prog_get(id_or_fd); > how about we use "anchor" terminology here? So this would be "anchor > program" or anchor_prog? anchor_prog sounds good. > >> + if (IS_ERR(cmp_prog)) >> + return cmp_prog; >> + if (cmp_prog->type != prog->type) > bpf_prog_put? Yes. Missed bpf_prog_put. Will fix. > > pw-bot: cr > >> + return ERR_PTR(-EINVAL); >> + >> + found = false; >> + hlist_for_each_entry(pltmp, progs, node) { >> + pltmp_prog = pltmp->link ? pltmp->link->link.prog : pltmp->prog; >> + if (pltmp_prog == cmp_prog) { > try keeping nesting minimal: > > if (pltmp_prog != cmp_prog) > continue; Good point. Will do. > >> + if (!!(pltmp->flags & BPF_F_PREORDER) != preorder) >> + return ERR_PTR(-EINVAL); >> + found = true; >> + *ppltmp = pltmp; > we don't need found flag if we set ppltmp to NULL before loop, and to > non-NULL if we find the match Ack. > >> + break; >> + } >> + } >> + if (!found) > bpf_prog_put(cmp_prog) Again, will do bpf_prog_put before turn error. > >> + return ERR_PTR(-ENOENT); >> + } >> + >> + return cmp_prog; >> +} >> + >> +static int insert_pl_to_hlist(struct bpf_prog_list *pl, struct hlist_head *progs, >> + struct bpf_prog *prog, u32 flags, u32 id_or_fd) >> +{ >> + struct hlist_node *last, *last_node = NULL; >> + struct bpf_prog_list *pltmp = NULL; >> + struct bpf_prog *cmp_prog; >> + >> + /* flags cannot have both BPF_F_BEFORE and BPF_F_AFTER */ >> + if ((flags & BPF_F_BEFORE) && (flags & BPF_F_AFTER)) >> + return -EINVAL; >> + >> + cmp_prog = get_cmp_prog(progs, prog, flags, id_or_fd, &pltmp); > why get_cmp_prog can't return this last_node if we have BPF_F_AFTER > with no id/fd specified? Then you wouldn't have to special-case > appending (same for prepending, actually)? We could do this. Let me try this. > >> + if (IS_ERR(cmp_prog)) >> + return PTR_ERR(cmp_prog); >> + >> + if (hlist_empty(progs)) { >> + hlist_add_head(&pl->node, progs); >> + } else { >> + hlist_for_each(last, progs) { >> + if (last->next) >> + continue; >> + last_node = last; >> + break; >> + } >> + >> + if (!cmp_prog) { >> + if (flags & BPF_F_BEFORE) >> + hlist_add_head(&pl->node, progs); >> + else >> + hlist_add_behind(&pl->node, last_node); >> + } else { >> + if (flags & BPF_F_BEFORE) >> + hlist_add_before(&pl->node, &pltmp->node); >> + else if (flags & BPF_F_AFTER) >> + hlist_add_behind(&pl->node, &pltmp->node); >> + else >> + hlist_add_behind(&pl->node, last_node); >> + } >> + } >> + >> + return 0; >> +} >> + >> /** >> * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and >> * propagate the change to descendants >> @@ -633,6 +717,8 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs, >> * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set >> * @type: Type of attach operation >> * @flags: Option flags >> + * @id_or_fd: Relative prog id or fd >> + * @revision: bpf_prog_list revision >> * >> * Exactly one of @prog or @link can be non-null. >> * Must be called with cgroup_mutex held. >> @@ -640,7 +726,8 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs, >> static int __cgroup_bpf_attach(struct cgroup *cgrp, >> struct bpf_prog *prog, struct bpf_prog *replace_prog, >> struct bpf_cgroup_link *link, >> - enum bpf_attach_type type, u32 flags) >> + enum bpf_attach_type type, u32 flags, u32 id_or_fd, >> + u64 revision) >> { >> u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)); >> struct bpf_prog *old_prog = NULL; >> @@ -656,6 +743,9 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp, >> ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI))) >> /* invalid combination */ >> return -EINVAL; >> + if ((flags & BPF_F_REPLACE) && (flags & (BPF_F_BEFORE | BPF_F_AFTER))) >> + /* only either replace or insertion with before/after */ >> + return -EINVAL; >> if (link && (prog || replace_prog)) >> /* only either link or prog/replace_prog can be specified */ >> return -EINVAL; >> @@ -663,9 +753,12 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp, >> /* replace_prog implies BPF_F_REPLACE, and vice versa */ >> return -EINVAL; >> >> + >> atype = bpf_cgroup_atype_find(type, new_prog->aux->attach_btf_id); >> if (atype < 0) >> return -EINVAL; >> + if (revision && revision != atomic64_read(&cgrp->bpf.revisions[atype])) > this is happening under lock, no need for atomic operations Ack > >> + return -ESTALE; >> >> progs = &cgrp->bpf.progs[atype]; >> > [...] > >> @@ -1063,6 +1161,7 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, >> struct bpf_prog_array *effective; >> int cnt, ret = 0, i; >> int total_cnt = 0; >> + u64 revision = 0; >> u32 flags; >> >> if (effective_query && prog_attach_flags) >> @@ -1100,6 +1199,10 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, >> return -EFAULT; >> if (copy_to_user(&uattr->query.prog_cnt, &total_cnt, sizeof(total_cnt))) >> return -EFAULT; >> + if (!effective_query && from_atype == to_atype) >> + revision = atomic64_read(&cgrp->bpf.revisions[from_atype]); > even here we hold cgroup_mutex Ack > >> + if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision))) >> + return -EFAULT; >> if (attr->query.prog_cnt == 0 || !prog_ids || !total_cnt) >> /* return early if user requested only program count + flags */ >> return 0; > [...] > >> @@ -1336,7 +1441,9 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) >> } >> >> err = cgroup_bpf_attach(cgrp, NULL, NULL, link, >> - link->type, BPF_F_ALLOW_MULTI); >> + link->type, BPF_F_ALLOW_MULTI | attr->link_create.flags, >> + attr->link_create.cgroup.relative_fd, >> + attr->link_create.cgroup.expected_revision); >> if (err) { >> bpf_link_cleanup(&link_primer); >> goto out_put_cgroup; >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 9794446bc8c6..48cf855f949f 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -4183,6 +4183,23 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, >> } >> } >> >> +static bool bpf_cgroup_prog_attached(enum bpf_prog_type ptype) > I find this "attached" naming misleading. I think the name should call > out that we are just checking if program type is cgroup-attaching, so > maybe something like "is_cgroup_prog_type", or something along those > lines? > > > But for LSM we need to look at expected_attach_type to be > BPF_LSM_CGROUP, so maybe pass both prog type and expected attach type? Yes, let us pass both prog type and expected attach type with func name is_cgroup_prog_type(). > >> +{ >> + switch (ptype) { >> + case BPF_PROG_TYPE_CGROUP_DEVICE: >> + case BPF_PROG_TYPE_CGROUP_SKB: >> + case BPF_PROG_TYPE_CGROUP_SOCK: >> + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: >> + case BPF_PROG_TYPE_CGROUP_SOCKOPT: >> + case BPF_PROG_TYPE_CGROUP_SYSCTL: >> + case BPF_PROG_TYPE_SOCK_OPS: >> + case BPF_PROG_TYPE_LSM: >> + return true; >> + default: >> + return false; >> + } >> +} >> + > [...]