From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering
Date: Tue, 11 Feb 2025 21:35:33 -0800 [thread overview]
Message-ID: <e5b049d9-630d-44cf-bb7b-180faf5080f8@linux.dev> (raw)
In-Reply-To: <CAEf4BzY2F33FT2pDO8Zy1_zuQJVbwSS4OoMbBsEcyBVDTaKSeg@mail.gmail.com>
On 2/11/25 2:57 PM, Andrii Nakryiko wrote:
> On Thu, Feb 6, 2025 at 3:00 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> Currently for bpf progs in a cgroup hierarchy, the effective prog array
>> is computed from bottom cgroup to upper cgroups. For example, the following
>> cgroup hierarchy
>> root cgroup: p1, p2
>> subcgroup: p3, p4
>> have BPF_F_ALLOW_MULTI for both cgroup levels.
>> The effective cgroup array ordering looks like
>> p3 p4 p1 p2
>> and at run time, the progs will execute based on that order.
>>
>> But in some cases, it is desirable to have root prog executes earlier than
>> children progs. For example,
>> - prog p1 intends to collect original pkt dest addresses.
>> - prog p3 will modify original pkt dest addresses to a proxy address for
>> security reason.
>> The end result is that prog p1 gets proxy address which is not what it
>> wants. Also, putting p1 to every child cgroup is not desirable either as it
>> will duplicate itself in many child cgroups. And this is exactly a use case
>> we are encountering in Meta.
>>
>> To fix this issue, let us introduce a flag BPF_F_PRIO_TOPDOWN. If the flag
>> is specified at attachment time, the prog has higher priority and the
>> ordering with that flag will be from top to bottom. For example, in the
>> above example,
>> root cgroup: p1, p2
>> subcgroup: p3, p4
>> Let us say p1, p2 and p4 are marked with BPF_F_PRIO_TOPDOWN. The final
> I'm not a big fan of PRIO_TOPDOWN naming, and this example just
> provides further argument for why. Between p3 and p4 programs in
> subcgroup, there is no notion of TOPDOWN, they are at the same level
> of the hierarchy.
>
> In graphs, for DFS, PRIO_TOPDOWN semantics corresponds to pre-order vs
> (current and default) post-order. So why not something like
> BPF_F_PREORDER or some variation on that?
BPF_F_PREORDER sounds good to me.
>
> Also, for your example if would be nicer if p1 and p3 were the default
> post-order attachment, while p2 and p4 were pre-order. Then you'd have
> p2, p4, p3, p1, where everything is swapped relative to original
> ordering ;)
Yes, will adjust the test for such scenario!
>
>> effective array ordering will be
>> p1 p2 p4 p3
>>
>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf-cgroup.h | 1 +
>> include/uapi/linux/bpf.h | 1 +
>> kernel/bpf/cgroup.c | 37 +++++++++++++++++++++++++++++++---
>> kernel/bpf/syscall.c | 3 ++-
>> tools/include/uapi/linux/bpf.h | 1 +
>> 5 files changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 7fc69083e745..3d4f221df9ef 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -111,6 +111,7 @@ struct bpf_prog_list {
>> struct bpf_prog *prog;
>> struct bpf_cgroup_link *link;
>> struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>> + bool is_prio_topdown;
> let's go with `int flags`, we increase the size of struct
> bpf_prog_list by 8 bytes anyways, so let's make this a bit more
> generic?
Ack.
>
>> };
>>
>> int cgroup_bpf_inherit(struct cgroup *cgrp);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index fff6cdb8d11a..7ae8e8751e78 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1207,6 +1207,7 @@ enum bpf_perf_event_type {
>> #define BPF_F_BEFORE (1U << 3)
>> #define BPF_F_AFTER (1U << 4)
>> #define BPF_F_ID (1U << 5)
>> +#define BPF_F_PRIO_TOPDOWN (1U << 6)
>> #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>>
>> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 46e5db65dbc8..f31250c6025b 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -382,6 +382,21 @@ static u32 prog_list_length(struct hlist_head *head)
>> return cnt;
>> }
>>
>> +static u32 prog_list_length_with_topdown_cnt(struct hlist_head *head, int *topdown_cnt)
> instead of duplicating prog_list_length(), let's add this `int
> *topdown_cnt` counter as an optional argument, which prog_list_length
> will fill out only if it's provided, i.e., you'll just have:
>
> if (topdown_cnt && pl->is_prio_topdown)
> (*topdown_cnt) += 1;
>
> as one extra condition inside the loop?
I thought about this as well. I tried to create a new function since
prog_list_length() is used in several different places. This
is not critical path, so yes, I can just add one addititional parameter
for prog_list_length() as you suggested.
>
>> +{
>> + struct bpf_prog_list *pl;
>> + u32 cnt = 0;
>> +
>> + hlist_for_each_entry(pl, head, node) {
>> + if (!prog_list_prog(pl))
>> + continue;
>> + cnt++;
>> + if (pl->is_prio_topdown)
>> + (*topdown_cnt) += 1;
>> + }
>> + return cnt;
>> +}
>> +
prev parent reply other threads:[~2025-02-12 5:35 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-06 22:59 [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering Yonghong Song
2025-02-06 23:00 ` [PATCH bpf-next 2/2] selftests/bpf: Add selftests allowing cgroup prog top-down ordering Yonghong Song
2025-02-11 22:57 ` [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering Andrii Nakryiko
2025-02-12 5:35 ` Yonghong Song [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e5b049d9-630d-44cf-bb7b-180faf5080f8@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox