BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yonghong.song@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering
Date: Tue, 11 Feb 2025 21:35:33 -0800	[thread overview]
Message-ID: <e5b049d9-630d-44cf-bb7b-180faf5080f8@linux.dev> (raw)
In-Reply-To: <CAEf4BzY2F33FT2pDO8Zy1_zuQJVbwSS4OoMbBsEcyBVDTaKSeg@mail.gmail.com>



On 2/11/25 2:57 PM, Andrii Nakryiko wrote:
> On Thu, Feb 6, 2025 at 3:00 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> Currently for bpf progs in a cgroup hierarchy, the effective prog array
>> is computed from bottom cgroup to upper cgroups. For example, the following
>> cgroup hierarchy
>>      root cgroup: p1, p2
>>          subcgroup: p3, p4
>> have BPF_F_ALLOW_MULTI for both cgroup levels.
>> The effective cgroup array ordering looks like
>>      p3 p4 p1 p2
>> and at run time, the progs will execute based on that order.
>>
>> But in some cases, it is desirable to have root prog executes earlier than
>> children progs. For example,
>>    - prog p1 intends to collect original pkt dest addresses.
>>    - prog p3 will modify original pkt dest addresses to a proxy address for
>>      security reason.
>> The end result is that prog p1 gets proxy address which is not what it
>> wants. Also, putting p1 to every child cgroup is not desirable either as it
>> will duplicate itself in many child cgroups. And this is exactly a use case
>> we are encountering in Meta.
>>
>> To fix this issue, let us introduce a flag BPF_F_PRIO_TOPDOWN. If the flag
>> is specified at attachment time, the prog has higher priority and the
>> ordering with that flag will be from top to bottom. For example, in the
>> above example,
>>      root cgroup: p1, p2
>>          subcgroup: p3, p4
>> Let us say p1, p2 and p4 are marked with BPF_F_PRIO_TOPDOWN. The final
> I'm not a big fan of PRIO_TOPDOWN naming, and this example just
> provides further argument for why. Between p3 and p4 programs in
> subcgroup, there is no notion of TOPDOWN, they are at the same level
> of the hierarchy.
>
> In graphs, for DFS, PRIO_TOPDOWN semantics corresponds to pre-order vs
> (current and default) post-order. So why not something like
> BPF_F_PREORDER or some variation on that?

BPF_F_PREORDER sounds good to me.

>
> Also, for your example if would be nicer if p1 and p3 were the default
> post-order attachment, while p2 and p4 were pre-order. Then you'd have
> p2, p4, p3, p1, where everything is swapped relative to original
> ordering ;)

Yes, will adjust the test for such scenario!

>
>> effective array ordering will be
>>      p1 p2 p4 p3
>>
>> Suggested-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   include/linux/bpf-cgroup.h     |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/cgroup.c            | 37 +++++++++++++++++++++++++++++++---
>>   kernel/bpf/syscall.c           |  3 ++-
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   5 files changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 7fc69083e745..3d4f221df9ef 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -111,6 +111,7 @@ struct bpf_prog_list {
>>          struct bpf_prog *prog;
>>          struct bpf_cgroup_link *link;
>>          struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>> +       bool is_prio_topdown;
> let's go with `int flags`, we increase the size of struct
> bpf_prog_list by 8 bytes anyways, so let's make this a bit more
> generic?

Ack.

>
>>   };
>>
>>   int cgroup_bpf_inherit(struct cgroup *cgrp);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index fff6cdb8d11a..7ae8e8751e78 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1207,6 +1207,7 @@ enum bpf_perf_event_type {
>>   #define BPF_F_BEFORE           (1U << 3)
>>   #define BPF_F_AFTER            (1U << 4)
>>   #define BPF_F_ID               (1U << 5)
>> +#define BPF_F_PRIO_TOPDOWN     (1U << 6)
>>   #define BPF_F_LINK             BPF_F_LINK /* 1 << 13 */
>>
>>   /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 46e5db65dbc8..f31250c6025b 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -382,6 +382,21 @@ static u32 prog_list_length(struct hlist_head *head)
>>          return cnt;
>>   }
>>
>> +static u32 prog_list_length_with_topdown_cnt(struct hlist_head *head, int *topdown_cnt)
> instead of duplicating prog_list_length(), let's add this `int
> *topdown_cnt` counter as an optional argument, which prog_list_length
> will fill out only if it's provided, i.e., you'll just have:
>
> if (topdown_cnt && pl->is_prio_topdown)
>     (*topdown_cnt) += 1;
>
> as one extra condition inside the loop?

I thought about this as well. I tried to create a new function since
prog_list_length() is used in several different places. This
is not critical path, so yes, I can just add one addititional parameter
for prog_list_length() as you suggested.

>
>> +{
>> +       struct bpf_prog_list *pl;
>> +       u32 cnt = 0;
>> +
>> +       hlist_for_each_entry(pl, head, node) {
>> +               if (!prog_list_prog(pl))
>> +                       continue;
>> +               cnt++;
>> +               if (pl->is_prio_topdown)
>> +                       (*topdown_cnt) += 1;
>> +       }
>> +       return cnt;
>> +}
>> +


      reply	other threads:[~2025-02-12  5:35 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-06 22:59 [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering Yonghong Song
2025-02-06 23:00 ` [PATCH bpf-next 2/2] selftests/bpf: Add selftests allowing cgroup prog top-down ordering Yonghong Song
2025-02-11 22:57 ` [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering Andrii Nakryiko
2025-02-12  5:35   ` Yonghong Song [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e5b049d9-630d-44cf-bb7b-180faf5080f8@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox