Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Matt Bobrowski <mattbobrowski@google.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	ohn Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Jiri Olsa <jolsa@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Chuyi Zhou <zhouchuyi@bytedance.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
Date: Tue, 27 Jan 2026 08:28:31 +0000	[thread overview]
Message-ID: <aXh3L7BukANG6Dve@google.com> (raw)
In-Reply-To: <CAADnVQJXs1NinaahP-uKOrqKUHEZ4PgwOAAVpWkHDWkapBZ2YA@mail.gmail.com>

On Mon, Jan 26, 2026 at 06:26:55PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 26, 2026 at 1:03 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Fri, Jan 23, 2026 at 09:17:03AM -0800, Alexei Starovoitov wrote:
> > > On Fri, Jan 23, 2026 at 3:06 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > >
> > > > On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> > > > > On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > > > >
> > > > > >                 break;
> > > > > > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > > > > > +               kit->pos = css_next_child(kit->pos, kit->start);
> > > > > > +               break;
> > > > >
> > > > > There are no users of css_next_child() outside of cgroup
> > > > > internals. It's a red flag for me.
> > > >
> > > > Hm, I can see the slight hesitation here. However, until somewhat
> > > > recently, the same argument could have been applied to functions like
> > > > css_next_descendant_post(), right?
> > > >
> > > > css_next_descendant_post() has rather recently started seeing usage in
> > > > other subsystems (block/, kernel/sched/, kernel/bpf), despite
> > > > remaining internal-only since its inception over a decade ago. Given
> > > > that css_next_descendant_post() continues to remain unexported, in all
> > > > fairness, I don't see css_next_child() being all that different in
> > > > this context.
> > > >
> > > > Would your stance change if Tejun agreed to mark css_next_child() as
> > > > exportable, or simply agreed to it being used from BPF cgroup
> > > > iterators?
> > > >
> > > > Both css_next_child() and css_for_each_child() have remained virtually
> > > > unchanged - and therefore inherently stable - since their
> > > > introduction. I should also mention that functions like
> > > > css_next_child() and css_next_descendant_post() were originally marked
> > > > as exported, and only later unmarked. This presumably was done as part
> > > > of a symbol cleanup, not because the functions were unstable
> > > > primitives.
> > > >
> > > > > I feel there is something wrong with the use case. It should have
> > > > > been handled by the primitives we have.
> > > >
> > > > Right, arguably it could be handled with the current set of
> > > > primitives, but it isn't nice and certainly not as efficient as I'd
> > > > like.
> > >
> > > Why is it? Just descdant_pre and break out of the loop if not immediate.
> >
> > If I use a break when I hit a grandchild in pre-order traversal, I
> > will still terminate the loop early and miss all the subsequent
> > immediate children?
> >
> > I do understand that using css_next_descendant_post() (without a
> > break) could allow me to logically filter for children in the
> > loop. However, my concern is purely about algorithmic complexity and
> > scalability.
> >
> > css_next_descendant_post() performs a full subtree walk
> > (O(Descendants)). If I have a top-level cgroup with only a few
> > children but a massive hierarchy underneath (hundreds/thousands of
> > grandchildren), using the descendant based iterator requires me to
> > literally visit every single node in that subtree.
> >
> > In contrast, iterating immediate children is just a linked-list walk
> > (O(Children)). For deep hierarchies, the performance difference is not
> > constant - it's orders of magnitude.
> >
> > For example, if I have 5 children under a given root/parent with each
> > of those 5 children having 500 nested descedants each:
> > - A pre-order traversal would lead me to visiting 2505 nodes.
> > - A linear traversal would lead me to visting 5 nodes.
> >
> > > The thought process should be to use what's available as much as possible.
> >
> > Well, I get that and it was my initial idea. But, I quickly realized
> > that we have a gap in the current iterators capabilities, and this
> > patch is meant to address this gap.
> >
> > > Adding UAPI for specific use cases isn't great.
> > > If it was a new kfunc that's one thing, but here you're touching uapi/bpf.h
> > > So acceptance bar is much much higher.
> >
> > OK, but now that Tejun confirmed [0] that he's fine with exposing
> > css_next_child(), how does this weigh up?
> 
> ok. let's add it.
> Maybe shorten the name to
> BPF_CGROUP_ITER_CHILDREN
> "_ONLY" looks unnecessary.

Sounds good, thank you. Shortening to BPF_CGROUP_ITER_CHILDREN is OK
too. Will send it out as part of v2.

next prev parent reply	other threads:[~2026-01-27  8:28 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-21 13:54 [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option Matt Bobrowski
2026-01-21 13:54 ` [PATCH bpf-next 2/2] bpf/selftests: cover " Matt Bobrowski
2026-01-21 19:14 ` [PATCH bpf-next 1/2] bpf: add new " Song Liu
2026-01-22 12:31   ` Matt Bobrowski
2026-01-23  4:26 ` Alexei Starovoitov
2026-01-23 11:06   ` Matt Bobrowski
2026-01-23 17:17     ` Alexei Starovoitov
2026-01-26  9:03       ` Matt Bobrowski
2026-01-27  2:26         ` Alexei Starovoitov
2026-01-27  8:28           ` Matt Bobrowski [this message]
2026-01-23 18:50     ` Tejun Heo
2026-01-26  9:14       ` Matt Bobrowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aXh3L7BukANG6Dve@google.com \
    --to=mattbobrowski@google.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=tj@kernel.org \
    --cc=yonghong.song@linux.dev \
    --cc=zhouchuyi@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.