[PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN

public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
@ 2026-01-21 13:54 Matt Bobrowski
  2026-01-21 13:54 ` [PATCH bpf-next 2/2] bpf/selftests: cover " Matt Bobrowski
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-21 13:54 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo, Matt Bobrowski

Currently, the BPF cgroup iterator supports walking descendants in
either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
(BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
depth-first search (DFS) of the hierarchy. In scenarios where a BPF
program may need to inspect only the direct children of a given parent
cgroup, a full DFS is unnecessarily expensive.

This patch introduces a new BPF cgroup iterator control option,
BPF_CGROUP_ITER_CHILDREN_ONLY. This control option restricts the
traversal to the immediate children of a specified parent cgroup,
allowing for more targeted and efficient iteration, particularly when
exhaustive depth-first search (DFS) traversal is not required.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
---
 include/uapi/linux/bpf.h       | 16 ++++++++++++----
 kernel/bpf/cgroup_iter.c       | 26 +++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h | 16 ++++++++++++----
 3 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a2ade4be60f..eae8f9133df2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -115,10 +115,18 @@ struct bpf_cgroup_storage_key {
 
 enum bpf_cgroup_iter_order {
 	BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
-	BPF_CGROUP_ITER_SELF_ONLY,		/* process only a single object. */
-	BPF_CGROUP_ITER_DESCENDANTS_PRE,	/* walk descendants in pre-order. */
-	BPF_CGROUP_ITER_DESCENDANTS_POST,	/* walk descendants in post-order. */
-	BPF_CGROUP_ITER_ANCESTORS_UP,		/* walk ancestors upward. */
+	BPF_CGROUP_ITER_SELF_ONLY, 		/* process only a single object. */
+	BPF_CGROUP_ITER_DESCENDANTS_PRE, 	/* walk descendants in pre-order. */
+	BPF_CGROUP_ITER_DESCENDANTS_POST, 	/* walk descendants in post-order. */
+	BPF_CGROUP_ITER_ANCESTORS_UP, 		/* walk ancestors upward. */
+	/*
+	 * Walks the immediate children of the specified parent
+	 * cgroup_subsys_state. Unlike BPF_CGROUP_ITER_DESCENDANTS_PRE,
+	 * BPF_CGROUP_ITER_DESCENDANTS_POST, and BPF_CGROUP_ITER_ANCESTORS_UP
+	 * the iterator does not include the specified parent as one of the
+	 * returned iterator elements.
+	 */
+	BPF_CGROUP_ITER_CHILDREN_ONLY,
 };
 
 union bpf_iter_link_info {
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index f04a468cf6a7..bca95bbfecf0 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -8,12 +8,13 @@
 
 #include "../cgroup/cgroup-internal.h"  /* cgroup_mutex and cgroup_is_dead */
 
-/* cgroup_iter provides four modes of traversal to the cgroup hierarchy.
+/* cgroup_iter provides five modes of traversal to the cgroup hierarchy.
  *
  *  1. Walk the descendants of a cgroup in pre-order.
  *  2. Walk the descendants of a cgroup in post-order.
  *  3. Walk the ancestors of a cgroup.
  *  4. Show the given cgroup only.
+ *  5. Walk only the children of a given parent cgroup.
  *
  * For walking descendants, cgroup_iter can walk in either pre-order or
  * post-order. For walking ancestors, the iter walks up from a cgroup to
@@ -78,6 +79,8 @@ static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
 		return css_next_descendant_pre(NULL, p->start_css);
 	else if (p->order == BPF_CGROUP_ITER_DESCENDANTS_POST)
 		return css_next_descendant_post(NULL, p->start_css);
+	else if (p->order == BPF_CGROUP_ITER_CHILDREN_ONLY)
+		return css_next_child(NULL, p->start_css);
 	else /* BPF_CGROUP_ITER_SELF_ONLY and BPF_CGROUP_ITER_ANCESTORS_UP */
 		return p->start_css;
 }
@@ -113,6 +116,8 @@ static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		return css_next_descendant_post(curr, p->start_css);
 	else if (p->order == BPF_CGROUP_ITER_ANCESTORS_UP)
 		return curr->parent;
+	else if (p->order == BPF_CGROUP_ITER_CHILDREN_ONLY)
+		return css_next_child(curr, p->start_css);
 	else  /* BPF_CGROUP_ITER_SELF_ONLY */
 		return NULL;
 }
@@ -200,11 +205,16 @@ static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
 	int order = linfo->cgroup.order;
 	struct cgroup *cgrp;
 
-	if (order != BPF_CGROUP_ITER_DESCENDANTS_PRE &&
-	    order != BPF_CGROUP_ITER_DESCENDANTS_POST &&
-	    order != BPF_CGROUP_ITER_ANCESTORS_UP &&
-	    order != BPF_CGROUP_ITER_SELF_ONLY)
+	switch (order) {
+	case BPF_CGROUP_ITER_DESCENDANTS_PRE:
+	case BPF_CGROUP_ITER_DESCENDANTS_POST:
+	case BPF_CGROUP_ITER_ANCESTORS_UP:
+	case BPF_CGROUP_ITER_SELF_ONLY:
+	case BPF_CGROUP_ITER_CHILDREN_ONLY:
+		break;
+	default:
 		return -EINVAL;
+	}
 
 	if (fd && id)
 		return -EINVAL;
@@ -257,6 +267,8 @@ static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
 		seq_puts(seq, "order: descendants_post\n");
 	else if (aux->cgroup.order == BPF_CGROUP_ITER_ANCESTORS_UP)
 		seq_puts(seq, "order: ancestors_up\n");
+	else if (aux->cgroup.order == BPF_CGROUP_ITER_CHILDREN_ONLY)
+		seq_puts(seq, "order: children_only\n");
 	else /* BPF_CGROUP_ITER_SELF_ONLY */
 		seq_puts(seq, "order: self_only\n");
 }
@@ -320,6 +332,7 @@ __bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it,
 	case BPF_CGROUP_ITER_DESCENDANTS_PRE:
 	case BPF_CGROUP_ITER_DESCENDANTS_POST:
 	case BPF_CGROUP_ITER_ANCESTORS_UP:
+	case BPF_CGROUP_ITER_CHILDREN_ONLY:
 		break;
 	default:
 		return -EINVAL;
@@ -345,6 +358,9 @@ __bpf_kfunc struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *i
 	case BPF_CGROUP_ITER_DESCENDANTS_POST:
 		kit->pos = css_next_descendant_post(kit->pos, kit->start);
 		break;
+	case BPF_CGROUP_ITER_CHILDREN_ONLY:
+		kit->pos = css_next_child(kit->pos, kit->start);
+		break;
 	case BPF_CGROUP_ITER_ANCESTORS_UP:
 		kit->pos = kit->pos ? kit->pos->parent : kit->start;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b816bc53d2e1..fa0686bbc638 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -115,10 +115,18 @@ struct bpf_cgroup_storage_key {
 
 enum bpf_cgroup_iter_order {
 	BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
-	BPF_CGROUP_ITER_SELF_ONLY,		/* process only a single object. */
-	BPF_CGROUP_ITER_DESCENDANTS_PRE,	/* walk descendants in pre-order. */
-	BPF_CGROUP_ITER_DESCENDANTS_POST,	/* walk descendants in post-order. */
-	BPF_CGROUP_ITER_ANCESTORS_UP,		/* walk ancestors upward. */
+	BPF_CGROUP_ITER_SELF_ONLY, 		/* process only a single object. */
+	BPF_CGROUP_ITER_DESCENDANTS_PRE, 	/* walk descendants in pre-order. */
+	BPF_CGROUP_ITER_DESCENDANTS_POST, 	/* walk descendants in post-order. */
+	BPF_CGROUP_ITER_ANCESTORS_UP, 		/* walk ancestors upward. */
+	/*
+	 * Walks the immediate children of the specified parent
+	 * cgroup_subsys_state. Unlike BPF_CGROUP_ITER_DESCENDANTS_PRE,
+	 * BPF_CGROUP_ITER_DESCENDANTS_POST, and BPF_CGROUP_ITER_ANCESTORS_UP
+	 * the iterator does not include the specified parent as one of the
+	 * returned iterator elements.
+	 */
+	BPF_CGROUP_ITER_CHILDREN_ONLY,
 };
 
 union bpf_iter_link_info {
-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next 2/2] bpf/selftests: cover BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-21 13:54 [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option Matt Bobrowski
@ 2026-01-21 13:54 ` Matt Bobrowski
  2026-01-21 19:14 ` [PATCH bpf-next 1/2] bpf: add new " Song Liu
  2026-01-23  4:26 ` Alexei Starovoitov
  2 siblings, 0 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-21 13:54 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo, Matt Bobrowski

Extend some of the existing CSS iterator selftests such that they
cover the newly introduced BPF_CGROUP_ITER_CHILDREN_ONLY iterator
control option.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
---
 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c | 12 ++++++++++++
 tools/testing/selftests/bpf/prog_tests/iters.c       |  8 +++++++-
 tools/testing/selftests/bpf/progs/iters_css.c        |  9 ++++++---
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
index 574d9a0cdc8e..07c85c0cc204 100644
--- a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
@@ -190,6 +190,16 @@ static void test_walk_self_only(struct cgroup_iter *skel)
 			      BPF_CGROUP_ITER_SELF_ONLY, "self_only");
 }
 
+static void test_walk_children_only(struct cgroup_iter *skel)
+{
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n%8llu\n" EPILOGUE, cg_id[CHILD1],
+		 cg_id[CHILD2]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_CGROUP_ITER_CHILDREN_ONLY, "children");
+}
+
 static void test_walk_dead_self_only(struct cgroup_iter *skel)
 {
 	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
@@ -325,6 +335,8 @@ void test_cgroup_iter(void)
 		test_walk_dead_self_only(skel);
 	if (test__start_subtest("cgroup_iter__self_only_css_task"))
 		test_walk_self_only_css_task();
+	if (test__start_subtest("cgroup_iter__children_only"))
+		test_walk_children_only(skel);
 
 out:
 	cgroup_iter__destroy(skel);
diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c
index 3cea71f9c500..a539980a2fbe 100644
--- a/tools/testing/selftests/bpf/prog_tests/iters.c
+++ b/tools/testing/selftests/bpf/prog_tests/iters.c
@@ -253,6 +253,11 @@ static void subtest_css_iters(void)
 		{ "/cg1/cg2" },
 		{ "/cg1/cg2/cg3" },
 		{ "/cg1/cg2/cg3/cg4" },
+		{ "/cg1/cg5" },
+		{ "/cg1/cg5/cg6" },
+		{ "/cg1/cg7" },
+		{ "/cg1/cg7/cg8" },
+		{ "/cg1/cg7/cg8/cg9" },
 	};
 	int err, cg_nr = ARRAY_SIZE(cgs);
 	int i;
@@ -284,7 +289,8 @@ static void subtest_css_iters(void)
 
 	ASSERT_EQ(skel->bss->post_order_cnt, cg_nr, "post_order_cnt");
 	ASSERT_EQ(skel->bss->last_cg_id, get_cgroup_id(cgs[0].path), "last_cg_id");
-	ASSERT_EQ(skel->bss->tree_high, cg_nr - 1, "tree_high");
+	ASSERT_EQ(skel->bss->children_cnt, 3, "children_cnt");
+	ASSERT_EQ(skel->bss->tree_high, 3, "tree_high");
 	iters_css__detach(skel);
 cleanup:
 	cleanup_cgroup_environment();
diff --git a/tools/testing/selftests/bpf/progs/iters_css.c b/tools/testing/selftests/bpf/progs/iters_css.c
index ec1f6c2f590b..c74d3075ccdc 100644
--- a/tools/testing/selftests/bpf/progs/iters_css.c
+++ b/tools/testing/selftests/bpf/progs/iters_css.c
@@ -12,8 +12,7 @@ char _license[] SEC("license") = "GPL";
 pid_t target_pid;
 u64 root_cg_id, leaf_cg_id;
 u64 first_cg_id, last_cg_id;
-
-int pre_order_cnt, post_order_cnt, tree_high;
+int pre_order_cnt, post_order_cnt, children_cnt, tree_high;
 
 struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
 void bpf_cgroup_release(struct cgroup *p) __ksym;
@@ -43,7 +42,7 @@ int iter_css_for_each(const void *ctx)
 	}
 	root_css = &root_cgrp->self;
 	leaf_css = &leaf_cgrp->self;
-	pre_order_cnt = post_order_cnt = tree_high = 0;
+	pre_order_cnt = post_order_cnt = children_cnt = tree_high = 0;
 	first_cg_id = last_cg_id = 0;
 
 	bpf_rcu_read_lock();
@@ -60,6 +59,10 @@ int iter_css_for_each(const void *ctx)
 			first_cg_id = cur_cgrp->kn->id;
 	}
 
+	bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_CHILDREN_ONLY) {
+		children_cnt++;
+	}
+
 	bpf_for_each(css, pos, leaf_css, BPF_CGROUP_ITER_ANCESTORS_UP)
 		tree_high++;
 
-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-21 13:54 [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option Matt Bobrowski
  2026-01-21 13:54 ` [PATCH bpf-next 2/2] bpf/selftests: cover " Matt Bobrowski
@ 2026-01-21 19:14 ` Song Liu
  2026-01-22 12:31   ` Matt Bobrowski
  2026-01-23  4:26 ` Alexei Starovoitov
  2 siblings, 1 reply; 12+ messages in thread
From: Song Liu @ 2026-01-21 19:14 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, ohn Fastabend,
	KP Singh, Stanislav Fomichev, Jiri Olsa, Roman Gushchin,
	Chuyi Zhou, Tejun Heo

On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> Currently, the BPF cgroup iterator supports walking descendants in
> either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
> (BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
> depth-first search (DFS) of the hierarchy. In scenarios where a BPF
> program may need to inspect only the direct children of a given parent
> cgroup, a full DFS is unnecessarily expensive.
>
> This patch introduces a new BPF cgroup iterator control option,
> BPF_CGROUP_ITER_CHILDREN_ONLY. This control option restricts the
> traversal to the immediate children of a specified parent cgroup,
> allowing for more targeted and efficient iteration, particularly when
> exhaustive depth-first search (DFS) traversal is not required.
>
> Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>

The code looks good to me.

Some nitpick and some high level questions/ideas below

>  enum bpf_cgroup_iter_order {
>         BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
> -       BPF_CGROUP_ITER_SELF_ONLY,              /* process only a single object. */
> -       BPF_CGROUP_ITER_DESCENDANTS_PRE,        /* walk descendants in pre-order. */
> -       BPF_CGROUP_ITER_DESCENDANTS_POST,       /* walk descendants in post-order. */
> -       BPF_CGROUP_ITER_ANCESTORS_UP,           /* walk ancestors upward. */
> +       BPF_CGROUP_ITER_SELF_ONLY,              /* process only a single object. */
> +       BPF_CGROUP_ITER_DESCENDANTS_PRE,        /* walk descendants in pre-order. */
> +       BPF_CGROUP_ITER_DESCENDANTS_POST,       /* walk descendants in post-order. */
> +       BPF_CGROUP_ITER_ANCESTORS_UP,           /* walk ancestors upward. */

Changes above seem unnecessary.

> +       /*
> +        * Walks the immediate children of the specified parent
> +        * cgroup_subsys_state. Unlike BPF_CGROUP_ITER_DESCENDANTS_PRE,
> +        * BPF_CGROUP_ITER_DESCENDANTS_POST, and BPF_CGROUP_ITER_ANCESTORS_UP
> +        * the iterator does not include the specified parent as one of the
> +        * returned iterator elements.
> +        */
> +       BPF_CGROUP_ITER_CHILDREN_ONLY,
>  };

[...]

> @@ -320,6 +332,7 @@ __bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it,
>         case BPF_CGROUP_ITER_DESCENDANTS_PRE:
>         case BPF_CGROUP_ITER_DESCENDANTS_POST:
>         case BPF_CGROUP_ITER_ANCESTORS_UP:
> +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
>                 break;
>         default:
>                 return -EINVAL;
> @@ -345,6 +358,9 @@ __bpf_kfunc struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *i
>         case BPF_CGROUP_ITER_DESCENDANTS_POST:
>                 kit->pos = css_next_descendant_post(kit->pos, kit->start);
>                 break;
> +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> +               kit->pos = css_next_child(kit->pos, kit->start);
> +               break;
>         case BPF_CGROUP_ITER_ANCESTORS_UP:
>                 kit->pos = kit->pos ? kit->pos->parent : kit->start;
>         }

I wonder whether we can use the return values and/or the kfuncs to
enable more flexible walks. For example, some return value means
"do not walk more children, go back to the parent". This will enable
use cases like "walk children and grandchildren nodes from here,
but not any deeper". WDYT?

Thanks,
Song

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-21 19:14 ` [PATCH bpf-next 1/2] bpf: add new " Song Liu
@ 2026-01-22 12:31   ` Matt Bobrowski
  0 siblings, 0 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-22 12:31 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, ohn Fastabend,
	KP Singh, Stanislav Fomichev, Jiri Olsa, Roman Gushchin,
	Chuyi Zhou, Tejun Heo

On Wed, Jan 21, 2026 at 11:14:10AM -0800, Song Liu wrote:
> On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > Currently, the BPF cgroup iterator supports walking descendants in
> > either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
> > (BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
> > depth-first search (DFS) of the hierarchy. In scenarios where a BPF
> > program may need to inspect only the direct children of a given parent
> > cgroup, a full DFS is unnecessarily expensive.
> >
> > This patch introduces a new BPF cgroup iterator control option,
> > BPF_CGROUP_ITER_CHILDREN_ONLY. This control option restricts the
> > traversal to the immediate children of a specified parent cgroup,
> > allowing for more targeted and efficient iteration, particularly when
> > exhaustive depth-first search (DFS) traversal is not required.
> >
> > Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
> 
> The code looks good to me.
>
> Some nitpick and some high level questions/ideas below

Sure, thank you for taking a look!

> >  enum bpf_cgroup_iter_order {
> >         BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
> > -       BPF_CGROUP_ITER_SELF_ONLY,              /* process only a single object. */
> > -       BPF_CGROUP_ITER_DESCENDANTS_PRE,        /* walk descendants in pre-order. */
> > -       BPF_CGROUP_ITER_DESCENDANTS_POST,       /* walk descendants in post-order. */
> > -       BPF_CGROUP_ITER_ANCESTORS_UP,           /* walk ancestors upward. */
> > +       BPF_CGROUP_ITER_SELF_ONLY,              /* process only a single object. */
> > +       BPF_CGROUP_ITER_DESCENDANTS_PRE,        /* walk descendants in pre-order. */
> > +       BPF_CGROUP_ITER_DESCENDANTS_POST,       /* walk descendants in post-order. */
> > +       BPF_CGROUP_ITER_ANCESTORS_UP,           /* walk ancestors upward. */
> 
> Changes above seem unnecessary.

This is just noise, sorry. Will revert this once I send out v2.

> > +       /*
> > +        * Walks the immediate children of the specified parent
> > +        * cgroup_subsys_state. Unlike BPF_CGROUP_ITER_DESCENDANTS_PRE,
> > +        * BPF_CGROUP_ITER_DESCENDANTS_POST, and BPF_CGROUP_ITER_ANCESTORS_UP
> > +        * the iterator does not include the specified parent as one of the
> > +        * returned iterator elements.
> > +        */
> > +       BPF_CGROUP_ITER_CHILDREN_ONLY,
> >  };
> 
> [...]
> 
> > @@ -320,6 +332,7 @@ __bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it,
> >         case BPF_CGROUP_ITER_DESCENDANTS_PRE:
> >         case BPF_CGROUP_ITER_DESCENDANTS_POST:
> >         case BPF_CGROUP_ITER_ANCESTORS_UP:
> > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> >                 break;
> >         default:
> >                 return -EINVAL;
> > @@ -345,6 +358,9 @@ __bpf_kfunc struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *i
> >         case BPF_CGROUP_ITER_DESCENDANTS_POST:
> >                 kit->pos = css_next_descendant_post(kit->pos, kit->start);
> >                 break;
> > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > +               kit->pos = css_next_child(kit->pos, kit->start);
> > +               break;
> >         case BPF_CGROUP_ITER_ANCESTORS_UP:
> >                 kit->pos = kit->pos ? kit->pos->parent : kit->start;
> >         }
> 
> I wonder whether we can use the return values and/or the kfuncs to
> enable more flexible walks. For example, some return value means
> "do not walk more children, go back to the parent". This will enable
> use cases like "walk children and grandchildren nodes from here,
> but not any deeper". WDYT?

Some general control over whether the DFS traversal should continue in
a specific direction could be nice. That way you could reverse out of
a specific subtree/branch early without necessarily having to walk all
nodes under a given subtree/branch.

With the above said however, it's not really a control/semantic which
I'm after at this point. I'm purely interested in exhaustively
iterating children of a given parent before selectively deciding which
subtree/branch, and therefore parent, I'd like to explore next.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-21 13:54 [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option Matt Bobrowski
  2026-01-21 13:54 ` [PATCH bpf-next 2/2] bpf/selftests: cover " Matt Bobrowski
  2026-01-21 19:14 ` [PATCH bpf-next 1/2] bpf: add new " Song Liu
@ 2026-01-23  4:26 ` Alexei Starovoitov
  2026-01-23 11:06   ` Matt Bobrowski
  2 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2026-01-23  4:26 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
>                 break;
> +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> +               kit->pos = css_next_child(kit->pos, kit->start);
> +               break;

There are no users of css_next_child() outside of cgroup internals.
It's a red flag for me. I feel there is something wrong with
the use case. It should have been handled by the primitives we have.

pw-bot: cr

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-23  4:26 ` Alexei Starovoitov
@ 2026-01-23 11:06   ` Matt Bobrowski
  2026-01-23 17:17     ` Alexei Starovoitov
  2026-01-23 18:50     ` Tejun Heo
  0 siblings, 2 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-23 11:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> >                 break;
> > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > +               kit->pos = css_next_child(kit->pos, kit->start);
> > +               break;
>
> There are no users of css_next_child() outside of cgroup
> internals. It's a red flag for me.

Hm, I can see the slight hesitation here. However, until somewhat
recently, the same argument could have been applied to functions like
css_next_descendant_post(), right?

css_next_descendant_post() has rather recently started seeing usage in
other subsystems (block/, kernel/sched/, kernel/bpf), despite
remaining internal-only since its inception over a decade ago. Given
that css_next_descendant_post() continues to remain unexported, in all
fairness, I don't see css_next_child() being all that different in
this context.

Would your stance change if Tejun agreed to mark css_next_child() as
exportable, or simply agreed to it being used from BPF cgroup
iterators?

Both css_next_child() and css_for_each_child() have remained virtually
unchanged - and therefore inherently stable - since their
introduction. I should also mention that functions like
css_next_child() and css_next_descendant_post() were originally marked
as exported, and only later unmarked. This presumably was done as part
of a symbol cleanup, not because the functions were unstable
primitives.

> I feel there is something wrong with the use case. It should have
> been handled by the primitives we have.

Right, arguably it could be handled with the current set of
primitives, but it isn't nice and certainly not as efficient as I'd
like.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-23 11:06   ` Matt Bobrowski
@ 2026-01-23 17:17     ` Alexei Starovoitov
  2026-01-26  9:03       ` Matt Bobrowski
  2026-01-23 18:50     ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2026-01-23 17:17 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Fri, Jan 23, 2026 at 3:06 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> > On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > >
> > >                 break;
> > > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > > +               kit->pos = css_next_child(kit->pos, kit->start);
> > > +               break;
> >
> > There are no users of css_next_child() outside of cgroup
> > internals. It's a red flag for me.
>
> Hm, I can see the slight hesitation here. However, until somewhat
> recently, the same argument could have been applied to functions like
> css_next_descendant_post(), right?
>
> css_next_descendant_post() has rather recently started seeing usage in
> other subsystems (block/, kernel/sched/, kernel/bpf), despite
> remaining internal-only since its inception over a decade ago. Given
> that css_next_descendant_post() continues to remain unexported, in all
> fairness, I don't see css_next_child() being all that different in
> this context.
>
> Would your stance change if Tejun agreed to mark css_next_child() as
> exportable, or simply agreed to it being used from BPF cgroup
> iterators?
>
> Both css_next_child() and css_for_each_child() have remained virtually
> unchanged - and therefore inherently stable - since their
> introduction. I should also mention that functions like
> css_next_child() and css_next_descendant_post() were originally marked
> as exported, and only later unmarked. This presumably was done as part
> of a symbol cleanup, not because the functions were unstable
> primitives.
>
> > I feel there is something wrong with the use case. It should have
> > been handled by the primitives we have.
>
> Right, arguably it could be handled with the current set of
> primitives, but it isn't nice and certainly not as efficient as I'd
> like.

Why is it? Just descdant_pre and break out of the loop if not immediate.
The thought process should be to use what's available as much as possible.
Adding UAPI for specific use cases isn't great.
If it was a new kfunc that's one thing, but here you're touching uapi/bpf.h
So acceptance bar is much much higher.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-23 11:06   ` Matt Bobrowski
  2026-01-23 17:17     ` Alexei Starovoitov
@ 2026-01-23 18:50     ` Tejun Heo
  2026-01-26  9:14       ` Matt Bobrowski
  1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2026-01-23 18:50 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, ohn Fastabend, KP Singh, Stanislav Fomichev,
	Jiri Olsa, Roman Gushchin, Chuyi Zhou

Hello,

On Fri, Jan 23, 2026 at 11:06:21AM +0000, Matt Bobrowski wrote:
> Would your stance change if Tejun agreed to mark css_next_child() as
> exportable, or simply agreed to it being used from BPF cgroup
> iterators?

FWIW, I don't think there's any issue to exposing css_next_child(). This is
a part of the fundamental structure of cgroups and I don't see it changing
in any foreseeable future. That said, I wonder whether it'd be more flexible
to expose css_rightmost_descendant() which allows skipping the current
subtree during pre-order traversals. However, this has higher complexity for
traversing just immediate children as it would have to skip each child
subtree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-23 17:17     ` Alexei Starovoitov
@ 2026-01-26  9:03       ` Matt Bobrowski
  2026-01-27  2:26         ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-26  9:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Fri, Jan 23, 2026 at 09:17:03AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 23, 2026 at 3:06 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> > > On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > >
> > > >                 break;
> > > > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > > > +               kit->pos = css_next_child(kit->pos, kit->start);
> > > > +               break;
> > >
> > > There are no users of css_next_child() outside of cgroup
> > > internals. It's a red flag for me.
> >
> > Hm, I can see the slight hesitation here. However, until somewhat
> > recently, the same argument could have been applied to functions like
> > css_next_descendant_post(), right?
> >
> > css_next_descendant_post() has rather recently started seeing usage in
> > other subsystems (block/, kernel/sched/, kernel/bpf), despite
> > remaining internal-only since its inception over a decade ago. Given
> > that css_next_descendant_post() continues to remain unexported, in all
> > fairness, I don't see css_next_child() being all that different in
> > this context.
> >
> > Would your stance change if Tejun agreed to mark css_next_child() as
> > exportable, or simply agreed to it being used from BPF cgroup
> > iterators?
> >
> > Both css_next_child() and css_for_each_child() have remained virtually
> > unchanged - and therefore inherently stable - since their
> > introduction. I should also mention that functions like
> > css_next_child() and css_next_descendant_post() were originally marked
> > as exported, and only later unmarked. This presumably was done as part
> > of a symbol cleanup, not because the functions were unstable
> > primitives.
> >
> > > I feel there is something wrong with the use case. It should have
> > > been handled by the primitives we have.
> >
> > Right, arguably it could be handled with the current set of
> > primitives, but it isn't nice and certainly not as efficient as I'd
> > like.
> 
> Why is it? Just descdant_pre and break out of the loop if not immediate.

If I use a break when I hit a grandchild in pre-order traversal, I
will still terminate the loop early and miss all the subsequent
immediate children?

I do understand that using css_next_descendant_post() (without a
break) could allow me to logically filter for children in the
loop. However, my concern is purely about algorithmic complexity and
scalability.

css_next_descendant_post() performs a full subtree walk
(O(Descendants)). If I have a top-level cgroup with only a few
children but a massive hierarchy underneath (hundreds/thousands of
grandchildren), using the descendant based iterator requires me to
literally visit every single node in that subtree.

In contrast, iterating immediate children is just a linked-list walk
(O(Children)). For deep hierarchies, the performance difference is not
constant - it's orders of magnitude.

For example, if I have 5 children under a given root/parent with each
of those 5 children having 500 nested descedants each:
- A pre-order traversal would lead me to visiting 2505 nodes.
- A linear traversal would lead me to visting 5 nodes.

> The thought process should be to use what's available as much as possible.

Well, I get that and it was my initial idea. But, I quickly realized
that we have a gap in the current iterators capabilities, and this
patch is meant to address this gap.

> Adding UAPI for specific use cases isn't great.
> If it was a new kfunc that's one thing, but here you're touching uapi/bpf.h
> So acceptance bar is much much higher.

OK, but now that Tejun confirmed [0] that he's fine with exposing
css_next_child(), how does this weigh up?

[0] https://lore.kernel.org/bpf/aXPC_KMX_rvO14lR@slm.duckdns.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-23 18:50     ` Tejun Heo
@ 2026-01-26  9:14       ` Matt Bobrowski
  0 siblings, 0 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-26  9:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, ohn Fastabend, KP Singh, Stanislav Fomichev,
	Jiri Olsa, Roman Gushchin, Chuyi Zhou

On Fri, Jan 23, 2026 at 08:50:36AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 23, 2026 at 11:06:21AM +0000, Matt Bobrowski wrote:
> > Would your stance change if Tejun agreed to mark css_next_child() as
> > exportable, or simply agreed to it being used from BPF cgroup
> > iterators?
> 
> FWIW, I don't think there's any issue to exposing css_next_child(). This is
> a part of the fundamental structure of cgroups and I don't see it changing
> in any foreseeable future.

Thanks for weighing in on this Tejun.

> That said, I wonder whether it'd be more flexible to expose
> css_rightmost_descendant() which allows skipping the current subtree
> during pre-order traversals. However, this has higher complexity for
> traversing just immediate children as it would have to skip each
> child subtree.

Yeah, css_rightmost_descendant() isn't what I'm immediately after at
this point. However, I also wouldn't argue against exposing it as
another possible control order available to the BPF cgroup iterators.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-26  9:03       ` Matt Bobrowski
@ 2026-01-27  2:26         ` Alexei Starovoitov
  2026-01-27  8:28           ` Matt Bobrowski
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2026-01-27  2:26 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Mon, Jan 26, 2026 at 1:03 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Fri, Jan 23, 2026 at 09:17:03AM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 23, 2026 at 3:06 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > >
> > > On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> > > > On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > > >
> > > > >                 break;
> > > > > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > > > > +               kit->pos = css_next_child(kit->pos, kit->start);
> > > > > +               break;
> > > >
> > > > There are no users of css_next_child() outside of cgroup
> > > > internals. It's a red flag for me.
> > >
> > > Hm, I can see the slight hesitation here. However, until somewhat
> > > recently, the same argument could have been applied to functions like
> > > css_next_descendant_post(), right?
> > >
> > > css_next_descendant_post() has rather recently started seeing usage in
> > > other subsystems (block/, kernel/sched/, kernel/bpf), despite
> > > remaining internal-only since its inception over a decade ago. Given
> > > that css_next_descendant_post() continues to remain unexported, in all
> > > fairness, I don't see css_next_child() being all that different in
> > > this context.
> > >
> > > Would your stance change if Tejun agreed to mark css_next_child() as
> > > exportable, or simply agreed to it being used from BPF cgroup
> > > iterators?
> > >
> > > Both css_next_child() and css_for_each_child() have remained virtually
> > > unchanged - and therefore inherently stable - since their
> > > introduction. I should also mention that functions like
> > > css_next_child() and css_next_descendant_post() were originally marked
> > > as exported, and only later unmarked. This presumably was done as part
> > > of a symbol cleanup, not because the functions were unstable
> > > primitives.
> > >
> > > > I feel there is something wrong with the use case. It should have
> > > > been handled by the primitives we have.
> > >
> > > Right, arguably it could be handled with the current set of
> > > primitives, but it isn't nice and certainly not as efficient as I'd
> > > like.
> >
> > Why is it? Just descdant_pre and break out of the loop if not immediate.
>
> If I use a break when I hit a grandchild in pre-order traversal, I
> will still terminate the loop early and miss all the subsequent
> immediate children?
>
> I do understand that using css_next_descendant_post() (without a
> break) could allow me to logically filter for children in the
> loop. However, my concern is purely about algorithmic complexity and
> scalability.
>
> css_next_descendant_post() performs a full subtree walk
> (O(Descendants)). If I have a top-level cgroup with only a few
> children but a massive hierarchy underneath (hundreds/thousands of
> grandchildren), using the descendant based iterator requires me to
> literally visit every single node in that subtree.
>
> In contrast, iterating immediate children is just a linked-list walk
> (O(Children)). For deep hierarchies, the performance difference is not
> constant - it's orders of magnitude.
>
> For example, if I have 5 children under a given root/parent with each
> of those 5 children having 500 nested descedants each:
> - A pre-order traversal would lead me to visiting 2505 nodes.
> - A linear traversal would lead me to visting 5 nodes.
>
> > The thought process should be to use what's available as much as possible.
>
> Well, I get that and it was my initial idea. But, I quickly realized
> that we have a gap in the current iterators capabilities, and this
> patch is meant to address this gap.
>
> > Adding UAPI for specific use cases isn't great.
> > If it was a new kfunc that's one thing, but here you're touching uapi/bpf.h
> > So acceptance bar is much much higher.
>
> OK, but now that Tejun confirmed [0] that he's fine with exposing
> css_next_child(), how does this weigh up?

ok. let's add it.
Maybe shorten the name to
BPF_CGROUP_ITER_CHILDREN
"_ONLY" looks unnecessary.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option
  2026-01-27  2:26         ` Alexei Starovoitov
@ 2026-01-27  8:28           ` Matt Bobrowski
  0 siblings, 0 replies; 12+ messages in thread
From: Matt Bobrowski @ 2026-01-27  8:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	ohn Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Chuyi Zhou, Tejun Heo

On Mon, Jan 26, 2026 at 06:26:55PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 26, 2026 at 1:03 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Fri, Jan 23, 2026 at 09:17:03AM -0800, Alexei Starovoitov wrote:
> > > On Fri, Jan 23, 2026 at 3:06 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > >
> > > > On Thu, Jan 22, 2026 at 08:26:49PM -0800, Alexei Starovoitov wrote:
> > > > > On Wed, Jan 21, 2026 at 5:54 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > > > >
> > > > > >                 break;
> > > > > > +       case BPF_CGROUP_ITER_CHILDREN_ONLY:
> > > > > > +               kit->pos = css_next_child(kit->pos, kit->start);
> > > > > > +               break;
> > > > >
> > > > > There are no users of css_next_child() outside of cgroup
> > > > > internals. It's a red flag for me.
> > > >
> > > > Hm, I can see the slight hesitation here. However, until somewhat
> > > > recently, the same argument could have been applied to functions like
> > > > css_next_descendant_post(), right?
> > > >
> > > > css_next_descendant_post() has rather recently started seeing usage in
> > > > other subsystems (block/, kernel/sched/, kernel/bpf), despite
> > > > remaining internal-only since its inception over a decade ago. Given
> > > > that css_next_descendant_post() continues to remain unexported, in all
> > > > fairness, I don't see css_next_child() being all that different in
> > > > this context.
> > > >
> > > > Would your stance change if Tejun agreed to mark css_next_child() as
> > > > exportable, or simply agreed to it being used from BPF cgroup
> > > > iterators?
> > > >
> > > > Both css_next_child() and css_for_each_child() have remained virtually
> > > > unchanged - and therefore inherently stable - since their
> > > > introduction. I should also mention that functions like
> > > > css_next_child() and css_next_descendant_post() were originally marked
> > > > as exported, and only later unmarked. This presumably was done as part
> > > > of a symbol cleanup, not because the functions were unstable
> > > > primitives.
> > > >
> > > > > I feel there is something wrong with the use case. It should have
> > > > > been handled by the primitives we have.
> > > >
> > > > Right, arguably it could be handled with the current set of
> > > > primitives, but it isn't nice and certainly not as efficient as I'd
> > > > like.
> > >
> > > Why is it? Just descdant_pre and break out of the loop if not immediate.
> >
> > If I use a break when I hit a grandchild in pre-order traversal, I
> > will still terminate the loop early and miss all the subsequent
> > immediate children?
> >
> > I do understand that using css_next_descendant_post() (without a
> > break) could allow me to logically filter for children in the
> > loop. However, my concern is purely about algorithmic complexity and
> > scalability.
> >
> > css_next_descendant_post() performs a full subtree walk
> > (O(Descendants)). If I have a top-level cgroup with only a few
> > children but a massive hierarchy underneath (hundreds/thousands of
> > grandchildren), using the descendant based iterator requires me to
> > literally visit every single node in that subtree.
> >
> > In contrast, iterating immediate children is just a linked-list walk
> > (O(Children)). For deep hierarchies, the performance difference is not
> > constant - it's orders of magnitude.
> >
> > For example, if I have 5 children under a given root/parent with each
> > of those 5 children having 500 nested descedants each:
> > - A pre-order traversal would lead me to visiting 2505 nodes.
> > - A linear traversal would lead me to visting 5 nodes.
> >
> > > The thought process should be to use what's available as much as possible.
> >
> > Well, I get that and it was my initial idea. But, I quickly realized
> > that we have a gap in the current iterators capabilities, and this
> > patch is meant to address this gap.
> >
> > > Adding UAPI for specific use cases isn't great.
> > > If it was a new kfunc that's one thing, but here you're touching uapi/bpf.h
> > > So acceptance bar is much much higher.
> >
> > OK, but now that Tejun confirmed [0] that he's fine with exposing
> > css_next_child(), how does this weigh up?
> 
> ok. let's add it.
> Maybe shorten the name to
> BPF_CGROUP_ITER_CHILDREN
> "_ONLY" looks unnecessary.

Sounds good, thank you. Shortening to BPF_CGROUP_ITER_CHILDREN is OK
too. Will send it out as part of v2.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-01-27  8:28 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-21 13:54 [PATCH bpf-next 1/2] bpf: add new BPF_CGROUP_ITER_CHILDREN_ONLY control option Matt Bobrowski
2026-01-21 13:54 ` [PATCH bpf-next 2/2] bpf/selftests: cover " Matt Bobrowski
2026-01-21 19:14 ` [PATCH bpf-next 1/2] bpf: add new " Song Liu
2026-01-22 12:31   ` Matt Bobrowski
2026-01-23  4:26 ` Alexei Starovoitov
2026-01-23 11:06   ` Matt Bobrowski
2026-01-23 17:17     ` Alexei Starovoitov
2026-01-26  9:03       ` Matt Bobrowski
2026-01-27  2:26         ` Alexei Starovoitov
2026-01-27  8:28           ` Matt Bobrowski
2026-01-23 18:50     ` Tejun Heo
2026-01-26  9:14       ` Matt Bobrowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox