bpf.vger.kernel.org archive mirror
* [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter
@ 2023-07-16 12:10 Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 1/4] bpf: Add __bpf_iter_attach_cgroup() Yafang Shao
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-16 12:10 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf, Yafang Shao

This patchset introduces the cgroup_task iter, which allows for efficient
iteration over the tasks within a specific cgroup. For example, we can
efficiently get the nr_{running,blocked} of a container with this new
feature.

The cgroup_task iteration serves as an alternative to task_iter in
container environments due to certain limitations associated with
task_iter.

- Firstly, task_iter only supports the 'current' pidns.
  However, since our data collector operates on the host, we may need to
  collect information from multiple containers simultaneously. Using
  task_iter would require us to fork the collector for each container,
  which is not ideal.

- Additionally, task_iter is unable to collect task information from
  containers running in the host pidns.
  In our container environment, we have containers running in the host
  pidns, and we would like to collect task information from them as well.

- Lastly, task_iter does not support multiple-container pods.
  In a Kubernetes environment, a single pod may contain multiple
  containers, all sharing the same pidns. However, we are only interested
  in iterating tasks within the main container, which is not possible with
  task_iter.

To address the first issue, we could potentially extend task_iter to
support specifying a pidns other than the current one. However, for the
other two issues, extending task_iter would not provide a solution.
Therefore, we believe it is preferable to introduce the cgroup_task iter to
handle these scenarios effectively.
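To illustrate the intended use, a program that computes the nr_running
mentioned above could follow the same pattern as the selftest programs in
patch #4 (same ctx layout and end-of-iteration convention). This is only a
sketch: the `task->__state` access and the `cgroup_nr_running` name are
assumptions, not something this series itself contains.

```c
// SPDX-License-Identifier: GPL-2.0
/* Hypothetical sketch (not part of this series): count runnable tasks
 * in the target cgroup with the proposed cgroup_task iterator. The ctx
 * fields (meta/cgroup/task) match patch #2; reading task->__state and
 * comparing it against TASK_RUNNING (0) is an assumption.
 */
#include "bpf_iter.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup_task")
int cgroup_nr_running(struct bpf_iter__cgroup_task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	static __u32 nr_running;

	if (!task) {
		/* End of iteration: emit the accumulated total. */
		BPF_SEQ_PRINTF(seq, "nr_running %u\n", nr_running);
		return 0;
	}
	if (ctx->meta->seq_num == 0)
		/* First task of a new walk: reset the counter. */
		nr_running = 0;
	if (task->__state == 0)	/* TASK_RUNNING */
		nr_running++;
	return 0;
}
```

The collector on the host would attach one such link per container cgroup
(with BPF_CGROUP_ITER_SELF_ONLY) and read the result through the iter fd,
as the selftests in patch #4 do.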

Patch #1: Preparation
Patch #2: Add cgroup_task iter
Patch #3: Add support for cgroup_task iter in bpftool
Patch #4: Selftests for cgroup_task iter

Yafang Shao (4):
  bpf: Add __bpf_iter_attach_cgroup()
  bpf: Add cgroup_task iter
  bpftool: Add support for cgroup_task
  selftests/bpf: Add selftest for cgroup_task iter

 include/linux/btf_ids.h                       |  14 ++
 kernel/bpf/cgroup_iter.c                      | 181 ++++++++++++++--
 tools/bpf/bpftool/link.c                      |   3 +-
 .../bpf/prog_tests/cgroup_task_iter.c         | 197 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_task_iter.c    |  39 ++++
 5 files changed, 419 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_task_iter.c

-- 
2.39.3


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH bpf-next 1/4] bpf: Add __bpf_iter_attach_cgroup()
  2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
@ 2023-07-16 12:10 ` Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 2/4] bpf: Add cgroup_task iter Yafang Shao
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-16 12:10 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf, Yafang Shao

This is a preparation for the follow-up patch. No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/cgroup_iter.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index 810378f04fbc..619c13c30e87 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -191,21 +191,14 @@ static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
 	.seq_priv_size		= sizeof(struct cgroup_iter_priv),
 };
 
-static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
-				  union bpf_iter_link_info *linfo,
-				  struct bpf_iter_aux_info *aux)
+static int __bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				    union bpf_iter_link_info *linfo,
+				    struct bpf_iter_aux_info *aux)
 {
 	int fd = linfo->cgroup.cgroup_fd;
 	u64 id = linfo->cgroup.cgroup_id;
-	int order = linfo->cgroup.order;
 	struct cgroup *cgrp;
 
-	if (order != BPF_CGROUP_ITER_DESCENDANTS_PRE &&
-	    order != BPF_CGROUP_ITER_DESCENDANTS_POST &&
-	    order != BPF_CGROUP_ITER_ANCESTORS_UP &&
-	    order != BPF_CGROUP_ITER_SELF_ONLY)
-		return -EINVAL;
-
 	if (fd && id)
 		return -EINVAL;
 
@@ -220,10 +213,25 @@ static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
 		return PTR_ERR(cgrp);
 
 	aux->cgroup.start = cgrp;
-	aux->cgroup.order = order;
 	return 0;
 }
 
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	int order = linfo->cgroup.order;
+
+	if (order != BPF_CGROUP_ITER_DESCENDANTS_PRE &&
+	    order != BPF_CGROUP_ITER_DESCENDANTS_POST &&
+	    order != BPF_CGROUP_ITER_ANCESTORS_UP &&
+	    order != BPF_CGROUP_ITER_SELF_ONLY)
+		return -EINVAL;
+
+	aux->cgroup.order = order;
+	return __bpf_iter_attach_cgroup(prog, linfo, aux);
+}
+
 static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
 {
 	cgroup_put(aux->cgroup.start);
-- 
2.39.3



* [RFC PATCH bpf-next 2/4] bpf: Add cgroup_task iter
  2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 1/4] bpf: Add __bpf_iter_attach_cgroup() Yafang Shao
@ 2023-07-16 12:10 ` Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 3/4] bpftool: Add support for cgroup_task Yafang Shao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-16 12:10 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf, Yafang Shao

This patch introduces the cgroup_task iter, which allows for efficient
iteration over the tasks within a specific cgroup. For example, we can
efficiently get the nr_{running,blocked} of a container with this new
feature.

The cgroup_task iteration serves as an alternative to task_iter in
container environments due to certain limitations associated with
task_iter.

- Firstly, task_iter only supports the 'current' pidns.
  However, since our data collector operates on the host, we may need to
  collect information from multiple containers simultaneously. Using
  task_iter would require us to fork the collector for each container,
  which is not ideal.

- Additionally, task_iter is unable to collect task information from
  containers running in the host pidns.
  In our container environment, we have containers running in the host
  pidns, and we would like to collect task information from them as well.

- Lastly, task_iter does not support multiple-container pods.
  In a Kubernetes environment, a single pod may contain multiple
  containers, all sharing the same pidns. However, we are only interested
  in iterating tasks within the main container, which is not possible with
  task_iter.

To address the first issue, we could potentially extend task_iter to
support specifying a pidns other than the current one. However, for the
other two issues, extending task_iter would not provide a solution.
Therefore, we believe it is preferable to introduce the cgroup_task iter to
handle these scenarios effectively.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/btf_ids.h  |  14 ++++
 kernel/bpf/cgroup_iter.c | 151 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 162 insertions(+), 3 deletions(-)

diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 00950cc03bff..559f78de8e25 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -265,6 +265,20 @@ MAX_BTF_TRACING_TYPE,
 };
 
 extern u32 btf_tracing_ids[];
+
+#ifdef CONFIG_CGROUPS
+#define BTF_CGROUP_TYPE_xxx    \
+	BTF_CGROUP_TYPE(BTF_CGROUP_TYPE_CGROUP, cgroup)		\
+	BTF_CGROUP_TYPE(BTF_CGROUP_TYPE_TASK, task_struct)
+
+enum {
+#define BTF_CGROUP_TYPE(name, type) name,
+BTF_CGROUP_TYPE_xxx
+#undef BTF_CGROUP_TYPE
+MAX_BTF_CGROUP_TYPE,
+};
+#endif
+
 extern u32 bpf_cgroup_btf_id[];
 extern u32 bpf_local_storage_map_btf_id[];
 
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index 619c13c30e87..e5b82f05910b 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -157,7 +157,9 @@ static const struct seq_operations cgroup_iter_seq_ops = {
 	.show   = cgroup_iter_seq_show,
 };
 
-BTF_ID_LIST_GLOBAL_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+BTF_ID_LIST_GLOBAL(bpf_cgroup_btf_id, MAX_BTF_CGROUP_TYPE)
+BTF_ID(struct, cgroup)
+BTF_ID(struct, task_struct)
 
 static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
 {
@@ -295,10 +297,153 @@ static struct bpf_iter_reg bpf_cgroup_reg_info = {
 	.seq_info		= &cgroup_iter_seq_info,
 };
 
+struct bpf_iter__cgroup_task {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+	__bpf_md_ptr(struct task_struct *, task);
+};
+
+struct cgroup_task_iter_priv {
+	struct cgroup_iter_priv common;
+	struct css_task_iter it;
+	struct task_struct *task;
+};
+
+DEFINE_BPF_ITER_FUNC(cgroup_task, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup, struct task_struct *task)
+
+static int bpf_iter_attach_cgroup_task(struct bpf_prog *prog,
+				       union bpf_iter_link_info *linfo,
+				       struct bpf_iter_aux_info *aux)
+{
+	int order = linfo->cgroup.order;
+
+	if (order != BPF_CGROUP_ITER_SELF_ONLY)
+		return -EINVAL;
+
+	aux->cgroup.order = order;
+	return __bpf_iter_attach_cgroup(prog, linfo, aux);
+}
+
+static void *cgroup_task_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct cgroup_task_iter_priv *p = seq->private;
+	struct cgroup_subsys_state *css = p->common.start_css;
+	struct css_task_iter *it = &p->it;
+	struct task_struct *task;
+
+	css_task_iter_start(css, 0, it);
+	if (*pos > 0) {
+		if (p->common.visited_all)
+			return NULL;
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	++*pos;
+	p->common.terminate = false;
+	p->common.visited_all = false;
+	task = css_task_iter_next(it);
+	p->task = task;
+	return task;
+}
+
+static void *cgroup_task_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct cgroup_task_iter_priv *p = seq->private;
+	struct css_task_iter *it = &p->it;
+	struct task_struct *task;
+
+	++*pos;
+	if (p->common.terminate)
+		return NULL;
+
+	task = css_task_iter_next(it);
+	p->task = task;
+	return task;
+}
+
+static int __cgroup_task_seq_show(struct seq_file *seq, struct cgroup_subsys_state *css,
+				bool in_stop)
+{
+	struct cgroup_task_iter_priv *p = seq->private;
+
+	struct bpf_iter__cgroup_task ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = css ? css->cgroup : NULL;
+	ctx.task = p->task;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+	if (ret)
+		p->common.terminate = true;
+	return 0;
+}
+
+static int cgroup_task_seq_show(struct seq_file *seq, void *v)
+{
+	return __cgroup_task_seq_show(seq, (struct cgroup_subsys_state *)v, false);
+}
+
+static void cgroup_task_seq_stop(struct seq_file *seq, void *v)
+{
+	struct cgroup_task_iter_priv *p = seq->private;
+	struct css_task_iter *it = &p->it;
+
+	css_task_iter_end(it);
+	if (!v) {
+		__cgroup_task_seq_show(seq, NULL, true);
+		p->common.visited_all = true;
+	}
+}
+
+static const struct seq_operations cgroup_task_seq_ops = {
+	.start	= cgroup_task_seq_start,
+	.next	= cgroup_task_seq_next,
+	.stop	= cgroup_task_seq_stop,
+	.show	= cgroup_task_seq_show,
+};
+
+static const struct bpf_iter_seq_info cgroup_task_seq_info = {
+	.seq_ops		= &cgroup_task_seq_ops,
+	.init_seq_private	= cgroup_iter_seq_init,
+	.fini_seq_private	= cgroup_iter_seq_fini,
+	.seq_priv_size		= sizeof(struct cgroup_task_iter_priv),
+};
+
+static struct bpf_iter_reg bpf_cgroup_task_reg_info = {
+	.target			= "cgroup_task",
+	.feature		= BPF_ITER_RESCHED,
+	.attach_target		= bpf_iter_attach_cgroup_task,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 2,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup_task, cgroup),
+		  PTR_TO_BTF_ID_OR_NULL },
+		{ offsetof(struct bpf_iter__cgroup_task, task),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info		= &cgroup_task_seq_info,
+};
+
 static int __init bpf_cgroup_iter_init(void)
 {
-	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
-	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+	int ret;
+
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[BTF_CGROUP_TYPE_CGROUP];
+	ret = bpf_iter_reg_target(&bpf_cgroup_reg_info);
+	if (ret)
+		return ret;
+
+	bpf_cgroup_task_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[BTF_CGROUP_TYPE_CGROUP];
+	bpf_cgroup_task_reg_info.ctx_arg_info[1].btf_id = bpf_cgroup_btf_id[BTF_CGROUP_TYPE_TASK];
+	return bpf_iter_reg_target(&bpf_cgroup_task_reg_info);
 }
 
 late_initcall(bpf_cgroup_iter_init);
-- 
2.39.3



* [RFC PATCH bpf-next 3/4] bpftool: Add support for cgroup_task
  2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 1/4] bpf: Add __bpf_iter_attach_cgroup() Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 2/4] bpf: Add cgroup_task iter Yafang Shao
@ 2023-07-16 12:10 ` Yafang Shao
  2023-07-16 12:10 ` [RFC PATCH bpf-next 4/4] selftests/bpf: Add selftest for cgroup_task iter Yafang Shao
  2023-07-27 14:29 ` [RFC PATCH bpf-next 0/4] bpf: Introduce " Yafang Shao
  4 siblings, 0 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-16 12:10 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf, Yafang Shao

Make the corresponding changes to bpftool to support the cgroup_task
iter.

The result:
$ bpftool link show
3: iter  prog 15  target_name cgroup_task  cgroup_id 7427  order self_only

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/bpf/bpftool/link.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 65a168df63bc..efbdefdb1b18 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -158,7 +158,8 @@ static bool is_iter_map_target(const char *target_name)
 
 static bool is_iter_cgroup_target(const char *target_name)
 {
-	return strcmp(target_name, "cgroup") == 0;
+	return strcmp(target_name, "cgroup") == 0 ||
+	       strcmp(target_name, "cgroup_task") == 0;
 }
 
 static const char *cgroup_order_string(__u32 order)
-- 
2.39.3



* [RFC PATCH bpf-next 4/4] selftests/bpf: Add selftest for cgroup_task iter
  2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
                   ` (2 preceding siblings ...)
  2023-07-16 12:10 ` [RFC PATCH bpf-next 3/4] bpftool: Add support for cgroup_task Yafang Shao
@ 2023-07-16 12:10 ` Yafang Shao
  2023-07-27 14:29 ` [RFC PATCH bpf-next 0/4] bpf: Introduce " Yafang Shao
  4 siblings, 0 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-16 12:10 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf, Yafang Shao

Add selftests for the newly introduced cgroup_task iter.

The result:
  #42/1    cgroup_task_iter/cgroup_task_iter__invalid_order:OK
  #42/2    cgroup_task_iter/cgroup_task_iter__no_task:OK
  #42/3    cgroup_task_iter/cgroup_task_iter__task_pid:OK
  #42/4    cgroup_task_iter/cgroup_task_iter__task_cnt:OK
  #42      cgroup_task_iter:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../bpf/prog_tests/cgroup_task_iter.c         | 197 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_task_iter.c    |  39 ++++
 2 files changed, 236 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_task_iter.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c b/tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c
new file mode 100644
index 000000000000..9123577524b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include <signal.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include "cgroup_helpers.h"
+#include "cgroup_task_iter.skel.h"
+
+#define PID_CNT (2)
+static char expected_output[128];
+
+static void read_from_cgroup_iter(struct bpf_program *prog, int cgroup_fd,
+				  int order, const char *testname)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+	int len, iter_fd;
+	static char buf[128];
+	size_t left;
+	char *p;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.order = order;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(prog, &opts);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (iter_fd < 0)
+		goto free_link;
+
+	memset(buf, 0, sizeof(buf));
+	left = ARRAY_SIZE(buf);
+	p = buf;
+	while ((len = read(iter_fd, p, left)) > 0) {
+		p += len;
+		left -= len;
+	}
+
+	ASSERT_STREQ(buf, expected_output, testname);
+
+	/* read() after iter finishes should be ok. */
+	if (len == 0)
+		ASSERT_OK(read(iter_fd, buf, sizeof(buf)), "second_read");
+
+	close(iter_fd);
+free_link:
+	bpf_link__destroy(link);
+}
+
+/* Invalid walk order */
+static void test_invalid_order(struct cgroup_task_iter *skel, int fd)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	enum bpf_cgroup_iter_order order;
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = fd;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	/* Only BPF_CGROUP_ITER_SELF_ONLY is supported */
+	for (order = 0; order <= BPF_CGROUP_ITER_ANCESTORS_UP; order++) {
+		if (order == BPF_CGROUP_ITER_SELF_ONLY)
+			continue;
+		linfo.cgroup.order = order;
+		link = bpf_program__attach_iter(skel->progs.cgroup_task_cnt, &opts);
+		ASSERT_ERR_PTR(link, "attach_task_iter");
+		ASSERT_EQ(errno, EINVAL, "error code on invalid walk order");
+	}
+}
+
+/* Iterate a cgroup without any tasks */
+static void test_walk_no_task(struct cgroup_task_iter *skel, int fd)
+{
+	snprintf(expected_output, sizeof(expected_output), "nr_total 0\n");
+
+	read_from_cgroup_iter(skel->progs.cgroup_task_cnt, fd,
+			      BPF_CGROUP_ITER_SELF_ONLY, "self_only");
+}
+
+/* The forked child process does nothing. */
+static void child_sleep(void)
+{
+	while (1)
+		sleep(1);
+}
+
+/* Get task pid under a cgroup */
+static void test_walk_task_pid(struct cgroup_task_iter *skel, int fd)
+{
+	int pid, status, err;
+	char pid_str[16];
+
+	pid = fork();
+	if (!ASSERT_GE(pid, 0, "fork_task"))
+		return;
+	if (pid) {
+		snprintf(pid_str, sizeof(pid_str), "%u", pid);
+		err = write_cgroup_file("cgroup_task_iter", "cgroup.procs", pid_str);
+		if (!ASSERT_EQ(err, 0, "write cgrp file"))
+			goto out;
+		snprintf(expected_output, sizeof(expected_output), "pid %u\n", pid);
+		read_from_cgroup_iter(skel->progs.cgroup_task_pid, fd,
+				      BPF_CGROUP_ITER_SELF_ONLY, "self_only");
+out:
+		kill(pid, SIGKILL);
+		waitpid(pid, &status, 0);
+	} else {
+		child_sleep();
+	}
+}
+
+/* Get task count under a cgroup */
+static void test_walk_task_cnt(struct cgroup_task_iter *skel, int fd)
+{
+	int pids[PID_CNT], pid, status, err, i;
+	char pid_str[16];
+
+	for (i = 0; i < PID_CNT; i++)
+		pids[i] = 0;
+
+	for (i = 0; i < PID_CNT; i++) {
+		pid = fork();
+		if (!ASSERT_GE(pid, 0, "fork_task"))
+			goto out;
+		if (pid) {
+			pids[i] = pid;
+			snprintf(pid_str, sizeof(pid_str), "%u", pid);
+			err = write_cgroup_file("cgroup_task_iter", "cgroup.procs", pid_str);
+			if (!ASSERT_EQ(err, 0, "write cgrp file"))
+				goto out;
+		} else {
+			child_sleep();
+		}
+	}
+
+	snprintf(expected_output, sizeof(expected_output), "nr_total %u\n", PID_CNT);
+	read_from_cgroup_iter(skel->progs.cgroup_task_cnt, fd,
+			      BPF_CGROUP_ITER_SELF_ONLY, "self_only");
+
+out:
+	for (i = 0; i < PID_CNT; i++) {
+		if (!pids[i])
+			continue;
+		kill(pids[i], SIGKILL);
+		waitpid(pids[i], &status, 0);
+	}
+}
+
+void test_cgroup_task_iter(void)
+{
+	struct cgroup_task_iter *skel = NULL;
+	int cgrp_fd;
+
+	if (setup_cgroup_environment())
+		return;
+
+	cgrp_fd = create_and_get_cgroup("cgroup_task_iter");
+	if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+		goto cleanup_cgrp_env;
+
+	skel = cgroup_task_iter__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_task_iter__open_and_load"))
+		goto out;
+
+	if (test__start_subtest("cgroup_task_iter__invalid_order"))
+		test_invalid_order(skel, cgrp_fd);
+	if (test__start_subtest("cgroup_task_iter__no_task"))
+		test_walk_no_task(skel, cgrp_fd);
+	if (test__start_subtest("cgroup_task_iter__task_pid"))
+		test_walk_task_pid(skel, cgrp_fd);
+	if (test__start_subtest("cgroup_task_iter__task_cnt"))
+		test_walk_task_cnt(skel, cgrp_fd);
+
+out:
+	cgroup_task_iter__destroy(skel);
+	close(cgrp_fd);
+	remove_cgroup("cgroup_task_iter");
+cleanup_cgrp_env:
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_task_iter.c b/tools/testing/selftests/bpf/progs/cgroup_task_iter.c
new file mode 100644
index 000000000000..b9a6d9d29d58
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_task_iter.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("iter/cgroup_task")
+int cgroup_task_cnt(struct bpf_iter__cgroup_task *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	static __u32 nr_total;
+
+	if (!task) {
+		BPF_SEQ_PRINTF(seq, "nr_total %u\n", nr_total);
+		return 0;
+	}
+
+	if (ctx->meta->seq_num == 0)
+		nr_total = 0;
+	nr_total++;
+	return 0;
+}
+
+SEC("iter/cgroup_task")
+int cgroup_task_pid(struct bpf_iter__cgroup_task *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+
+	if (!task)
+		return 0;
+
+	BPF_SEQ_PRINTF(seq, "pid %u\n", task->pid);
+	return 0;
+}
-- 
2.39.3



* Re: [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter
  2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
                   ` (3 preceding siblings ...)
  2023-07-16 12:10 ` [RFC PATCH bpf-next 4/4] selftests/bpf: Add selftest for cgroup_task iter Yafang Shao
@ 2023-07-27 14:29 ` Yafang Shao
  4 siblings, 0 replies; 6+ messages in thread
From: Yafang Shao @ 2023-07-27 14:29 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, quentin
  Cc: bpf

On Sun, Jul 16, 2023 at 8:10 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> This patch introduces cgroup_task iter, which allows for efficient
> iteration of tasks within a specific cgroup. For example, we can efficiently
> get the nr_{running,blocked} of a container with this new feature.
>
> The cgroup_task iteration serves as an alternative to task_iter in
> container environments due to certain limitations associated with
> task_iter.
>
> - Firstly, task_iter only supports the 'current' pidns.
>   However, since our data collector operates on the host, we may need to
>   collect information from multiple containers simultaneously. Using
>   task_iter would require us to fork the collector for each container,
>   which is not ideal.
>
> - Additionally, task_iter is unable to collect task information from
> containers running in the host pidns.
>   In our container environment, we have containers running in the host
>   pidns, and we would like to collect task information from them as well.
>
> - Lastly, task_iter does not support multiple-container pods.
>   In a Kubernetes environment, a single pod may contain multiple
>   containers, all sharing the same pidns. However, we are only interested
>   in iterating tasks within the main container, which is not possible with
>   task_iter.
>
> To address the first issue, we could potentially extend task_iter to
> support specifying a pidns other than the current one. However, for the
> other two issues, extending task_iter would not provide a solution.
> Therefore, we believe it is preferable to introduce the cgroup_task iter to
> handle these scenarios effectively.
>
> Patch #1: Preparation
> Patch #2: Add cgroup_task iter
> Patch #3: Add support for cgroup_task iter in bpftool
> Patch #4: Selftests for cgroup_task iter
>
> Yafang Shao (4):
>   bpf: Add __bpf_iter_attach_cgroup()
>   bpf: Add cgroup_task iter
>   bpftool: Add support for cgroup_task
>   selftests/bpf: Add selftest for cgroup_task iter
>
>  include/linux/btf_ids.h                       |  14 ++
>  kernel/bpf/cgroup_iter.c                      | 181 ++++++++++++++--
>  tools/bpf/bpftool/link.c                      |   3 +-
>  .../bpf/prog_tests/cgroup_task_iter.c         | 197 ++++++++++++++++++
>  .../selftests/bpf/progs/cgroup_task_iter.c    |  39 ++++
>  5 files changed, 419 insertions(+), 15 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_task_iter.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_task_iter.c
>

Just a kind reminder.

Is anyone interested in this idea?

-- 
Regards
Yafang


end of thread, other threads:[~2023-07-27 14:30 UTC | newest]

Thread overview: 6+ messages
2023-07-16 12:10 [RFC PATCH bpf-next 0/4] bpf: Introduce cgroup_task iter Yafang Shao
2023-07-16 12:10 ` [RFC PATCH bpf-next 1/4] bpf: Add __bpf_iter_attach_cgroup() Yafang Shao
2023-07-16 12:10 ` [RFC PATCH bpf-next 2/4] bpf: Add cgroup_task iter Yafang Shao
2023-07-16 12:10 ` [RFC PATCH bpf-next 3/4] bpftool: Add support for cgroup_task Yafang Shao
2023-07-16 12:10 ` [RFC PATCH bpf-next 4/4] selftests/bpf: Add selftest for cgroup_task iter Yafang Shao
2023-07-27 14:29 ` [RFC PATCH bpf-next 0/4] bpf: Introduce " Yafang Shao
