[PATCH v4 bpf-next 0/3] bpf: Add bpf_iter

BPF List
 help / color / mirror / Atom feed

* [PATCH v4 bpf-next 0/3] bpf: Add bpf_iter_cpumask
@ 2024-01-23 15:27 Yafang Shao
  2024-01-23 15:27 ` [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Yafang Shao @ 2024-01-23 15:27 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj
  Cc: bpf, Yafang Shao

Three new kfuncs, namely bpf_iter_cpumask_{new,next,destroy}, have been
added for the new bpf_iter_cpumask functionality. These kfuncs enable the
iteration of percpu data, such as runqueues, system_group_pcpu, and more.

In our specific use case, we leverage the cgroup iterator to traverse
percpu data, subsequently exposing it to userspace through a seq file.
Refer to the test cases in patch #2 for further context and examples.

Changes:
- v3 -> v4:
  - Use a copy of cpumask to avoid potential use-after-free (Yonghong)
  - Various code improvement in selftests (Yonghong)
- v2 -> v3:
  - Define KF_RCU_PROTECTED for bpf_iter_cpumask_new (Alexei)
  - Code improvement in selftests
  - Fix build error in selftest due to CONFIG_PSI=n
    reported by kernel test robot <lkp@intel.com>
- v1 -> v2: 
  - Avoid changing cgroup subsystem (Tejun)
  - Remove bpf_cpumask_set_from_pid(), and use bpf_cpumask_copy()
    instead (Tejun)
  - Use `int cpu;` field in bpf_iter_cpumask_kern (Andrii)
- bpf: Add new bpf helper bpf_for_each_cpu
  https://lwn.net/ml/bpf/20230801142912.55078-1-laoar.shao@gmail.com/

Yafang Shao (3):
  bpf: Add bpf_iter_cpumask kfuncs
  bpf, doc: Add document for cpumask iter
  selftests/bpf: Add selftests for cpumask iter

 Documentation/bpf/cpumasks.rst                |  17 +++
 kernel/bpf/cpumask.c                          |  82 +++++++++++
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
 .../selftests/bpf/progs/cpumask_common.h      |   3 +
 .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
 6 files changed, 289 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c

-- 
2.39.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs
  2024-01-23 15:27 [PATCH v4 bpf-next 0/3] bpf: Add bpf_iter_cpumask Yafang Shao
@ 2024-01-23 15:27 ` Yafang Shao
  2024-01-23 18:26   ` David Vernet
  2024-01-23 15:27 ` [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter Yafang Shao
  2024-01-23 15:27 ` [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests " Yafang Shao
  2 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-01-23 15:27 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj
  Cc: bpf, Yafang Shao

Add three new kfuncs for bpf_iter_cpumask.
- bpf_iter_cpumask_new
  KF_RCU is defined because the cpumask must be a RCU trusted pointer
  such as task->cpus_ptr.
- bpf_iter_cpumask_next
- bpf_iter_cpumask_destroy

These new kfuncs facilitate the iteration of percpu data, such as
runqueues, psi_cgroup_cpu, and more.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/cpumask.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
index 2e73533a3811..474072a235d6 100644
--- a/kernel/bpf/cpumask.c
+++ b/kernel/bpf/cpumask.c
@@ -422,6 +422,85 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask)
 	return cpumask_weight(cpumask);
 }
 
+struct bpf_iter_cpumask {
+	__u64 __opaque[2];
+} __aligned(8);
+
+struct bpf_iter_cpumask_kern {
+	struct cpumask *mask;
+	int cpu;
+} __aligned(8);
+
+/**
+ * bpf_iter_cpumask_new() - Create a new bpf_iter_cpumask for a specified cpumask
+ * @it: The new bpf_iter_cpumask to be created.
+ * @mask: The cpumask to be iterated over.
+ *
+ * This function initializes a new bpf_iter_cpumask structure for iterating over
+ * the specified CPU mask. It assigns the provided cpumask to the newly created
+ * bpf_iter_cpumask @it for subsequent iteration operations.
+ *
+ * On success, 0 is returen. On failure, ERR is returned.
+ */
+__bpf_kfunc int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask)
+{
+	struct bpf_iter_cpumask_kern *kit = (void *)it;
+
+	BUILD_BUG_ON(sizeof(struct bpf_iter_cpumask_kern) > sizeof(struct bpf_iter_cpumask));
+	BUILD_BUG_ON(__alignof__(struct bpf_iter_cpumask_kern) !=
+		     __alignof__(struct bpf_iter_cpumask));
+
+	kit->mask = bpf_mem_alloc(&bpf_global_ma, sizeof(struct cpumask));
+	if (!kit->mask)
+		return -ENOMEM;
+
+	cpumask_copy(kit->mask, mask);
+	kit->cpu = -1;
+	return 0;
+}
+
+/**
+ * bpf_iter_cpumask_next() - Get the next CPU in a bpf_iter_cpumask
+ * @it: The bpf_iter_cpumask
+ *
+ * This function retrieves a pointer to the number of the next CPU within the
+ * specified bpf_iter_cpumask. It allows sequential access to CPUs within the
+ * cpumask. If there are no further CPUs available, it returns NULL.
+ *
+ * Returns a pointer to the number of the next CPU in the cpumask or NULL if no
+ * further CPUs.
+ */
+__bpf_kfunc int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it)
+{
+	struct bpf_iter_cpumask_kern *kit = (void *)it;
+	const struct cpumask *mask = kit->mask;
+	int cpu;
+
+	if (!mask)
+		return NULL;
+	cpu = cpumask_next(kit->cpu, mask);
+	if (cpu >= nr_cpu_ids)
+		return NULL;
+
+	kit->cpu = cpu;
+	return &kit->cpu;
+}
+
+/**
+ * bpf_iter_cpumask_destroy() - Destroy a bpf_iter_cpumask
+ * @it: The bpf_iter_cpumask to be destroyed.
+ *
+ * Destroy the resource assiciated with the bpf_iter_cpumask.
+ */
+__bpf_kfunc void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it)
+{
+	struct bpf_iter_cpumask_kern *kit = (void *)it;
+
+	if (!kit->mask)
+		return;
+	bpf_mem_free(&bpf_global_ma, kit->mask);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_SET8_START(cpumask_kfunc_btf_ids)
@@ -450,6 +529,9 @@ BTF_ID_FLAGS(func, bpf_cpumask_copy, KF_RCU)
 BTF_ID_FLAGS(func, bpf_cpumask_any_distribute, KF_RCU)
 BTF_ID_FLAGS(func, bpf_cpumask_any_and_distribute, KF_RCU)
 BTF_ID_FLAGS(func, bpf_cpumask_weight, KF_RCU)
+BTF_ID_FLAGS(func, bpf_iter_cpumask_new, KF_ITER_NEW | KF_RCU)
+BTF_ID_FLAGS(func, bpf_iter_cpumask_next, KF_ITER_NEXT | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_iter_cpumask_destroy, KF_ITER_DESTROY)
 BTF_SET8_END(cpumask_kfunc_btf_ids)
 
 static const struct btf_kfunc_id_set cpumask_kfunc_set = {
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter
  2024-01-23 15:27 [PATCH v4 bpf-next 0/3] bpf: Add bpf_iter_cpumask Yafang Shao
  2024-01-23 15:27 ` [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs Yafang Shao
@ 2024-01-23 15:27 ` Yafang Shao
  2024-01-23 20:28   ` David Vernet
  2024-01-23 15:27 ` [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests " Yafang Shao
  2 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-01-23 15:27 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj
  Cc: bpf, Yafang Shao

This patch adds the document for the newly added cpumask iterator
kfuncs.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/bpf/cpumasks.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
index b5d47a04da5d..523f377afc6e 100644
--- a/Documentation/bpf/cpumasks.rst
+++ b/Documentation/bpf/cpumasks.rst
@@ -372,6 +372,23 @@ used.
 .. _tools/testing/selftests/bpf/progs/cpumask_success.c:
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
 
+3.3 cpumask iterator
+--------------------
+
+The cpumask iterator enables the iteration of percpu data, such as runqueues,
+system_group_pcpu, and more.
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+   :identifiers: bpf_iter_cpumask_new bpf_iter_cpumask_next
+                 bpf_iter_cpumask_destroy
+
+----
+
+Some example usages of the cpumask iterator can be found in
+`tools/testing/selftests/bpf/progs/test_cpumask_iter.c`_.
+
+.. _tools/testing/selftests/bpf/progs/test_cpumask_iter.c:
+   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
 
 4. Adding BPF cpumask kfuncs
 ============================
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests for cpumask iter
  2024-01-23 15:27 [PATCH v4 bpf-next 0/3] bpf: Add bpf_iter_cpumask Yafang Shao
  2024-01-23 15:27 ` [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs Yafang Shao
  2024-01-23 15:27 ` [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter Yafang Shao
@ 2024-01-23 15:27 ` Yafang Shao
  2024-01-23 20:47   ` David Vernet
  2 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-01-23 15:27 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj
  Cc: bpf, Yafang Shao

Within the BPF program, we leverage the cgroup iterator to iterate through
percpu runqueue data, specifically the 'nr_running' metric. Subsequently
 we expose this data to userspace by means of a sequence file.

The CPU affinity for the cpumask is determined by the PID of a task:

- PID of the init task (PID 1)
  We typically don't set CPU affinity for init task and thus we can iterate
  across all possible CPUs using the init task. However, in scenarios where
  you've set CPU affinity for the init task, you should set your
  current-task's cpu affinity to all possible CPUs and then proceed to
  iterate through all possible CPUs using the current task.
- PID of a task with defined CPU affinity
  The aim here is to iterate through a specific cpumask. This scenario
  aligns with tasks residing within a cpuset cgroup.
- Invalid PID (e.g., PID -1)
  No cpumask is available in this case.

The result as follows,
  #65/1    cpumask_iter/init_pid:OK
  #65/2    cpumask_iter/invalid_pid:OK
  #65/3    cpumask_iter/self_pid_one_cpu:OK
  #65/4    cpumask_iter/self_pid_multi_cpus:OK
  #65      cpumask_iter:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

CONFIG_PSI=y is required for this testcase.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
 .../selftests/bpf/progs/cpumask_common.h      |   3 +
 .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
 4 files changed, 190 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..9c42568ed376 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
 CONFIG_NF_DEFRAG_IPV4=y
 CONFIG_NF_DEFRAG_IPV6=y
 CONFIG_NF_NAT=y
+CONFIG_PSI=y
 CONFIG_RC_CORE=y
 CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
new file mode 100644
index 000000000000..1db4efc57c5f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_cpumask_iter.skel.h"
+
+static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
+{
+	int iter_fd, len, item, nr_running, psi_running, nr_cpus;
+	char buf[128];
+	size_t left;
+	char *p;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
+		return;
+
+	memset(buf, 0, sizeof(buf));
+	left = ARRAY_SIZE(buf);
+	p = buf;
+	while ((len = read(iter_fd, p, left)) > 0) {
+		p += len;
+		left -= len;
+	}
+
+	item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
+		      &nr_running, &nr_cpus, &psi_running);
+	if (nr_cpu_exp == -1) {
+		ASSERT_EQ(item, -1, "seq_format");
+		goto out;
+	}
+
+	ASSERT_EQ(item, 3, "seq_format");
+	ASSERT_GE(nr_running, nr_running_exp, "nr_running");
+	ASSERT_GE(psi_running, nr_running_exp, "psi_running");
+	ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
+
+out:
+	close(iter_fd);
+}
+
+void test_cpumask_iter(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	int nr_possible, cgrp_fd, pid, err, cnt, i;
+	struct test_cpumask_iter *skel;
+	union bpf_iter_link_info linfo;
+	int cpu_ids[] = {1, 3, 4, 5};
+	struct bpf_link *link;
+	cpu_set_t set;
+
+	skel = test_cpumask_iter__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
+		return;
+
+	if (setup_cgroup_environment())
+		goto destroy;
+
+	/* Utilize the cgroup iter */
+	cgrp_fd = get_root_cgroup();
+	if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+		goto cleanup;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = cgrp_fd;
+	linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		goto close_fd;
+
+	skel->bss->target_pid = 1;
+	/* In case init task is set CPU affinity */
+	err = sched_getaffinity(1, sizeof(set), &set);
+	if (!ASSERT_OK(err, "setaffinity"))
+		goto free_link;
+
+	cnt = CPU_COUNT(&set);
+	nr_possible = bpf_num_possible_cpus();
+	if (test__start_subtest("init_pid"))
+		/* current task is running. */
+		verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
+
+	skel->bss->target_pid = -1;
+	if (test__start_subtest("invalid_pid"))
+		verify_percpu_data(link, -1, -1);
+
+	pid = getpid();
+	skel->bss->target_pid = pid;
+	CPU_ZERO(&set);
+	CPU_SET(0, &set);
+	err = sched_setaffinity(pid, sizeof(set), &set);
+	if (!ASSERT_OK(err, "setaffinity"))
+		goto free_link;
+
+	if (test__start_subtest("self_pid_one_cpu"))
+		verify_percpu_data(link, 1, 1);
+
+	/* Assume there are at least 8 CPUs on the testbed */
+	if (nr_possible < 8)
+		goto free_link;
+
+	CPU_ZERO(&set);
+	/* Set the CPU affinitiy: 1,3-5 */
+	for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
+		CPU_SET(cpu_ids[i], &set);
+	err = sched_setaffinity(pid, sizeof(set), &set);
+	if (!ASSERT_OK(err, "setaffinity"))
+		goto free_link;
+
+	if (test__start_subtest("self_pid_multi_cpus"))
+		verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
+
+free_link:
+	bpf_link__destroy(link);
+close_fd:
+	close(cgrp_fd);
+cleanup:
+	cleanup_cgroup_environment();
+destroy:
+	test_cpumask_iter__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
index 0cd4aebb97cf..cdb9dc95e9d9 100644
--- a/tools/testing/selftests/bpf/progs/cpumask_common.h
+++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
@@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
 u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
 u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
 u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
+int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
+int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
+void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
 
 void bpf_rcu_read_lock(void) __ksym;
 void bpf_rcu_read_unlock(void) __ksym;
diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
new file mode 100644
index 000000000000..cb8b8359516b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "task_kfunc_common.h"
+#include "cpumask_common.h"
+
+extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
+extern const struct rq runqueues __ksym __weak;
+
+int target_pid;
+
+SEC("iter.s/cgroup")
+int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+	u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
+	struct psi_group_cpu *groupc;
+	struct task_struct *p;
+	struct rq *rq;
+	int *cpu;
+
+	/* epilogue */
+	if (cgrp == NULL)
+		return 0;
+
+	bpf_rcu_read_lock();
+	p = bpf_task_from_pid(target_pid);
+	if (!p) {
+		bpf_rcu_read_unlock();
+		return 1;
+	}
+
+	bpf_for_each(cpumask, cpu, p->cpus_ptr) {
+		rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
+		if (!rq)
+			continue;
+		nr_running += rq->nr_running;
+		nr_cpus += 1;
+
+		groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
+		if (!groupc)
+			continue;
+		psi_nr_running += groupc->tasks[NR_RUNNING];
+	}
+	BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
+		       nr_running, nr_cpus, psi_nr_running);
+
+	bpf_task_release(p);
+	bpf_rcu_read_unlock();
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs
  2024-01-23 15:27 ` [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs Yafang Shao
@ 2024-01-23 18:26   ` David Vernet
  2024-01-24  9:30     ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: David Vernet @ 2024-01-23 18:26 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

[-- Attachment #1: Type: text/plain, Size: 4378 bytes --]

On Tue, Jan 23, 2024 at 11:27:14PM +0800, Yafang Shao wrote:
> Add three new kfuncs for bpf_iter_cpumask.
> - bpf_iter_cpumask_new
>   KF_RCU is defined because the cpumask must be a RCU trusted pointer
>   such as task->cpus_ptr.
> - bpf_iter_cpumask_next
> - bpf_iter_cpumask_destroy
> 
> These new kfuncs facilitate the iteration of percpu data, such as
> runqueues, psi_cgroup_cpu, and more.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Thanks for working on this, this will be nice to have!

> ---
>  kernel/bpf/cpumask.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 
> diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> index 2e73533a3811..474072a235d6 100644
> --- a/kernel/bpf/cpumask.c
> +++ b/kernel/bpf/cpumask.c
> @@ -422,6 +422,85 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask)
>  	return cpumask_weight(cpumask);
>  }
>  
> +struct bpf_iter_cpumask {
> +	__u64 __opaque[2];
> +} __aligned(8);
> +
> +struct bpf_iter_cpumask_kern {
> +	struct cpumask *mask;
> +	int cpu;
> +} __aligned(8);

Why do we need both of these if we're not going to put the opaque
iterator in UAPI?

> +
> +/**
> + * bpf_iter_cpumask_new() - Create a new bpf_iter_cpumask for a specified cpumask
> + * @it: The new bpf_iter_cpumask to be created.
> + * @mask: The cpumask to be iterated over.
> + *
> + * This function initializes a new bpf_iter_cpumask structure for iterating over
> + * the specified CPU mask. It assigns the provided cpumask to the newly created
> + * bpf_iter_cpumask @it for subsequent iteration operations.
> + *
> + * On success, 0 is returen. On failure, ERR is returned.
> + */
> +__bpf_kfunc int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask)
> +{
> +	struct bpf_iter_cpumask_kern *kit = (void *)it;
> +
> +	BUILD_BUG_ON(sizeof(struct bpf_iter_cpumask_kern) > sizeof(struct bpf_iter_cpumask));
> +	BUILD_BUG_ON(__alignof__(struct bpf_iter_cpumask_kern) !=
> +		     __alignof__(struct bpf_iter_cpumask));

Why are we checking > in the first expression instead of just plain
equality?

> +
> +	kit->mask = bpf_mem_alloc(&bpf_global_ma, sizeof(struct cpumask));

Probably better to use cpumask_size() here.

> +	if (!kit->mask)
> +		return -ENOMEM;
> +
> +	cpumask_copy(kit->mask, mask);
> +	kit->cpu = -1;
> +	return 0;
> +}
> +
> +/**
> + * bpf_iter_cpumask_next() - Get the next CPU in a bpf_iter_cpumask
> + * @it: The bpf_iter_cpumask
> + *
> + * This function retrieves a pointer to the number of the next CPU within the
> + * specified bpf_iter_cpumask. It allows sequential access to CPUs within the
> + * cpumask. If there are no further CPUs available, it returns NULL.
> + *
> + * Returns a pointer to the number of the next CPU in the cpumask or NULL if no
> + * further CPUs.
> + */
> +__bpf_kfunc int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it)
> +{
> +	struct bpf_iter_cpumask_kern *kit = (void *)it;
> +	const struct cpumask *mask = kit->mask;
> +	int cpu;
> +
> +	if (!mask)
> +		return NULL;
> +	cpu = cpumask_next(kit->cpu, mask);
> +	if (cpu >= nr_cpu_ids)
> +		return NULL;
> +
> +	kit->cpu = cpu;
> +	return &kit->cpu;
> +}
> +
> +/**
> + * bpf_iter_cpumask_destroy() - Destroy a bpf_iter_cpumask
> + * @it: The bpf_iter_cpumask to be destroyed.
> + *
> + * Destroy the resource assiciated with the bpf_iter_cpumask.
> + */
> +__bpf_kfunc void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it)
> +{
> +	struct bpf_iter_cpumask_kern *kit = (void *)it;
> +
> +	if (!kit->mask)
> +		return;
> +	bpf_mem_free(&bpf_global_ma, kit->mask);
> +}
> +
>  __bpf_kfunc_end_defs();
>  
>  BTF_SET8_START(cpumask_kfunc_btf_ids)
> @@ -450,6 +529,9 @@ BTF_ID_FLAGS(func, bpf_cpumask_copy, KF_RCU)
>  BTF_ID_FLAGS(func, bpf_cpumask_any_distribute, KF_RCU)
>  BTF_ID_FLAGS(func, bpf_cpumask_any_and_distribute, KF_RCU)
>  BTF_ID_FLAGS(func, bpf_cpumask_weight, KF_RCU)
> +BTF_ID_FLAGS(func, bpf_iter_cpumask_new, KF_ITER_NEW | KF_RCU)
> +BTF_ID_FLAGS(func, bpf_iter_cpumask_next, KF_ITER_NEXT | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_iter_cpumask_destroy, KF_ITER_DESTROY)
>  BTF_SET8_END(cpumask_kfunc_btf_ids)
>  
>  static const struct btf_kfunc_id_set cpumask_kfunc_set = {
> -- 
> 2.39.1
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter
  2024-01-23 15:27 ` [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter Yafang Shao
@ 2024-01-23 20:28   ` David Vernet
  2024-01-24  9:32     ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: David Vernet @ 2024-01-23 20:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

[-- Attachment #1: Type: text/plain, Size: 2072 bytes --]

On Tue, Jan 23, 2024 at 11:27:15PM +0800, Yafang Shao wrote:
> This patch adds the document for the newly added cpumask iterator
> kfuncs.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  Documentation/bpf/cpumasks.rst | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
> index b5d47a04da5d..523f377afc6e 100644
> --- a/Documentation/bpf/cpumasks.rst
> +++ b/Documentation/bpf/cpumasks.rst
> @@ -372,6 +372,23 @@ used.
>  .. _tools/testing/selftests/bpf/progs/cpumask_success.c:
>     https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
>  
> +3.3 cpumask iterator
> +--------------------
> +
> +The cpumask iterator enables the iteration of percpu data, such as runqueues,
> +system_group_pcpu, and more.
> +
> +.. kernel-doc:: kernel/bpf/cpumask.c
> +   :identifiers: bpf_iter_cpumask_new bpf_iter_cpumask_next
> +                 bpf_iter_cpumask_destroy

Practically speaking I don't think documenting these kfuncs is going to
be super useful to most users. I expect we'd wrap this in a macro, just
like we do for bpf_for(), and I think it would be much more useful to a
reader to show how they can use such a macro with a full, self-contained
example rather than just embedding the doxygen comment here.

> +
> +----
> +
> +Some example usages of the cpumask iterator can be found in
> +`tools/testing/selftests/bpf/progs/test_cpumask_iter.c`_.
> +
> +.. _tools/testing/selftests/bpf/progs/test_cpumask_iter.c:
> +   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/test_cpumask_iter.c

I know it's typical for BPF to link to selftests like this, but I
personally strongly prefer actual examples in the documentation. We have
examples elsewhere in this file, so can we please do the same here?

>  
>  4. Adding BPF cpumask kfuncs
>  ============================
> -- 
> 2.39.1
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests for cpumask iter
  2024-01-23 15:27 ` [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests " Yafang Shao
@ 2024-01-23 20:47   ` David Vernet
  2024-01-24  9:48     ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: David Vernet @ 2024-01-23 20:47 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

[-- Attachment #1: Type: text/plain, Size: 11268 bytes --]

On Tue, Jan 23, 2024 at 11:27:16PM +0800, Yafang Shao wrote:
> Within the BPF program, we leverage the cgroup iterator to iterate through
> percpu runqueue data, specifically the 'nr_running' metric. Subsequently
>  we expose this data to userspace by means of a sequence file.
> 
> The CPU affinity for the cpumask is determined by the PID of a task:
> 
> - PID of the init task (PID 1)
>   We typically don't set CPU affinity for init task and thus we can iterate
>   across all possible CPUs using the init task. However, in scenarios where
>   you've set CPU affinity for the init task, you should set your
>   current-task's cpu affinity to all possible CPUs and then proceed to
>   iterate through all possible CPUs using the current task.
> - PID of a task with defined CPU affinity
>   The aim here is to iterate through a specific cpumask. This scenario
>   aligns with tasks residing within a cpuset cgroup.
> - Invalid PID (e.g., PID -1)
>   No cpumask is available in this case.
> 
> The result as follows,
>   #65/1    cpumask_iter/init_pid:OK
>   #65/2    cpumask_iter/invalid_pid:OK
>   #65/3    cpumask_iter/self_pid_one_cpu:OK
>   #65/4    cpumask_iter/self_pid_multi_cpus:OK
>   #65      cpumask_iter:OK
>   Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> 
> CONFIG_PSI=y is required for this testcase.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
>  .../selftests/bpf/progs/cpumask_common.h      |   3 +
>  .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
>  4 files changed, 190 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> 
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c125c441abc7..9c42568ed376 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
>  CONFIG_NF_DEFRAG_IPV4=y
>  CONFIG_NF_DEFRAG_IPV6=y
>  CONFIG_NF_NAT=y
> +CONFIG_PSI=y
>  CONFIG_RC_CORE=y
>  CONFIG_SECURITY=y
>  CONFIG_SECURITYFS=y
> diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> new file mode 100644
> index 000000000000..1db4efc57c5f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> @@ -0,0 +1,130 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <test_progs.h>
> +#include "cgroup_helpers.h"
> +#include "test_cpumask_iter.skel.h"
> +
> +static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)

In general it seems wrong for this to be void. If we fail to verify
something, why would we want to keep chugging along in the caller? That
seems like it would just add a bunch of test failure noise for all the
stuff that fails after.

> +{
> +	int iter_fd, len, item, nr_running, psi_running, nr_cpus;
> +	char buf[128];
> +	size_t left;
> +	char *p;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
> +		return;
> +
> +	memset(buf, 0, sizeof(buf));
> +	left = ARRAY_SIZE(buf);
> +	p = buf;
> +	while ((len = read(iter_fd, p, left)) > 0) {
> +		p += len;
> +		left -= len;
> +	}
> +
> +	item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
> +		      &nr_running, &nr_cpus, &psi_running);

I don't think there's any reason we should need to do string parsing
like this. We can set all of these variables from BPF and then check
them in the skeleton from user space.

> +	if (nr_cpu_exp == -1) {
> +		ASSERT_EQ(item, -1, "seq_format");
> +		goto out;
> +	}
> +
> +	ASSERT_EQ(item, 3, "seq_format");
> +	ASSERT_GE(nr_running, nr_running_exp, "nr_running");
> +	ASSERT_GE(psi_running, nr_running_exp, "psi_running");
> +	ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
> +
> +out:
> +	close(iter_fd);
> +}
> +
> +void test_cpumask_iter(void)

We should also have negative testcases that verify that we fail to load
with bogus / unsafe inputs. See e.g. [0].

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/cpumask_failure.c

> +{
> +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +	int nr_possible, cgrp_fd, pid, err, cnt, i;
> +	struct test_cpumask_iter *skel;
> +	union bpf_iter_link_info linfo;
> +	int cpu_ids[] = {1, 3, 4, 5};
> +	struct bpf_link *link;
> +	cpu_set_t set;
> +
> +	skel = test_cpumask_iter__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
> +		return;
> +
> +	if (setup_cgroup_environment())
> +		goto destroy;
> +
> +	/* Utilize the cgroup iter */
> +	cgrp_fd = get_root_cgroup();
> +	if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
> +		goto cleanup;
> +
> +	memset(&linfo, 0, sizeof(linfo));
> +	linfo.cgroup.cgroup_fd = cgrp_fd;
> +	linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
> +	opts.link_info = &linfo;
> +	opts.link_info_len = sizeof(linfo);
> +
> +	link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
> +		goto close_fd;
> +
> +	skel->bss->target_pid = 1;
> +	/* In case init task is set CPU affinity */
> +	err = sched_getaffinity(1, sizeof(set), &set);
> +	if (!ASSERT_OK(err, "setaffinity"))
> +		goto free_link;
> +
> +	cnt = CPU_COUNT(&set);
> +	nr_possible = bpf_num_possible_cpus();
> +	if (test__start_subtest("init_pid"))

Can you instead please do what we do in [1] and have each testcase be
self contained and separately invoked? Doing things the way you have it
in this patch has a few drawbacks:

1. The testcases are now all interdependent. If one fails we continue
   onto others even though we may or may not have any reason to believe
   that the system state is sane or what we expect. If the later
   testcases fail, what does it even mean? Or if they pass, is that
   expected? We should be setting up and tearing down whatever state we
   need between testcases to have confidence that the signal from each
   testcase is independent of whatever signal was in the previous one.

2. This is also confusing because we're doing a bunch of validation and
   testing even if test__start_subtest() returns false.  It's unclear
   what the expecations are in such a case, and I think it's a lot more
   logical if you just don't do anything and skip the testcase.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/cpumask.c#n8

> +		/* current task is running. */
> +		verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
> +
> +	skel->bss->target_pid = -1;
> +	if (test__start_subtest("invalid_pid"))
> +		verify_percpu_data(link, -1, -1);
> +
> +	pid = getpid();
> +	skel->bss->target_pid = pid;
> +	CPU_ZERO(&set);
> +	CPU_SET(0, &set);
> +	err = sched_setaffinity(pid, sizeof(set), &set);
> +	if (!ASSERT_OK(err, "setaffinity"))
> +		goto free_link;
> +
> +	if (test__start_subtest("self_pid_one_cpu"))
> +		verify_percpu_data(link, 1, 1);
> +
> +	/* Assume there are at least 8 CPUs on the testbed */
> +	if (nr_possible < 8)
> +		goto free_link;
> +
> +	CPU_ZERO(&set);
> +	/* Set the CPU affinitiy: 1,3-5 */
> +	for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
> +		CPU_SET(cpu_ids[i], &set);
> +	err = sched_setaffinity(pid, sizeof(set), &set);
> +	if (!ASSERT_OK(err, "setaffinity"))
> +		goto free_link;
> +
> +	if (test__start_subtest("self_pid_multi_cpus"))
> +		verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
> +
> +free_link:
> +	bpf_link__destroy(link);
> +close_fd:
> +	close(cgrp_fd);
> +cleanup:
> +	cleanup_cgroup_environment();
> +destroy:
> +	test_cpumask_iter__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
> index 0cd4aebb97cf..cdb9dc95e9d9 100644
> --- a/tools/testing/selftests/bpf/progs/cpumask_common.h
> +++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
> @@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
>  u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
>  u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
>  u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
> +int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
> +int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
> +void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
>  
>  void bpf_rcu_read_lock(void) __ksym;
>  void bpf_rcu_read_unlock(void) __ksym;
> diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> new file mode 100644
> index 000000000000..cb8b8359516b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +#include "task_kfunc_common.h"
> +#include "cpumask_common.h"
> +
> +extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
> +extern const struct rq runqueues __ksym __weak;
> +
> +int target_pid;
> +
> +SEC("iter.s/cgroup")
> +int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> +{
> +	u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
> +	struct psi_group_cpu *groupc;
> +	struct task_struct *p;
> +	struct rq *rq;
> +	int *cpu;
> +
> +	/* epilogue */
> +	if (cgrp == NULL)
> +		return 0;
> +
> +	bpf_rcu_read_lock();

Why is this required? p->cpus_ptr should be trusted given that @p is
trusted, no? Does this fail to load if you don't have all of this in an
RCU read region?

> +	p = bpf_task_from_pid(target_pid);
> +	if (!p) {
> +		bpf_rcu_read_unlock();
> +		return 1;
> +	}
> +
> +	bpf_for_each(cpumask, cpu, p->cpus_ptr) {
> +		rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
> +		if (!rq)
> +			continue;

If this happens we should set an error code and check it in user space.

> +		nr_running += rq->nr_running;
> +		nr_cpus += 1;
> +
> +		groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
> +		if (!groupc)
> +			continue;

Same here.

> +		psi_nr_running += groupc->tasks[NR_RUNNING];
> +	}
> +	BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
> +		       nr_running, nr_cpus, psi_nr_running);
> +
> +	bpf_task_release(p);
> +	bpf_rcu_read_unlock();
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";

Can you please move this to the top of the file?

> -- 
> 2.39.1
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs
  2024-01-23 18:26   ` David Vernet
@ 2024-01-24  9:30     ` Yafang Shao
  2024-01-24 17:50       ` Andrii Nakryiko
  0 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-01-24  9:30 UTC (permalink / raw)
  To: David Vernet
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

On Wed, Jan 24, 2024 at 2:26 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Jan 23, 2024 at 11:27:14PM +0800, Yafang Shao wrote:
> > Add three new kfuncs for bpf_iter_cpumask.
> > - bpf_iter_cpumask_new
> >   KF_RCU is defined because the cpumask must be a RCU trusted pointer
> >   such as task->cpus_ptr.
> > - bpf_iter_cpumask_next
> > - bpf_iter_cpumask_destroy
> >
> > These new kfuncs facilitate the iteration of percpu data, such as
> > runqueues, psi_cgroup_cpu, and more.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> Thanks for working on this, this will be nice to have!
>
> > ---
> >  kernel/bpf/cpumask.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> >
> > diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> > index 2e73533a3811..474072a235d6 100644
> > --- a/kernel/bpf/cpumask.c
> > +++ b/kernel/bpf/cpumask.c
> > @@ -422,6 +422,85 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask)
> >       return cpumask_weight(cpumask);
> >  }
> >
> > +struct bpf_iter_cpumask {
> > +     __u64 __opaque[2];
> > +} __aligned(8);
> > +
> > +struct bpf_iter_cpumask_kern {
> > +     struct cpumask *mask;
> > +     int cpu;
> > +} __aligned(8);
>
> Why do we need both of these if we're not going to put the opaque
> iterator in UAPI?

Good point! Will remove it.
It aligns with the pattern seen in
bpf_iter_{css,task,task_vma,task_css}_kern, suggesting that we should
indeed eliminate them.

>
> > +
> > +/**
> > + * bpf_iter_cpumask_new() - Create a new bpf_iter_cpumask for a specified cpumask
> > + * @it: The new bpf_iter_cpumask to be created.
> > + * @mask: The cpumask to be iterated over.
> > + *
> > + * This function initializes a new bpf_iter_cpumask structure for iterating over
> > + * the specified CPU mask. It assigns the provided cpumask to the newly created
> > + * bpf_iter_cpumask @it for subsequent iteration operations.
> > + *
> > + * On success, 0 is returen. On failure, ERR is returned.
> > + */
> > +__bpf_kfunc int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask)
> > +{
> > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > +
> > +     BUILD_BUG_ON(sizeof(struct bpf_iter_cpumask_kern) > sizeof(struct bpf_iter_cpumask));
> > +     BUILD_BUG_ON(__alignof__(struct bpf_iter_cpumask_kern) !=
> > +                  __alignof__(struct bpf_iter_cpumask));
>
> Why are we checking > in the first expression instead of just plain
> equality?

Similar to the previous case, it aligns with others. Once we eliminate
the struct bpf_iter_cpumask_kern, we can safely discard these
BUILD_BUG_ON() statements as well.

>
> > +
> > +     kit->mask = bpf_mem_alloc(&bpf_global_ma, sizeof(struct cpumask));
>
> Probably better to use cpumask_size() here.

will use it.

>
> > +     if (!kit->mask)
> > +             return -ENOMEM;
> > +
> > +     cpumask_copy(kit->mask, mask);
> > +     kit->cpu = -1;
> > +     return 0;
> > +}
> > +
> > +/**
> > + * bpf_iter_cpumask_next() - Get the next CPU in a bpf_iter_cpumask
> > + * @it: The bpf_iter_cpumask
> > + *
> > + * This function retrieves a pointer to the number of the next CPU within the
> > + * specified bpf_iter_cpumask. It allows sequential access to CPUs within the
> > + * cpumask. If there are no further CPUs available, it returns NULL.
> > + *
> > + * Returns a pointer to the number of the next CPU in the cpumask or NULL if no
> > + * further CPUs.
> > + */
> > +__bpf_kfunc int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it)
> > +{
> > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > +     const struct cpumask *mask = kit->mask;
> > +     int cpu;
> > +
> > +     if (!mask)
> > +             return NULL;
> > +     cpu = cpumask_next(kit->cpu, mask);
> > +     if (cpu >= nr_cpu_ids)
> > +             return NULL;
> > +
> > +     kit->cpu = cpu;
> > +     return &kit->cpu;
> > +}
> > +
> > +/**
> > + * bpf_iter_cpumask_destroy() - Destroy a bpf_iter_cpumask
> > + * @it: The bpf_iter_cpumask to be destroyed.
> > + *
> > + * Destroy the resource assiciated with the bpf_iter_cpumask.
> > + */
> > +__bpf_kfunc void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it)
> > +{
> > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > +
> > +     if (!kit->mask)
> > +             return;
> > +     bpf_mem_free(&bpf_global_ma, kit->mask);
> > +}
> > +
> >  __bpf_kfunc_end_defs();
> >
> >  BTF_SET8_START(cpumask_kfunc_btf_ids)
> > @@ -450,6 +529,9 @@ BTF_ID_FLAGS(func, bpf_cpumask_copy, KF_RCU)
> >  BTF_ID_FLAGS(func, bpf_cpumask_any_distribute, KF_RCU)
> >  BTF_ID_FLAGS(func, bpf_cpumask_any_and_distribute, KF_RCU)
> >  BTF_ID_FLAGS(func, bpf_cpumask_weight, KF_RCU)
> > +BTF_ID_FLAGS(func, bpf_iter_cpumask_new, KF_ITER_NEW | KF_RCU)
> > +BTF_ID_FLAGS(func, bpf_iter_cpumask_next, KF_ITER_NEXT | KF_RET_NULL)
> > +BTF_ID_FLAGS(func, bpf_iter_cpumask_destroy, KF_ITER_DESTROY)
> >  BTF_SET8_END(cpumask_kfunc_btf_ids)
> >
> >  static const struct btf_kfunc_id_set cpumask_kfunc_set = {
> > --
> > 2.39.1
> >
> >

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter
  2024-01-23 20:28   ` David Vernet
@ 2024-01-24  9:32     ` Yafang Shao
  0 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-01-24  9:32 UTC (permalink / raw)
  To: David Vernet
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

On Wed, Jan 24, 2024 at 4:28 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Jan 23, 2024 at 11:27:15PM +0800, Yafang Shao wrote:
> > This patch adds the document for the newly added cpumask iterator
> > kfuncs.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  Documentation/bpf/cpumasks.rst | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
> > index b5d47a04da5d..523f377afc6e 100644
> > --- a/Documentation/bpf/cpumasks.rst
> > +++ b/Documentation/bpf/cpumasks.rst
> > @@ -372,6 +372,23 @@ used.
> >  .. _tools/testing/selftests/bpf/progs/cpumask_success.c:
> >     https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
> >
> > +3.3 cpumask iterator
> > +--------------------
> > +
> > +The cpumask iterator enables the iteration of percpu data, such as runqueues,
> > +system_group_pcpu, and more.
> > +
> > +.. kernel-doc:: kernel/bpf/cpumask.c
> > +   :identifiers: bpf_iter_cpumask_new bpf_iter_cpumask_next
> > +                 bpf_iter_cpumask_destroy
>
> Practically speaking I don't think documenting these kfuncs is going to
> be super useful to most users. I expect we'd wrap this in a macro, just
> like we do for bpf_for(), and I think it would be much more useful to a
> reader to show how they can use such a macro with a full, self-contained
> example rather than just embedding the doxygen comment here.

Agree.

>
> > +
> > +----
> > +
> > +Some example usages of the cpumask iterator can be found in
> > +`tools/testing/selftests/bpf/progs/test_cpumask_iter.c`_.
> > +
> > +.. _tools/testing/selftests/bpf/progs/test_cpumask_iter.c:
> > +   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
>
> I know it's typical for BPF to link to selftests like this, but I
> personally strongly prefer actual examples in the documentation. We have
> examples elsewhere in this file, so can we please do the same here?

will do it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests for cpumask iter
  2024-01-23 20:47   ` David Vernet
@ 2024-01-24  9:48     ` Yafang Shao
  0 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-01-24  9:48 UTC (permalink / raw)
  To: David Vernet
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

On Wed, Jan 24, 2024 at 4:48 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Jan 23, 2024 at 11:27:16PM +0800, Yafang Shao wrote:
> > Within the BPF program, we leverage the cgroup iterator to iterate through
> > percpu runqueue data, specifically the 'nr_running' metric. Subsequently
> >  we expose this data to userspace by means of a sequence file.
> >
> > The CPU affinity for the cpumask is determined by the PID of a task:
> >
> > - PID of the init task (PID 1)
> >   We typically don't set CPU affinity for init task and thus we can iterate
> >   across all possible CPUs using the init task. However, in scenarios where
> >   you've set CPU affinity for the init task, you should set your
> >   current-task's cpu affinity to all possible CPUs and then proceed to
> >   iterate through all possible CPUs using the current task.
> > - PID of a task with defined CPU affinity
> >   The aim here is to iterate through a specific cpumask. This scenario
> >   aligns with tasks residing within a cpuset cgroup.
> > - Invalid PID (e.g., PID -1)
> >   No cpumask is available in this case.
> >
> > The result as follows,
> >   #65/1    cpumask_iter/init_pid:OK
> >   #65/2    cpumask_iter/invalid_pid:OK
> >   #65/3    cpumask_iter/self_pid_one_cpu:OK
> >   #65/4    cpumask_iter/self_pid_multi_cpus:OK
> >   #65      cpumask_iter:OK
> >   Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >
> > CONFIG_PSI=y is required for this testcase.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  tools/testing/selftests/bpf/config            |   1 +
> >  .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
> >  .../selftests/bpf/progs/cpumask_common.h      |   3 +
> >  .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
> >  4 files changed, 190 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> >
> > diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> > index c125c441abc7..9c42568ed376 100644
> > --- a/tools/testing/selftests/bpf/config
> > +++ b/tools/testing/selftests/bpf/config
> > @@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
> >  CONFIG_NF_DEFRAG_IPV4=y
> >  CONFIG_NF_DEFRAG_IPV6=y
> >  CONFIG_NF_NAT=y
> > +CONFIG_PSI=y
> >  CONFIG_RC_CORE=y
> >  CONFIG_SECURITY=y
> >  CONFIG_SECURITYFS=y
> > diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> > new file mode 100644
> > index 000000000000..1db4efc57c5f
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> > @@ -0,0 +1,130 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> > +
> > +#define _GNU_SOURCE
> > +#include <sched.h>
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +
> > +#include <test_progs.h>
> > +#include "cgroup_helpers.h"
> > +#include "test_cpumask_iter.skel.h"
> > +
> > +static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
>
> In general it seems wrong for this to be void. If we fail to verify
> something, why would we want to keep chugging along in the caller? That
> seems like it would just add a bunch of test failure noise for all the
> stuff that fails after.

makes sense.

>
> > +{
> > +     int iter_fd, len, item, nr_running, psi_running, nr_cpus;
> > +     char buf[128];
> > +     size_t left;
> > +     char *p;
> > +
> > +     iter_fd = bpf_iter_create(bpf_link__fd(link));
> > +     if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
> > +             return;
> > +
> > +     memset(buf, 0, sizeof(buf));
> > +     left = ARRAY_SIZE(buf);
> > +     p = buf;
> > +     while ((len = read(iter_fd, p, left)) > 0) {
> > +             p += len;
> > +             left -= len;
> > +     }
> > +
> > +     item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
> > +                   &nr_running, &nr_cpus, &psi_running);
>
> I don't think there's any reason we should need to do string parsing
> like this. We can set all of these variables from BPF and then check
> them in the skeleton from user space.


I agree with your point about the simplification achieved through
skeleton variables. However, it's worth noting that a seq file in this
context serves to demonstrate how we expose the data through a seq
file.

>
> > +     if (nr_cpu_exp == -1) {
> > +             ASSERT_EQ(item, -1, "seq_format");
> > +             goto out;
> > +     }
> > +
> > +     ASSERT_EQ(item, 3, "seq_format");
> > +     ASSERT_GE(nr_running, nr_running_exp, "nr_running");
> > +     ASSERT_GE(psi_running, nr_running_exp, "psi_running");
> > +     ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
> > +
> > +out:
> > +     close(iter_fd);
> > +}
> > +
> > +void test_cpumask_iter(void)
>
> We should also have negative testcases that verify that we fail to load
> with bogus / unsafe inputs. See e.g. [0].

will add it.

>
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/cpumask_failure.c
>
> > +{
> > +     DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > +     int nr_possible, cgrp_fd, pid, err, cnt, i;
> > +     struct test_cpumask_iter *skel;
> > +     union bpf_iter_link_info linfo;
> > +     int cpu_ids[] = {1, 3, 4, 5};
> > +     struct bpf_link *link;
> > +     cpu_set_t set;
> > +
> > +     skel = test_cpumask_iter__open_and_load();
> > +     if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
> > +             return;
> > +
> > +     if (setup_cgroup_environment())
> > +             goto destroy;
> > +
> > +     /* Utilize the cgroup iter */
> > +     cgrp_fd = get_root_cgroup();
> > +     if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
> > +             goto cleanup;
> > +
> > +     memset(&linfo, 0, sizeof(linfo));
> > +     linfo.cgroup.cgroup_fd = cgrp_fd;
> > +     linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
> > +     opts.link_info = &linfo;
> > +     opts.link_info_len = sizeof(linfo);
> > +
> > +     link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
> > +     if (!ASSERT_OK_PTR(link, "attach_iter"))
> > +             goto close_fd;
> > +
> > +     skel->bss->target_pid = 1;
> > +     /* In case init task is set CPU affinity */
> > +     err = sched_getaffinity(1, sizeof(set), &set);
> > +     if (!ASSERT_OK(err, "setaffinity"))
> > +             goto free_link;
> > +
> > +     cnt = CPU_COUNT(&set);
> > +     nr_possible = bpf_num_possible_cpus();
> > +     if (test__start_subtest("init_pid"))
>
> Can you instead please do what we do in [1] and have each testcase be
> self contained and separately invoked? Doing things the way you have it
> in this patch has a few drawbacks:
>
> 1. The testcases are now all interdependent. If one fails we continue
>    onto others even though we may or may not have any reason to believe
>    that the system state is sane or what we expect. If the later
>    testcases fail, what does it even mean? Or if they pass, is that
>    expected? We should be setting up and tearing down whatever state we
>    need between testcases to have confidence that the signal from each
>    testcase is independent of whatever signal was in the previous one.
>
> 2. This is also confusing because we're doing a bunch of validation and
>    testing even if test__start_subtest() returns false.  It's unclear
>    what the expecations are in such a case, and I think it's a lot more
>    logical if you just don't do anything and skip the testcase.
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/cpumask.c#n8

Will check it. Appreciate your insights. Thank you.

>
> > +             /* current task is running. */
> > +             verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
> > +
> > +     skel->bss->target_pid = -1;
> > +     if (test__start_subtest("invalid_pid"))
> > +             verify_percpu_data(link, -1, -1);
> > +
> > +     pid = getpid();
> > +     skel->bss->target_pid = pid;
> > +     CPU_ZERO(&set);
> > +     CPU_SET(0, &set);
> > +     err = sched_setaffinity(pid, sizeof(set), &set);
> > +     if (!ASSERT_OK(err, "setaffinity"))
> > +             goto free_link;
> > +
> > +     if (test__start_subtest("self_pid_one_cpu"))
> > +             verify_percpu_data(link, 1, 1);
> > +
> > +     /* Assume there are at least 8 CPUs on the testbed */
> > +     if (nr_possible < 8)
> > +             goto free_link;
> > +
> > +     CPU_ZERO(&set);
> > +     /* Set the CPU affinitiy: 1,3-5 */
> > +     for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
> > +             CPU_SET(cpu_ids[i], &set);
> > +     err = sched_setaffinity(pid, sizeof(set), &set);
> > +     if (!ASSERT_OK(err, "setaffinity"))
> > +             goto free_link;
> > +
> > +     if (test__start_subtest("self_pid_multi_cpus"))
> > +             verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
> > +
> > +free_link:
> > +     bpf_link__destroy(link);
> > +close_fd:
> > +     close(cgrp_fd);
> > +cleanup:
> > +     cleanup_cgroup_environment();
> > +destroy:
> > +     test_cpumask_iter__destroy(skel);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
> > index 0cd4aebb97cf..cdb9dc95e9d9 100644
> > --- a/tools/testing/selftests/bpf/progs/cpumask_common.h
> > +++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
> > @@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
> >  u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
> >  u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
> >  u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
> > +int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
> > +int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
> > +void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
> >
> >  void bpf_rcu_read_lock(void) __ksym;
> >  void bpf_rcu_read_unlock(void) __ksym;
> > diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> > new file mode 100644
> > index 000000000000..cb8b8359516b
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> > @@ -0,0 +1,56 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> > +
> > +#include "vmlinux.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +
> > +#include "task_kfunc_common.h"
> > +#include "cpumask_common.h"
> > +
> > +extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
> > +extern const struct rq runqueues __ksym __weak;
> > +
> > +int target_pid;
> > +
> > +SEC("iter.s/cgroup")
> > +int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > +{
> > +     u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
> > +     struct psi_group_cpu *groupc;
> > +     struct task_struct *p;
> > +     struct rq *rq;
> > +     int *cpu;
> > +
> > +     /* epilogue */
> > +     if (cgrp == NULL)
> > +             return 0;
> > +
> > +     bpf_rcu_read_lock();
>
> Why is this required? p->cpus_ptr should be trusted given that @p is
> trusted, no? Does this fail to load if you don't have all of this in an
> RCU read region?

It is a sleepable prog iter.s, so we must explicitly add a rcu lock,
otherwise the load will fail.

>
> > +     p = bpf_task_from_pid(target_pid);
> > +     if (!p) {
> > +             bpf_rcu_read_unlock();
> > +             return 1;
> > +     }
> > +
> > +     bpf_for_each(cpumask, cpu, p->cpus_ptr) {
> > +             rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
> > +             if (!rq)
> > +                     continue;
>
> If this happens we should set an error code and check it in user space.

sure, will do it.

>
> > +             nr_running += rq->nr_running;
> > +             nr_cpus += 1;
> > +
> > +             groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
> > +             if (!groupc)
> > +                     continue;
>
> Same here.

will do it.

>
> > +             psi_nr_running += groupc->tasks[NR_RUNNING];
> > +     }
> > +     BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
> > +                    nr_running, nr_cpus, psi_nr_running);
> > +
> > +     bpf_task_release(p);
> > +     bpf_rcu_read_unlock();
> > +     return 0;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
>
> Can you please move this to the top of the file?

will do it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs
  2024-01-24  9:30     ` Yafang Shao
@ 2024-01-24 17:50       ` Andrii Nakryiko
  2024-01-25 12:31         ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2024-01-24 17:50 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Vernet, ast, daniel, john.fastabend, andrii, martin.lau,
	song, yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

On Wed, Jan 24, 2024 at 1:31 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 2:26 AM David Vernet <void@manifault.com> wrote:
> >
> > On Tue, Jan 23, 2024 at 11:27:14PM +0800, Yafang Shao wrote:
> > > Add three new kfuncs for bpf_iter_cpumask.
> > > - bpf_iter_cpumask_new
> > >   KF_RCU is defined because the cpumask must be a RCU trusted pointer
> > >   such as task->cpus_ptr.
> > > - bpf_iter_cpumask_next
> > > - bpf_iter_cpumask_destroy
> > >
> > > These new kfuncs facilitate the iteration of percpu data, such as
> > > runqueues, psi_cgroup_cpu, and more.
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >
> > Thanks for working on this, this will be nice to have!
> >
> > > ---
> > >  kernel/bpf/cpumask.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 82 insertions(+)
> > >
> > > diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> > > index 2e73533a3811..474072a235d6 100644
> > > --- a/kernel/bpf/cpumask.c
> > > +++ b/kernel/bpf/cpumask.c
> > > @@ -422,6 +422,85 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask)
> > >       return cpumask_weight(cpumask);
> > >  }
> > >
> > > +struct bpf_iter_cpumask {
> > > +     __u64 __opaque[2];
> > > +} __aligned(8);
> > > +
> > > +struct bpf_iter_cpumask_kern {
> > > +     struct cpumask *mask;
> > > +     int cpu;
> > > +} __aligned(8);
> >
> > Why do we need both of these if we're not going to put the opaque
> > iterator in UAPI?
>
> Good point! Will remove it.
> It aligns with the pattern seen in
> bpf_iter_{css,task,task_vma,task_css}_kern, suggesting that we should
> indeed eliminate them.
>

It feels a bit cleaner to have API-oriented (despite being unstable
and coming from vmlinux.h) iter struct like bpf_iter_cpumask with just
"__opaque" field. And then having _kern variant with actual memory
layout. Technically _kern struct could grow smaller.

I certainly wanted this split for bpf_iter_num as that one is more of
a general purpose and stable struct. It's less relevant for more
unstable iters here.

> >
> > > +
> > > +/**
> > > + * bpf_iter_cpumask_new() - Create a new bpf_iter_cpumask for a specified cpumask
> > > + * @it: The new bpf_iter_cpumask to be created.
> > > + * @mask: The cpumask to be iterated over.
> > > + *
> > > + * This function initializes a new bpf_iter_cpumask structure for iterating over
> > > + * the specified CPU mask. It assigns the provided cpumask to the newly created
> > > + * bpf_iter_cpumask @it for subsequent iteration operations.
> > > + *
> > > + * On success, 0 is returen. On failure, ERR is returned.
> > > + */
> > > +__bpf_kfunc int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask)
> > > +{
> > > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > > +
> > > +     BUILD_BUG_ON(sizeof(struct bpf_iter_cpumask_kern) > sizeof(struct bpf_iter_cpumask));
> > > +     BUILD_BUG_ON(__alignof__(struct bpf_iter_cpumask_kern) !=
> > > +                  __alignof__(struct bpf_iter_cpumask));
> >
> > Why are we checking > in the first expression instead of just plain
> > equality?
>
> Similar to the previous case, it aligns with others. Once we eliminate
> the struct bpf_iter_cpumask_kern, we can safely discard these
> BUILD_BUG_ON() statements as well.
>
> >
> > > +
> > > +     kit->mask = bpf_mem_alloc(&bpf_global_ma, sizeof(struct cpumask));
> >
> > Probably better to use cpumask_size() here.
>
> will use it.
>
> >
> > > +     if (!kit->mask)
> > > +             return -ENOMEM;
> > > +
> > > +     cpumask_copy(kit->mask, mask);
> > > +     kit->cpu = -1;
> > > +     return 0;
> > > +}
> > > +
> > > +/**
> > > + * bpf_iter_cpumask_next() - Get the next CPU in a bpf_iter_cpumask
> > > + * @it: The bpf_iter_cpumask
> > > + *
> > > + * This function retrieves a pointer to the number of the next CPU within the
> > > + * specified bpf_iter_cpumask. It allows sequential access to CPUs within the
> > > + * cpumask. If there are no further CPUs available, it returns NULL.
> > > + *
> > > + * Returns a pointer to the number of the next CPU in the cpumask or NULL if no
> > > + * further CPUs.
> > > + */
> > > +__bpf_kfunc int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it)
> > > +{
> > > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > > +     const struct cpumask *mask = kit->mask;
> > > +     int cpu;
> > > +
> > > +     if (!mask)
> > > +             return NULL;
> > > +     cpu = cpumask_next(kit->cpu, mask);
> > > +     if (cpu >= nr_cpu_ids)
> > > +             return NULL;
> > > +
> > > +     kit->cpu = cpu;
> > > +     return &kit->cpu;
> > > +}
> > > +
> > > +/**
> > > + * bpf_iter_cpumask_destroy() - Destroy a bpf_iter_cpumask
> > > + * @it: The bpf_iter_cpumask to be destroyed.
> > > + *
> > > + * Destroy the resource assiciated with the bpf_iter_cpumask.
> > > + */
> > > +__bpf_kfunc void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it)
> > > +{
> > > +     struct bpf_iter_cpumask_kern *kit = (void *)it;
> > > +
> > > +     if (!kit->mask)
> > > +             return;
> > > +     bpf_mem_free(&bpf_global_ma, kit->mask);
> > > +}
> > > +
> > >  __bpf_kfunc_end_defs();
> > >
> > >  BTF_SET8_START(cpumask_kfunc_btf_ids)
> > > @@ -450,6 +529,9 @@ BTF_ID_FLAGS(func, bpf_cpumask_copy, KF_RCU)
> > >  BTF_ID_FLAGS(func, bpf_cpumask_any_distribute, KF_RCU)
> > >  BTF_ID_FLAGS(func, bpf_cpumask_any_and_distribute, KF_RCU)
> > >  BTF_ID_FLAGS(func, bpf_cpumask_weight, KF_RCU)
> > > +BTF_ID_FLAGS(func, bpf_iter_cpumask_new, KF_ITER_NEW | KF_RCU)
> > > +BTF_ID_FLAGS(func, bpf_iter_cpumask_next, KF_ITER_NEXT | KF_RET_NULL)
> > > +BTF_ID_FLAGS(func, bpf_iter_cpumask_destroy, KF_ITER_DESTROY)
> > >  BTF_SET8_END(cpumask_kfunc_btf_ids)
> > >
> > >  static const struct btf_kfunc_id_set cpumask_kfunc_set = {
> > > --
> > > 2.39.1
> > >
> > >
>
> --
> Regards
> Yafang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs
  2024-01-24 17:50       ` Andrii Nakryiko
@ 2024-01-25 12:31         ` Yafang Shao
  0 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-01-25 12:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David Vernet, ast, daniel, john.fastabend, andrii, martin.lau,
	song, yonghong.song, kpsingh, sdf, haoluo, jolsa, tj, bpf

On Thu, Jan 25, 2024 at 1:51 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 1:31 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Jan 24, 2024 at 2:26 AM David Vernet <void@manifault.com> wrote:
> > >
> > > On Tue, Jan 23, 2024 at 11:27:14PM +0800, Yafang Shao wrote:
> > > > Add three new kfuncs for bpf_iter_cpumask.
> > > > - bpf_iter_cpumask_new
> > > >   KF_RCU is defined because the cpumask must be a RCU trusted pointer
> > > >   such as task->cpus_ptr.
> > > > - bpf_iter_cpumask_next
> > > > - bpf_iter_cpumask_destroy
> > > >
> > > > These new kfuncs facilitate the iteration of percpu data, such as
> > > > runqueues, psi_cgroup_cpu, and more.
> > > >
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > >
> > > Thanks for working on this, this will be nice to have!
> > >
> > > > ---
> > > >  kernel/bpf/cpumask.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 82 insertions(+)
> > > >
> > > > diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> > > > index 2e73533a3811..474072a235d6 100644
> > > > --- a/kernel/bpf/cpumask.c
> > > > +++ b/kernel/bpf/cpumask.c
> > > > @@ -422,6 +422,85 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask)
> > > >       return cpumask_weight(cpumask);
> > > >  }
> > > >
> > > > +struct bpf_iter_cpumask {
> > > > +     __u64 __opaque[2];
> > > > +} __aligned(8);
> > > > +
> > > > +struct bpf_iter_cpumask_kern {
> > > > +     struct cpumask *mask;
> > > > +     int cpu;
> > > > +} __aligned(8);
> > >
> > > Why do we need both of these if we're not going to put the opaque
> > > iterator in UAPI?
> >
> > Good point! Will remove it.
> > It aligns with the pattern seen in
> > bpf_iter_{css,task,task_vma,task_css}_kern, suggesting that we should
> > indeed eliminate them.
> >
>
> It feels a bit cleaner to have API-oriented (despite being unstable
> and coming from vmlinux.h) iter struct like bpf_iter_cpumask with just
> "__opaque" field. And then having _kern variant with actual memory
> layout. Technically _kern struct could grow smaller.
>
> I certainly wanted this split for bpf_iter_num as that one is more of
> a general purpose and stable struct. It's less relevant for more
> unstable iters here.

Understood. Thanks for your explanation.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-01-25 12:31 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-23 15:27 [PATCH v4 bpf-next 0/3] bpf: Add bpf_iter_cpumask Yafang Shao
2024-01-23 15:27 ` [PATCH v4 bpf-next 1/3] bpf: Add bpf_iter_cpumask kfuncs Yafang Shao
2024-01-23 18:26   ` David Vernet
2024-01-24  9:30     ` Yafang Shao
2024-01-24 17:50       ` Andrii Nakryiko
2024-01-25 12:31         ` Yafang Shao
2024-01-23 15:27 ` [PATCH v4 bpf-next 2/3] bpf, doc: Add document for cpumask iter Yafang Shao
2024-01-23 20:28   ` David Vernet
2024-01-24  9:32     ` Yafang Shao
2024-01-23 15:27 ` [PATCH v4 bpf-next 3/3] selftests/bpf: Add selftests " Yafang Shao
2024-01-23 20:47   ` David Vernet
2024-01-24  9:48     ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox