* [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
@ 2023-08-01 14:29 Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper Yafang Shao
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Yafang Shao @ 2023-08-01 14:29 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
kpsingh, sdf, haoluo, jolsa
Cc: bpf, Yafang Shao
Some statistical data is stored behind a percpu pointer, but the kernel does
not aggregate it into a single value; one example is the data in struct
psi_group_cpu.
Currently, we can traverse percpu data using bpf_loop and bpf_per_cpu_ptr:
bpf_loop(nr_cpus, callback_fn, callback_ctx, 0)
In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
rejected by the verifier, hindering deployment, as different servers may
have different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal either.
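For reference, the pattern above looks roughly like the following (a minimal
sketch; 'my_percpu_var' and the field accessed stand in for any percpu ksym
and are illustrative only):

```c
/* Sketch: any percpu variable exposed via vmlinux BTF. */
extern const struct psi_group_cpu my_percpu_var __ksym;

struct sum_ctx {
	__u64 total;
};

static long sum_cb(__u64 i, void *data)
{
	struct sum_ctx *ctx = data;
	/* bpf_per_cpu_ptr() returns NULL for an invalid CPU id. */
	const struct psi_group_cpu *gc = bpf_per_cpu_ptr(&my_percpu_var, i);

	if (gc)
		ctx->total += gc->tasks[NR_RUNNING];
	return 0;
}

/* In the program body: the loop count must be a verifier-known
 * constant, which is exactly the limitation described above.
 */
/* bpf_loop(CONFIG_NR_CPUS, sum_cb, &ctx, 0); */
```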
Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
However, that requires creating a bpf_cpumask, copying the cpumask from the
task, and then parsing the CPU IDs from it, which is inefficient. It might
also require introducing additional kfuncs such as bpf_cpumask_next.
A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
percpu data in all of these scenarios. It supports
for_each_{possible, present, online}_cpu, and the user can also traverse the
CPUs of a specific task, for example to walk the CPUs of a cpuset cgroup
that the task belongs to.
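As a sketch of the intended usage (the callback shape follows the helper
documentation in patch 1; the psi field access and context struct are
illustrative only):

```c
struct walk_ctx {
	__u32 nr_running;
};

/* Invoked once per CPU in the selected mask; 'ptr' is the percpu
 * pointer resolved for that CPU.
 */
static int cpu_cb(__u32 cpu, void *ctx, const void *ptr)
{
	const struct psi_group_cpu *gc = ptr;
	struct walk_ctx *data = ctx;

	data->nr_running += gc->tasks[NR_RUNNING];
	return 0;	/* 0: continue, 1: stop early */
}

/* In the program body: walk psi percpu data on all possible CPUs. */
/* ret = bpf_for_each_cpu(cpu_cb, &data, psi->pcpu, CPU_MASK_POSSIBLE, 0); */
```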
In our use case, we utilize this new helper to traverse percpu psi data,
which helps us understand why the CPU, memory, or I/O pressure is high on a
server or in a container.
Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required.
Yafang Shao (3):
bpf: Add bpf_for_each_cpu helper
cgroup, psi: Init root cgroup psi to psi_system
selftests/bpf: Add selftest for for_each_cpu
include/linux/bpf.h | 1 +
include/linux/psi.h | 2 +-
include/uapi/linux/bpf.h | 32 +++++
kernel/bpf/bpf_iter.c | 72 +++++++++++
kernel/bpf/helpers.c | 2 +
kernel/bpf/verifier.c | 29 ++++-
kernel/cgroup/cgroup.c | 5 +-
tools/include/uapi/linux/bpf.h | 32 +++++
.../selftests/bpf/prog_tests/for_each_cpu.c | 137 +++++++++++++++++++++
.../selftests/bpf/progs/test_for_each_cpu.c | 63 ++++++++++
10 files changed, 372 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
--
1.8.3.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
@ 2023-08-01 14:29 ` Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 2/3] cgroup, psi: Init root cgroup psi to psi_system Yafang Shao
` (2 subsequent siblings)
3 siblings, 0 replies; 18+ messages in thread
From: Yafang Shao @ 2023-08-01 14:29 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
kpsingh, sdf, haoluo, jolsa
Cc: bpf, Yafang Shao
Some statistical data is stored in percpu pointers, but the kernel does not
consolidate them into a single value, such as the data stored within struct
psi_group_cpu.
To facilitate obtaining the sum of this data, a new bpf helper called
bpf_for_each_cpu is introduced.
This new helper implements for_each_{possible, present, online}_cpu,
allowing the user to traverse CPUs conveniently. For instance, it enables
walking through the CPUs of a cpuset cgroup when the task is within that cgroup.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 32 +++++++++++++++++++
kernel/bpf/bpf_iter.c | 72 ++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/helpers.c | 2 ++
kernel/bpf/verifier.c | 29 ++++++++++++++++-
tools/include/uapi/linux/bpf.h | 32 +++++++++++++++++++
6 files changed, 167 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ceaa8c2..3e63607 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2951,6 +2951,7 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_for_each_cpu_proto;
const struct bpf_func_proto *tracing_prog_func_proto(
enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7fc98f4..e8a0ac7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1663,6 +1663,14 @@ struct bpf_stack_build_id {
} __attribute__((aligned(8)));
+enum bpf_cpu_mask_type {
+ CPU_MASK_UNSPEC = 0,
+ CPU_MASK_POSSIBLE = 1,
+ CPU_MASK_ONLINE = 2,
+ CPU_MASK_PRESENT = 3,
+ CPU_MASK_TASK = 4, /* cpu mask of a task */
+};
+
/* The description below is an attempt at providing documentation to eBPF
* developers about the multiple available eBPF helper functions. It can be
* parsed and used to produce a manual page. The workflow is the following,
@@ -5609,6 +5617,29 @@ struct bpf_stack_build_id {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_for_each_cpu(void *callback_fn, void *callback_ctx, const void *pcpu_ptr, u32 type, u32 target)
+ * Description
+ * Walk the percpu pointer **pcpu_ptr** with the callback **callback_fn** function.
+ * The **callback_fn** should be a static function and the **callback_ctx** should
+ * be a pointer to the stack.
+ * The **callback_ctx** is the context parameter.
+ * The **type** and **target** specify which CPUs to walk. If **target** is specified,
+ * it will get the cpumask from the associated target.
+ *
+ * long (\*callback_fn)(u32 cpu, void \*ctx, const void \*ptr);
+ *
+ * where **cpu** is the current cpu in the walk, the **ctx** is the **callback_ctx**,
+ * and the **ptr** is the address of **pcpu_ptr** on current cpu.
+ *
+ * If **callback_fn** returns 0, the helper will continue to the next
+ * iteration. If it returns 1, the helper will skip the rest of the
+ * iterations and return. Other return values are not used now
+ * and will be rejected by the verifier.
+ *
+ * Return
+ * The number of CPUs walked, **-EINVAL** for invalid **type**, **target** or
+ * **pcpu_ptr**.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5823,6 +5854,7 @@ struct bpf_stack_build_id {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(for_each_cpu, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 96856f1..e588a14 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -6,6 +6,8 @@
#include <linux/filter.h>
#include <linux/bpf.h>
#include <linux/rcupdate_trace.h>
+#include <linux/btf.h>
+#include <linux/cpumask.h>
struct bpf_iter_target_info {
struct list_head list;
@@ -777,6 +779,76 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
.arg4_type = ARG_ANYTHING,
};
+BPF_CALL_5(bpf_for_each_cpu, void *, callback_fn, void *, callback_ctx,
+ const void *, pcpu_ptr, u32, type, u32, target)
+{
+ bpf_callback_t callback = (bpf_callback_t)callback_fn;
+ struct task_struct *task = NULL;
+ const cpumask_t *mask;
+ const void *ptr;
+ u64 ret;
+ u32 cpu;
+
+ if (!pcpu_ptr)
+ return -EINVAL;
+
+ if ((type != CPU_MASK_TASK && target) || (type == CPU_MASK_TASK && !target))
+ return -EINVAL;
+
+ switch (type) {
+ case CPU_MASK_POSSIBLE:
+ mask = cpu_possible_mask;
+ break;
+ case CPU_MASK_ONLINE:
+ mask = cpu_online_mask;
+ break;
+ case CPU_MASK_PRESENT:
+ mask = cpu_present_mask;
+ break;
+ case CPU_MASK_TASK:
+ rcu_read_lock();
+ task = get_pid_task(find_vpid(target), PIDTYPE_PID);
+ rcu_read_unlock();
+ if (!task)
+ return -EINVAL;
+ mask = task->cpus_ptr;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ for_each_cpu(cpu, mask) {
+ ptr = per_cpu_ptr((const void __percpu *)pcpu_ptr, cpu);
+ if (!ptr) {
+ if (task)
+ put_task_struct(task);
+ return cpu + 1;
+ }
+
+ ret = callback((u64)cpu, (u64)(long)callback_ctx, (u64)(long)ptr, 0, 0);
+ if (ret) {
+ if (task)
+ put_task_struct(task);
+ return cpu + 1;
+ }
+ }
+
+ if (task)
+ put_task_struct(task);
+ return cpu;
+}
+
+const struct bpf_func_proto bpf_for_each_cpu_proto = {
+ .func = bpf_for_each_cpu,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_FUNC,
+ .arg2_type = ARG_PTR_TO_STACK_OR_NULL,
+ .arg3_type = ARG_PTR_TO_PERCPU_BTF_ID,
+ .arg4_type = ARG_ANYTHING,
+ .arg5_type = ARG_ANYTHING,
+};
+
struct bpf_iter_num_kern {
int cur; /* current value, inclusive */
int end; /* final value, exclusive */
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 56ce500..06f624e 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1768,6 +1768,8 @@ static int bpf_dynptr_check_off_len(const struct bpf_dynptr_kern *ptr, u32 offse
case BPF_FUNC_get_current_ancestor_cgroup_id:
return &bpf_get_current_ancestor_cgroup_id_proto;
#endif
+ case BPF_FUNC_for_each_cpu:
+ return &bpf_for_each_cpu_proto;
default:
break;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 71473c1..cd6d0a4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -549,7 +549,8 @@ static bool is_callback_calling_function(enum bpf_func_id func_id)
func_id == BPF_FUNC_timer_set_callback ||
func_id == BPF_FUNC_find_vma ||
func_id == BPF_FUNC_loop ||
- func_id == BPF_FUNC_user_ringbuf_drain;
+ func_id == BPF_FUNC_user_ringbuf_drain ||
+ func_id == BPF_FUNC_for_each_cpu;
}
static bool is_async_callback_calling_function(enum bpf_func_id func_id)
@@ -9028,6 +9029,28 @@ static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env,
return 0;
}
+static int set_for_each_cpu_callback_state(struct bpf_verifier_env *env,
+ struct bpf_func_state *caller,
+ struct bpf_func_state *callee,
+ int insn_idx)
+{
+ /* long bpf_for_each_cpu(bpf_callback_t callback_fn, void *callback_ctx,
+ * const void *pcpu_ptr, u32 type, u32 target)
+ * callback_fn(u32 cpu, void *callback_ctx, const void *ptr);
+ */
+ callee->regs[BPF_REG_1].type = SCALAR_VALUE;
+ callee->regs[BPF_REG_2] = caller->regs[BPF_REG_2];
+ callee->regs[BPF_REG_3] = caller->regs[BPF_REG_3];
+
+ /* unused */
+ __mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+ __mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+
+ callee->in_callback_fn = true;
+ callee->callback_ret_range = tnum_range(0, 1);
+ return 0;
+}
+
static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
struct bpf_func_state *caller,
struct bpf_func_state *callee,
@@ -9625,6 +9648,10 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
set_user_ringbuf_callback_state);
break;
+ case BPF_FUNC_for_each_cpu:
+ err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+ set_for_each_cpu_callback_state);
+ break;
}
if (err)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7fc98f4..e8a0ac7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1663,6 +1663,14 @@ struct bpf_stack_build_id {
} __attribute__((aligned(8)));
+enum bpf_cpu_mask_type {
+ CPU_MASK_UNSPEC = 0,
+ CPU_MASK_POSSIBLE = 1,
+ CPU_MASK_ONLINE = 2,
+ CPU_MASK_PRESENT = 3,
+ CPU_MASK_TASK = 4, /* cpu mask of a task */
+};
+
/* The description below is an attempt at providing documentation to eBPF
* developers about the multiple available eBPF helper functions. It can be
* parsed and used to produce a manual page. The workflow is the following,
@@ -5609,6 +5617,29 @@ struct bpf_stack_build_id {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_for_each_cpu(void *callback_fn, void *callback_ctx, const void *pcpu_ptr, u32 type, u32 target)
+ * Description
+ * Walk the percpu pointer **pcpu_ptr** with the callback **callback_fn** function.
+ * The **callback_fn** should be a static function and the **callback_ctx** should
+ * be a pointer to the stack.
+ * The **callback_ctx** is the context parameter.
+ * The **type** and **target** specify which CPUs to walk. If **target** is specified,
+ * it will get the cpumask from the associated target.
+ *
+ * long (\*callback_fn)(u32 cpu, void \*ctx, const void \*ptr);
+ *
+ * where **cpu** is the current cpu in the walk, the **ctx** is the **callback_ctx**,
+ * and the **ptr** is the address of **pcpu_ptr** on current cpu.
+ *
+ * If **callback_fn** returns 0, the helper will continue to the next
+ * iteration. If it returns 1, the helper will skip the rest of the
+ * iterations and return. Other return values are not used now
+ * and will be rejected by the verifier.
+ *
+ * Return
+ * The number of CPUs walked, **-EINVAL** for invalid **type**, **target** or
+ * **pcpu_ptr**.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5823,6 +5854,7 @@ struct bpf_stack_build_id {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(for_each_cpu, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
--
1.8.3.1
* [RFC PATCH bpf-next 2/3] cgroup, psi: Init root cgroup psi to psi_system
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper Yafang Shao
@ 2023-08-01 14:29 ` Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add selftest for for_each_cpu Yafang Shao
2023-08-01 17:53 ` [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yonghong Song
3 siblings, 0 replies; 18+ messages in thread
From: Yafang Shao @ 2023-08-01 14:29 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
kpsingh, sdf, haoluo, jolsa
Cc: bpf, Yafang Shao
By initializing the root cgroup's psi field to psi_system, we can
consistently obtain the psi information for all cgroups from the struct
cgroup.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/psi.h | 2 +-
kernel/cgroup/cgroup.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index e074587..8f2db51 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -34,7 +34,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
#ifdef CONFIG_CGROUPS
static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
{
- return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ return cgrp->psi;
}
int psi_cgroup_alloc(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f55a40d..d7ba5fa 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -164,7 +164,10 @@ struct cgroup_subsys *cgroup_subsys[] = {
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
/* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = {
+ .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu,
+ .cgrp.psi = &psi_system,
+};
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
/*
--
1.8.3.1
* [RFC PATCH bpf-next 3/3] selftests/bpf: Add selftest for for_each_cpu
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 2/3] cgroup, psi: Init root cgroup psi to psi_system Yafang Shao
@ 2023-08-01 14:29 ` Yafang Shao
2023-08-01 17:53 ` [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yonghong Song
3 siblings, 0 replies; 18+ messages in thread
From: Yafang Shao @ 2023-08-01 14:29 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
kpsingh, sdf, haoluo, jolsa
Cc: bpf, Yafang Shao
Add selftest for the new for_each_cpu helper.
The result:
$ tools/testing/selftests/bpf/test_progs --name=for_each_cpu
#84/1 for_each_cpu/psi_system:OK
#84/2 for_each_cpu/psi_cgroup:OK
#84/3 for_each_cpu/invalid_cpumask:OK
#84 for_each_cpu:OK
Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../selftests/bpf/prog_tests/for_each_cpu.c | 137 +++++++++++++++++++++
.../selftests/bpf/progs/test_for_each_cpu.c | 63 ++++++++++
2 files changed, 200 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
diff --git a/tools/testing/selftests/bpf/prog_tests/for_each_cpu.c b/tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
new file mode 100644
index 0000000..b0eaaec
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include "cgroup_helpers.h"
+#include "test_for_each_cpu.skel.h"
+
+static void verify_percpu_psi_value(struct test_for_each_cpu *skel, int fd, __u32 running, int res)
+{
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ union bpf_iter_link_info linfo;
+ int len, iter_fd, result;
+ struct bpf_link *link;
+ static char buf[128];
+ __u32 nr_running;
+ size_t left;
+ char *p;
+
+ memset(&linfo, 0, sizeof(linfo));
+ linfo.cgroup.cgroup_fd = fd;
+ linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+ link = bpf_program__attach_iter(skel->progs.psi_cgroup, &opts);
+ if (!ASSERT_OK_PTR(link, "attach_iter"))
+ return;
+
+ iter_fd = bpf_iter_create(bpf_link__fd(link));
+ if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
+ goto free_link;
+
+ memset(buf, 0, sizeof(buf));
+ left = ARRAY_SIZE(buf);
+ p = buf;
+ while ((len = read(iter_fd, p, left)) > 0) {
+ p += len;
+ left -= len;
+ }
+
+ ASSERT_EQ(sscanf(buf, "nr_running %u ret %d\n", &nr_running, &result), 2, "seq_format");
+ ASSERT_EQ(result, res, "for_each_cpu_result");
+ if (running)
+ ASSERT_GE(nr_running, running, "nr_running");
+ else
+ ASSERT_EQ(nr_running, running, "nr_running");
+
+ /* read() after iter finishes should be ok. */
+ if (len == 0)
+ ASSERT_OK(read(iter_fd, buf, sizeof(buf)), "second_read");
+ close(iter_fd);
+free_link:
+ bpf_link__destroy(link);
+}
+
+void test_root_cgroup(struct test_for_each_cpu *skel)
+{
+ int cgrp_fd, nr_cpus;
+
+ cgrp_fd = get_root_cgroup();
+ if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+ return;
+
+ skel->bss->cpu_mask = CPU_MASK_POSSIBLE;
+ skel->bss->pid = 0;
+ nr_cpus = bpf_num_possible_cpus();
+ /* At least current is running */
+ verify_percpu_psi_value(skel, cgrp_fd, 1, nr_cpus);
+ close(cgrp_fd);
+}
+
+void test_child_cgroup(struct test_for_each_cpu *skel)
+{
+ int cgrp_fd, nr_cpus;
+
+ cgrp_fd = create_and_get_cgroup("for_each_cpu");
+ if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+ return;
+
+ skel->bss->cpu_mask = CPU_MASK_POSSIBLE;
+ skel->bss->pid = 0;
+ nr_cpus = bpf_num_possible_cpus();
+ /* No tasks in the cgroup */
+ verify_percpu_psi_value(skel, cgrp_fd, 0, nr_cpus);
+ close(cgrp_fd);
+ remove_cgroup("for_each_cpu");
+}
+
+void verify_invalid_cpumask(struct test_for_each_cpu *skel, int fd, __u32 cpumask, __u32 pid)
+{
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+
+ skel->bss->cpu_mask = cpumask;
+ skel->bss->pid = pid;
+ verify_percpu_psi_value(skel, fd, 0, -EINVAL);
+}
+
+void test_invalid_cpumask(struct test_for_each_cpu *skel)
+{
+ int cgrp_fd;
+
+ cgrp_fd = create_and_get_cgroup("for_each_cpu");
+ if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+ return;
+
+ verify_invalid_cpumask(skel, cgrp_fd, CPU_MASK_POSSIBLE, 1);
+ verify_invalid_cpumask(skel, cgrp_fd, CPU_MASK_PRESENT, 1);
+ verify_invalid_cpumask(skel, cgrp_fd, CPU_MASK_ONLINE, 1);
+ verify_invalid_cpumask(skel, cgrp_fd, CPU_MASK_TASK, 0);
+ verify_invalid_cpumask(skel, cgrp_fd, -1, 0);
+ verify_invalid_cpumask(skel, cgrp_fd, -1, 1);
+ close(cgrp_fd);
+ remove_cgroup("for_each_cpu");
+}
+
+void test_for_each_cpu(void)
+{
+ struct test_for_each_cpu *skel = NULL;
+
+ skel = test_for_each_cpu__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
+ return;
+
+ if (setup_cgroup_environment())
+ return;
+
+ if (test__start_subtest("psi_system"))
+ test_root_cgroup(skel);
+ if (test__start_subtest("psi_cgroup"))
+ test_child_cgroup(skel);
+ if (test__start_subtest("invalid_cpumask"))
+ test_invalid_cpumask(skel);
+
+ test_for_each_cpu__destroy(skel);
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_for_each_cpu.c b/tools/testing/selftests/bpf/progs/test_for_each_cpu.c
new file mode 100644
index 0000000..1554895
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_for_each_cpu.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2023 Yafang Shao <laoar.shao@gmail.com> */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define __percpu __attribute__((btf_type_tag("percpu")))
+
+enum bpf_cpu_mask_type cpu_mask;
+__u32 pid;
+
+struct callback_ctx {
+ __u32 nr_running;
+ __u32 id;
+};
+
+static uint64_t cgroup_id(struct cgroup *cgrp)
+{
+ return cgrp->kn->id;
+}
+
+static int callback(__u32 cpu, void *ctx, const void *ptr)
+{
+ unsigned int tasks[NR_PSI_TASK_COUNTS];
+ const struct psi_group_cpu *groupc = ptr;
+ struct callback_ctx *data = ctx;
+
+ bpf_probe_read_kernel(&tasks, sizeof(tasks), &groupc->tasks);
+ data->nr_running += tasks[NR_RUNNING];
+ return 0;
+}
+
+SEC("iter.s/cgroup")
+int BPF_PROG(psi_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+ struct seq_file *seq = (struct seq_file *)meta->seq;
+ struct psi_group_cpu __percpu *pcpu_ptr;
+ struct callback_ctx data;
+ struct psi_group *psi;
+ __u64 cg_id;
+ int ret;
+
+ cg_id = cgrp ? cgroup_id(cgrp) : 0;
+ if (!cg_id)
+ return 1;
+
+ psi = cgrp->psi;
+ if (!psi)
+ return 1;
+
+ pcpu_ptr = psi->pcpu;
+ if (!pcpu_ptr)
+ return 1;
+
+ data.nr_running = 0;
+ data.id = cg_id;
+ ret = bpf_for_each_cpu(callback, &data, pcpu_ptr, cpu_mask, pid);
+ BPF_SEQ_PRINTF(seq, "nr_running %d ret %d\n", data.nr_running, ret);
+
+ return ret ? 1 : 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
1.8.3.1
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
` (2 preceding siblings ...)
2023-08-01 14:29 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add selftest for for_each_cpu Yafang Shao
@ 2023-08-01 17:53 ` Yonghong Song
2023-08-02 2:33 ` Yafang Shao
3 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2023-08-01 17:53 UTC (permalink / raw)
To: Yafang Shao, ast, daniel, john.fastabend, andrii, martin.lau,
song, yhs, kpsingh, sdf, haoluo, jolsa
Cc: bpf
On 8/1/23 7:29 AM, Yafang Shao wrote:
> Some statistic data is stored in percpu pointer but the kernel doesn't
> aggregate it into a single value, for example, the data in struct
> psi_group_cpu.
>
> Currently, we can traverse percpu data using bpf_loop and bpf_per_cpu_ptr:
>
> bpf_loop(nr_cpus, callback_fn, callback_ctx, 0)
>
> In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
> The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
> rejected by the verifier, hindering deployment, as servers may have
> different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
>
> Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
> However, it requires creating a bpf_cpumask, copying the cpumask from the
> task, and then parsing the CPU IDs from it, resulting in low efficiency.
> Introducing other kfuncs like bpf_cpumask_next might be necessary.
>
> A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
> percpu data, covering all scenarios. It includes
> for_each_{possible, present, online}_cpu. The user can also traverse CPUs
> from a specific task, such as walking the CPUs of a cpuset cgroup when the
> task is in that cgroup.
The bpf subsystem has adopted the kfunc approach, so new bpf helpers are
not accepted any more. You need to have a bpf_for_each_cpu kfunc instead.
But I am wondering whether we should use open coded iterator loops
06accc8779c1 bpf: add support for open-coded iterator loops
In kernel, we have a global variable
nr_cpu_ids (also in kernel/bpf/helpers.c)
which is used in numerous places for per cpu data struct access.
I am wondering whether we could have bpf code like

    int nr_cpu_ids __ksym;

    struct bpf_iter_num it;
    int i = 0;

    // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
    bpf_iter_num_new(&it, 1, nr_cpu_ids);
    while ((v = bpf_iter_num_next(&it))) {
            /* access cpu i data */
            i++;
    }
    bpf_iter_num_destroy(&it);
From all existing open-coded iterator loops, it looks like the upper bound
has to be a constant. We might need to extend support to a bounded scalar
upper bound if it is not already there.
>
> In our use case, we utilize this new helper to traverse percpu psi data.
> This aids in understanding why CPU, Memory, and IO pressure data are high
> on a server or a container.
>
> Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required.
>
> Yafang Shao (3):
> bpf: Add bpf_for_each_cpu helper
> cgroup, psi: Init root cgroup psi to psi_system
> selftests/bpf: Add selftest for for_each_cpu
>
> include/linux/bpf.h | 1 +
> include/linux/psi.h | 2 +-
> include/uapi/linux/bpf.h | 32 +++++
> kernel/bpf/bpf_iter.c | 72 +++++++++++
> kernel/bpf/helpers.c | 2 +
> kernel/bpf/verifier.c | 29 ++++-
> kernel/cgroup/cgroup.c | 5 +-
> tools/include/uapi/linux/bpf.h | 32 +++++
> .../selftests/bpf/prog_tests/for_each_cpu.c | 137 +++++++++++++++++++++
> .../selftests/bpf/progs/test_for_each_cpu.c | 63 ++++++++++
> 10 files changed, 372 insertions(+), 3 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
>
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-01 17:53 ` [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yonghong Song
@ 2023-08-02 2:33 ` Yafang Shao
2023-08-02 2:45 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: Yafang Shao @ 2023-08-02 2:33 UTC (permalink / raw)
To: yonghong.song
Cc: ast, daniel, john.fastabend, andrii, martin.lau, song, yhs,
kpsingh, sdf, haoluo, jolsa, bpf
On Wed, Aug 2, 2023 at 1:53 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/1/23 7:29 AM, Yafang Shao wrote:
> > Some statistic data is stored in percpu pointer but the kernel doesn't
> > aggregate it into a single value, for example, the data in struct
> > psi_group_cpu.
> >
> > Currently, we can traverse percpu data using bpf_loop and bpf_per_cpu_ptr:
> >
> > bpf_loop(nr_cpus, callback_fn, callback_ctx, 0)
> >
> > In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
> > The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
> > rejected by the verifier, hindering deployment, as servers may have
> > different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
> >
> > Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
> > However, it requires creating a bpf_cpumask, copying the cpumask from the
> > task, and then parsing the CPU IDs from it, resulting in low efficiency.
> > Introducing other kfuncs like bpf_cpumask_next might be necessary.
> >
> > A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
> > percpu data, covering all scenarios. It includes
> > for_each_{possible, present, online}_cpu. The user can also traverse CPUs
> > from a specific task, such as walking the CPUs of a cpuset cgroup when the
> > task is in that cgroup.
>
> The bpf subsystem has adopted kfunc approach. So there is no bpf helper
> any more.
Could you please share some background on why we made this decision?
> You need to have a bpf_for_each_cpu kfunc instead.
> But I am wondering whether we should use open coded iterator loops
> 06accc8779c1 bpf: add support for open-coded iterator loops
The usage of an open-coded iterator for-loop is almost the same as
bpf_loop, except that it can start from a non-zero value.
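For example (a sketch; bpf_for() is the convenience macro from the
selftests' bpf_misc.h that wraps bpf_iter_num_new/next/destroy):

```c
int i;

/* bpf_loop() always walks indices [0, nr_loops) through a callback. */
/* bpf_loop(nr_loops, cb, &ctx, 0); */

/* An open-coded iterator can start from a non-zero value and keeps
 * the loop body inline, without a separate callback function:
 */
bpf_for(i, 1, nr_loops) {
	/* body runs for i = 1 .. nr_loops - 1 */
}
```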
>
> In kernel, we have a global variable
> nr_cpu_ids (also in kernel/bpf/helpers.c)
> which is used in numerous places for per cpu data struct access.
>
> I am wondering whether we could have bpf code like
> int nr_cpu_ids __ksym;
>
> struct bpf_iter_num it;
> int i = 0;
>
> // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> bpf_iter_num_new(&it, 1, nr_cpu_ids);
> while ((v = bpf_iter_num_next(&it))) {
> /* access cpu i data */
> i++;
> }
> bpf_iter_num_destroy(&it);
>
> From all existing open coded iterator loops, looks like
> upper bound has to be a constant. We might need to extend support
> to bounded scalar upper bound if not there.
Currently a constant upper bound is required by both the open-coded
for-loop and bpf_loop. I think we can extend that.
It can't handle the cpumask case either.
for_each_cpu(cpu, mask)
In the 'mask', the CPU IDs might not be contiguous. In our container
environment, we always use the cpuset cgroup for some critical tasks,
but it is not convenient to traverse the percpu data of such a cpuset
cgroup.
That's why we prefer to introduce a bpf_for_each_cpu helper. It is
fine if it can be implemented as a kfunc.
--
Regards
Yafang
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 2:33 ` Yafang Shao
@ 2023-08-02 2:45 ` Alexei Starovoitov
2023-08-02 2:57 ` Yafang Shao
2023-08-02 3:29 ` David Vernet
0 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2023-08-02 2:45 UTC (permalink / raw)
To: Yafang Shao, David Vernet
Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf
On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> >
> > In kernel, we have a global variable
> > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > which is used in numerous places for per cpu data struct access.
> >
> > I am wondering whether we could have bpf code like
> > int nr_cpu_ids __ksym;
> >
> > struct bpf_iter_num it;
> > int i = 0;
> >
> > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > while ((v = bpf_iter_num_next(&it))) {
> > /* access cpu i data */
> > i++;
> > }
> > bpf_iter_num_destroy(&it);
> >
> > From all existing open coded iterator loops, looks like
> > upper bound has to be a constant. We might need to extend support
> > to bounded scalar upper bound if not there.
>
> Currently the upper bound is required by both the open-coded for-loop
> and the bpf_loop. I think we can extend it.
>
> It can't handle the cpumask case either.
>
> for_each_cpu(cpu, mask)
>
> In the 'mask', the CPU IDs might not be continuous. In our container
> environment, we always use the cpuset cgroup for some critical tasks,
> but it is not so convenient to traverse the percpu data of this cpuset
> cgroup. We have to do it as follows for this case :
>
> That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> fine if it can be implemented as a kfunc.
I think open-coded iterators are the only acceptable path forward here.
Since the existing bpf_iter_num doesn't fit due to sparse cpumasks,
let's introduce bpf_iter_cpumask and a few additional kfuncs
that return cpu_possible_mask and other masks.
We already have some cpumask support in kernel/bpf/cpumask.c;
bpf_iter_cpumask will be a natural follow-up.
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 2:45 ` Alexei Starovoitov
@ 2023-08-02 2:57 ` Yafang Shao
2023-08-02 3:29 ` David Vernet
1 sibling, 0 replies; 18+ messages in thread
From: Yafang Shao @ 2023-08-02 2:57 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David Vernet, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf
On Wed, Aug 2, 2023 at 10:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > > int nr_cpu_ids __ksym;
> > >
> > > struct bpf_iter_num it;
> > > int i = 0;
> > >
> > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > > while ((v = bpf_iter_num_next(&it))) {
> > > /* access cpu i data */
> > > i++;
> > > }
> > > bpf_iter_num_destroy(&it);
> > >
> > > From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> > for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup. We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
>
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.
>
> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.
I will think about it. Thanks for your suggestion.
--
Regards
Yafang
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 2:45 ` Alexei Starovoitov
2023-08-02 2:57 ` Yafang Shao
@ 2023-08-02 3:29 ` David Vernet
2023-08-02 6:54 ` Yonghong Song
2023-08-02 16:33 ` Alexei Starovoitov
1 sibling, 2 replies; 18+ messages in thread
From: David Vernet @ 2023-08-02 3:29 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yafang Shao, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf
On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > > int nr_cpu_ids __ksym;
I think this would be useful in general, though any __ksym variable like
this would have to be const and mapped in .rodata, right? But yeah,
being able to R/O map global variables like this which have static
lifetimes would be nice.
It's not quite the same thing as nr_cpu_ids, but FWIW, you could
accomplish something close to this by doing something like this in your
BPF prog:
/* Set in user space to libbpf_num_possible_cpus() */
const volatile __u32 nr_cpus;
...
__u32 i;
bpf_for(i, 0, nr_cpus)
bpf_printk("Iterating over cpu %u", i);
...
> > > struct bpf_iter_num it;
> > > int i = 0;
> > >
> > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > > while ((v = bpf_iter_num_next(&it))) {
> > > /* access cpu i data */
> > > i++;
> > > }
> > > bpf_iter_num_destroy(&it);
> > >
> > > From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> > for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup. We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
>
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.
I agree that this is the correct way to generalize this. The only thing
that we'll have to figure out is how to generalize treating const struct
cpumask * objects as kptrs. In sched_ext [0] we export
scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
return trusted global cpumask kptrs that can then be "released" in
scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
exists only to appease the verifier that the trusted cpumask kptrs
aren't being leaked and are having their references "dropped".
[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
I'd imagine that we have 2 ways forward if we want to enable progs to
fetch other global cpumasks with static lifetimes (e.g.
__cpu_possible_mask or nohz.idle_cpus_mask):
1. The most straightforward thing to do would be to add a new kfunc in
kernel/bpf/cpumask.c that's a drop-in replacement for
scx_bpf_put_idle_cpumask():
void bpf_global_cpumask_drop(const struct cpumask *cpumask)
{}
2. Another would be to implement something resembling what Yonghong
suggested in [1], where progs can link against global allocated kptrs
like:
const struct cpumask *__cpu_possible_mask __ksym;
[1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
In my opinion (1) is more straightforward, but (2) is a better UX.
Note again that both approaches only work for cpumasks with static
lifetimes. I can't think of a way to treat dynamically allocated struct
cpumask * objects as kptrs, as there's nowhere to put a reference. If
someone wants to track a dynamically allocated cpumask, they'd have to
create a kptr out of its container object, and then pass that object's
cpumask as a const struct cpumask * to other BPF cpumask kfuncs
(including e.g. the proposed iterator).
> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.
Yes, this should be easy to add.
- David
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 3:29 ` David Vernet
@ 2023-08-02 6:54 ` Yonghong Song
2023-08-02 15:46 ` David Vernet
2023-08-02 16:33 ` Alexei Starovoitov
1 sibling, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2023-08-02 6:54 UTC (permalink / raw)
To: David Vernet, Alexei Starovoitov
Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
On 8/1/23 8:29 PM, David Vernet wrote:
> On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>>>
>>>> In kernel, we have a global variable
>>>> nr_cpu_ids (also in kernel/bpf/helpers.c)
>>>> which is used in numerous places for per cpu data struct access.
>>>>
>>>> I am wondering whether we could have bpf code like
>>>> int nr_cpu_ids __ksym;
>
> I think this would be useful in general, though any __ksym variable like
> this would have to be const and mapped in .rodata, right? But yeah,
> being able to R/O map global variables like this which have static
> lifetimes would be nice.
No, there is no map here. A __ksym symbol will have a ld_imm64 insn
to load its value in the bpf code. The address will be the kernel
address patched in by libbpf.
>
> It's not quite the same thing as nr_cpu_ids, but FWIW, you could
> accomplish something close to this by doing something like this in your
> BPF prog:
>
> /* Set in user space to libbpf_num_possible_cpus() */
> const volatile __u32 nr_cpus;
This is another approach. In this case, nr_cpus will be
stored in a map.
I don't know the difference between the kernel's nr_cpu_ids and
libbpf_num_possible_cpus(). I am choosing nr_cpu_ids
because it is the one used inside the kernel. If
libbpf_num_possible_cpus() is effectively nr_cpu_ids,
then I am happy to use libbpf_num_possible_cpus(), which
is more user/libbpf friendly.
>
> ...
> __u32 i;
>
> bpf_for(i, 0, nr_cpus)
> bpf_printk("Iterating over cpu %u", i);
>
> ...
>
>>>> struct bpf_iter_num it;
>>>> int i = 0;
>>>>
>>>> // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
>>>> bpf_iter_num_new(&it, 1, nr_cpu_ids);
>>>> while ((v = bpf_iter_num_next(&it))) {
>>>> /* access cpu i data */
>>>> i++;
>>>> }
>>>> bpf_iter_num_destroy(&it);
>>>>
>>>> From all existing open coded iterator loops, looks like
>>>> upper bound has to be a constant. We might need to extend support
>>>> to bounded scalar upper bound if not there.
>>>
>>> Currently the upper bound is required by both the open-coded for-loop
>>> and the bpf_loop. I think we can extend it.
>>>
>>> It can't handle the cpumask case either.
>>>
>>> for_each_cpu(cpu, mask)
>>>
>>> In the 'mask', the CPU IDs might not be continuous. In our container
>>> environment, we always use the cpuset cgroup for some critical tasks,
>>> but it is not so convenient to traverse the percpu data of this cpuset
>>> cgroup. We have to do it as follows for this case :
>>>
>>> That's why we prefer to introduce a bpf_for_each_cpu helper. It is
>>> fine if it can be implemented as a kfunc.
>>
>> I think open-coded-iterators is the only acceptable path forward here.
>> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
>> let's introduce bpf_iter_cpumask and few additional kfuncs
>> that return cpu_possible_mask and others.
>
> I agree that this is the correct way to generalize this. The only thing
> that we'll have to figure out is how to generalize treating const struct
> cpumask * objects as kptrs. In sched_ext [0] we export
> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> return trusted global cpumask kptrs that can then be "released" in
> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> exists only to appease the verifier that the trusted cpumask kptrs
> aren't being leaked and are having their references "dropped".
>
> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>
> I'd imagine that we have 2 ways forward if we want to enable progs to
> fetch other global cpumasks with static lifetimes (e.g.
> __cpu_possible_mask or nohz.idle_cpus_mask):
>
> 1. The most straightforward thing to do would be to add a new kfunc in
> kernel/bpf/cpumask.c that's a drop-in replacment for
> scx_bpf_put_idle_cpumask():
>
> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> {}
>
> 2. Another would be to implement something resembling what Yonghong
> suggested in [1], where progs can link against global allocated kptrs
> like:
>
> const struct cpumask *__cpu_possible_mask __ksym;
>
> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>
> In my opinion (1) is more straightforward, (2) is a better UX.
>
> Note again that both approaches only works for cpumasks with static
> lifetimes. I can't think of a way to treat dynamically allocated struct
> cpumask *objects as kptrs as there's nowhere to put a reference. If
> someone wants to track a dynamically allocated cpumask, they'd have to
> create a kptr out of its container object, and then pass that object's
> cpumask as a const struct cpumask * to other BPF cpumask kfuncs
> (including e.g. the proposed iterator).
>
>> We already have some cpumask support in kernel/bpf/cpumask.c
>> bpf_iter_cpumask will be a natural follow up.
>
> Yes, this should be easy to add.
>
> - David
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 6:54 ` Yonghong Song
@ 2023-08-02 15:46 ` David Vernet
2023-08-02 16:23 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: David Vernet @ 2023-08-02 15:46 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexei Starovoitov, Yafang Shao, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote:
>
>
> On 8/1/23 8:29 PM, David Vernet wrote:
> > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > >
> > > > > In kernel, we have a global variable
> > > > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > > > which is used in numerous places for per cpu data struct access.
> > > > >
> > > > > I am wondering whether we could have bpf code like
> > > > > int nr_cpu_ids __ksym;
> >
> > I think this would be useful in general, though any __ksym variable like
> > this would have to be const and mapped in .rodata, right? But yeah,
> > being able to R/O map global variables like this which have static
> > lifetimes would be nice.
>
> No. There is no map here. __ksym symbol will have a ld_imm64 insn
> to load the value in the bpf code. The address will be the kernel
> address patched by libbpf.
ld_imm64 is fine. I'm talking about stores. BPF progs should not be able
to mutate these variables.
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 15:46 ` David Vernet
@ 2023-08-02 16:23 ` Alexei Starovoitov
0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2023-08-02 16:23 UTC (permalink / raw)
To: David Vernet
Cc: Yonghong Song, Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf
On Wed, Aug 2, 2023 at 8:46 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote:
> >
> >
> > On 8/1/23 8:29 PM, David Vernet wrote:
> > > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> > > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > In kernel, we have a global variable
> > > > > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > > > > which is used in numerous places for per cpu data struct access.
> > > > > >
> > > > > > I am wondering whether we could have bpf code like
> > > > > > int nr_cpu_ids __ksym;
> > >
> > > I think this would be useful in general, though any __ksym variable like
> > > this would have to be const and mapped in .rodata, right? But yeah,
> > > being able to R/O map global variables like this which have static
> > > lifetimes would be nice.
> >
> > No. There is no map here. __ksym symbol will have a ld_imm64 insn
> > to load the value in the bpf code. The address will be the kernel
> > address patched by libbpf.
>
> ld_imm64 is fine. I'm talking about stores. BPF progs should not be able
> to mutate these variables.
ksyms are read-only, and support for them is already in the kernel and libbpf.
The only reason
extern int nr_cpu_ids __ksym;
doesn't work today is that nr_cpu_ids is not in vmlinux BTF;
pahole adds only per-cpu vars to BTF.
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 3:29 ` David Vernet
2023-08-02 6:54 ` Yonghong Song
@ 2023-08-02 16:33 ` Alexei Starovoitov
2023-08-02 17:06 ` David Vernet
2023-08-03 8:21 ` Alan Maguire
1 sibling, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2023-08-02 16:33 UTC (permalink / raw)
To: David Vernet, Alan Maguire
Cc: Yafang Shao, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf
On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> I agree that this is the correct way to generalize this. The only thing
> that we'll have to figure out is how to generalize treating const struct
> cpumask * objects as kptrs. In sched_ext [0] we export
> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> return trusted global cpumask kptrs that can then be "released" in
> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> exists only to appease the verifier that the trusted cpumask kptrs
> aren't being leaked and are having their references "dropped".
why is it KF_ACQUIRE ?
I think it can just return a trusted pointer without acquire.
> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>
> I'd imagine that we have 2 ways forward if we want to enable progs to
> fetch other global cpumasks with static lifetimes (e.g.
> __cpu_possible_mask or nohz.idle_cpus_mask):
>
> 1. The most straightforward thing to do would be to add a new kfunc in
> kernel/bpf/cpumask.c that's a drop-in replacment for
> scx_bpf_put_idle_cpumask():
>
> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> {}
>
> 2. Another would be to implement something resembling what Yonghong
> suggested in [1], where progs can link against global allocated kptrs
> like:
>
> const struct cpumask *__cpu_possible_mask __ksym;
>
> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>
> In my opinion (1) is more straightforward, (2) is a better UX.
1 = adding a few kfuncs.
2 = teaching pahole to emit certain global vars.
nm vmlinux | grep -w D | grep -v __SCK_ | grep -v __tracepoint_ | wc -l
1998
IMO the BTF size increase trade-off is acceptable.
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 16:33 ` Alexei Starovoitov
@ 2023-08-02 17:06 ` David Vernet
2023-08-02 18:13 ` Alexei Starovoitov
2023-08-03 8:21 ` Alan Maguire
1 sibling, 1 reply; 18+ messages in thread
From: David Vernet @ 2023-08-02 17:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alan Maguire, Yafang Shao, Yonghong Song, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> > I agree that this is the correct way to generalize this. The only thing
> > that we'll have to figure out is how to generalize treating const struct
> > cpumask * objects as kptrs. In sched_ext [0] we export
> > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> > return trusted global cpumask kptrs that can then be "released" in
> > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> > exists only to appease the verifier that the trusted cpumask kptrs
> > aren't being leaked and are having their references "dropped".
>
> why is it KF_ACQUIRE ?
> I think it can just return a trusted pointer without acquire.
I don't think there's a way to do this yet without hard-coding the kfuncs as
special, right? That's why we do stuff like this:
11479 } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
11480 mark_reg_known_zero(env, regs, BPF_REG_0);
11481 regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
11482 regs[BPF_REG_0].btf = desc_btf;
11483 regs[BPF_REG_0].btf_id = meta.ret_btf_id;
We could continue to do that, but I wonder if it would be useful to add
a kfunc flag that allowed a kfunc to specify that? Something like
KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to
not require KF_ACQUIRE if we want. It's just that what we have now
doesn't really scale to supporting any global cpumask in the kernel.
> > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> >
> > I'd imagine that we have 2 ways forward if we want to enable progs to
> > fetch other global cpumasks with static lifetimes (e.g.
> > __cpu_possible_mask or nohz.idle_cpus_mask):
> >
> > 1. The most straightforward thing to do would be to add a new kfunc in
> > kernel/bpf/cpumask.c that's a drop-in replacment for
> > scx_bpf_put_idle_cpumask():
> >
> > void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> > {}
> >
> > 2. Another would be to implement something resembling what Yonghong
> > suggested in [1], where progs can link against global allocated kptrs
> > like:
> >
> > const struct cpumask *__cpu_possible_mask __ksym;
> >
> > [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
> >
> > In my opinion (1) is more straightforward, (2) is a better UX.
>
> 1 = adding few kfuncs.
> 2 = teaching pahole to emit certain global vars.
>
> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
> 1998
>
> imo BTF increase trade off is acceptable.
Agreed
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 17:06 ` David Vernet
@ 2023-08-02 18:13 ` Alexei Starovoitov
0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2023-08-02 18:13 UTC (permalink / raw)
To: David Vernet
Cc: Alan Maguire, Yafang Shao, Yonghong Song, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
On Wed, Aug 2, 2023 at 10:06 AM David Vernet <void@manifault.com> wrote:
>
> On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote:
> > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> > > I agree that this is the correct way to generalize this. The only thing
> > > that we'll have to figure out is how to generalize treating const struct
> > > cpumask * objects as kptrs. In sched_ext [0] we export
> > > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> > > return trusted global cpumask kptrs that can then be "released" in
> > > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> > > exists only to appease the verifier that the trusted cpumask kptrs
> > > aren't being leaked and are having their references "dropped".
> >
> > why is it KF_ACQUIRE ?
> > I think it can just return a trusted pointer without acquire.
>
> I don't think there's a way to do this yet without hard-coding the kfuncs as
> special, right? That's why we do stuff like this:
>
>
> 11479 } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
> 11480 mark_reg_known_zero(env, regs, BPF_REG_0);
> 11481 regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
> 11482 regs[BPF_REG_0].btf = desc_btf;
> 11483 regs[BPF_REG_0].btf_id = meta.ret_btf_id;
>
> We could continue to do that, but I wonder if it would be useful to add
> a kfunc flag that allowed a kfunc to specify that? Something like
> KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to
> not require KF_ACQUIRE if we want. It's just that what we have now
> doesn't really scale to the kernel for any global cpumask.
We should probably just do this:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e7b1af016841..f0c3f69ee5a8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11567,7 +11567,7 @@ static int check_kfunc_call(struct
bpf_verifier_env *env, struct bpf_insn *insn,
} else {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].btf = desc_btf;
- regs[BPF_REG_0].type = PTR_TO_BTF_ID;
+ regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
regs[BPF_REG_0].btf_id = ptr_type_id;
A quick audit of kfuncs shows that this code is never used:
everywhere a ptr_to_btf is returned, the kfunc is marked with KF_ACQUIRE.
We'd need careful code review to make sure future kfuncs
actually return trusted pointers.
That would simplify scx_bpf_get_idle_cpumask(): no need for a get and a no-op put.
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-02 16:33 ` Alexei Starovoitov
2023-08-02 17:06 ` David Vernet
@ 2023-08-03 8:21 ` Alan Maguire
2023-08-03 15:22 ` Yonghong Song
1 sibling, 1 reply; 18+ messages in thread
From: Alan Maguire @ 2023-08-03 8:21 UTC (permalink / raw)
To: Alexei Starovoitov, David Vernet
Cc: Yafang Shao, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
bpf, Stephen Brennan
On 02/08/2023 17:33, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>> I agree that this is the correct way to generalize this. The only thing
>> that we'll have to figure out is how to generalize treating const struct
>> cpumask * objects as kptrs. In sched_ext [0] we export
>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>> return trusted global cpumask kptrs that can then be "released" in
>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>> exists only to appease the verifier that the trusted cpumask kptrs
>> aren't being leaked and are having their references "dropped".
>
> why is it KF_ACQUIRE ?
> I think it can just return a trusted pointer without acquire.
>
>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>
>> I'd imagine that we have 2 ways forward if we want to enable progs to
>> fetch other global cpumasks with static lifetimes (e.g.
>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>
>> 1. The most straightforward thing to do would be to add a new kfunc in
>> kernel/bpf/cpumask.c that's a drop-in replacment for
>> scx_bpf_put_idle_cpumask():
>>
>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>> {}
>>
>> 2. Another would be to implement something resembling what Yonghong
>> suggested in [1], where progs can link against global allocated kptrs
>> like:
>>
>> const struct cpumask *__cpu_possible_mask __ksym;
>>
>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>
>> In my opinion (1) is more straightforward, (2) is a better UX.
>
> 1 = adding few kfuncs.
> 2 = teaching pahole to emit certain global vars.
>
> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
> 1998
>
> imo BTF increase trade off is acceptable.
Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
What was gating that work in my mind was previous discussion around
splitting aspects of BTF into a "vmlinux-extra". Experiments with this
seemed to show it's hard to support, and worse, tooling would have to
learn about its existence. We would have to come up with a CONFIG
convention specifying what ends up in -extra versus the core vmlinux BTF,
decide what to do about modules, etc. It all feels like over-complication.
I think a better path would be to support BTF in a vmlinux BTF module
(controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
separately loadable, but exposes the vmlinux BTF in the same place for
tools, /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
for embedded use cases that have come up a few times, and lessens
concerns about BTF size for other users, while it all works with
existing tooling. I have a basic proof-of-concept but it will take time
to hammer into shape.
Since variable-related size increases are pretty modest, should we
proceed with the BTF variable support anyway? We can modularize BTF
separately later on for those concerned about BTF size.
[1]
https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-03 8:21 ` Alan Maguire
@ 2023-08-03 15:22 ` Yonghong Song
2023-08-03 16:10 ` Alan Maguire
0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2023-08-03 15:22 UTC (permalink / raw)
To: Alan Maguire, Alexei Starovoitov, David Vernet
Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
Stephen Brennan
On 8/3/23 1:21 AM, Alan Maguire wrote:
> On 02/08/2023 17:33, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>>> I agree that this is the correct way to generalize this. The only thing
>>> that we'll have to figure out is how to generalize treating const struct
>>> cpumask * objects as kptrs. In sched_ext [0] we export
>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>>> return trusted global cpumask kptrs that can then be "released" in
>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>>> exists only to appease the verifier that the trusted cpumask kptrs
>>> aren't being leaked and are having their references "dropped".
>>
>> why is it KF_ACQUIRE ?
>> I think it can just return a trusted pointer without acquire.
>>
>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>
>>> I'd imagine that we have 2 ways forward if we want to enable progs to
>>> fetch other global cpumasks with static lifetimes (e.g.
>>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>>
>>> 1. The most straightforward thing to do would be to add a new kfunc in
>>> kernel/bpf/cpumask.c that's a drop-in replacment for
>>> scx_bpf_put_idle_cpumask():
>>>
>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>>> {}
>>>
>>> 2. Another would be to implement something resembling what Yonghong
>>> suggested in [1], where progs can link against global allocated kptrs
>>> like:
>>>
>>> const struct cpumask *__cpu_possible_mask __ksym;
>>>
>>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>>
>>> In my opinion (1) is more straightforward, (2) is a better UX.
>>
>> 1 = adding few kfuncs.
>> 2 = teaching pahole to emit certain global vars.
>>
>> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
>> 1998
>>
>> imo BTF increase trade off is acceptable.
>
> Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
>
> What was gating that work in my mind was the previous discussion around
> splitting aspects of BTF into a "vmlinux-extra". Experiments with this
> seemed to show it's hard to support, and worse, tooling would have to
> learn about its existence. We would have to come up with a CONFIG
> convention specifying what ends up in -extra versus core vmlinux BTF,
> decide what to do about modules, etc. It all feels like over-complication.
>
> I think a better path would be to support BTF in a vmlinux BTF module
> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
> separately loadable, but puts vmlinux in the same place for tools -
> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
> for embedded use cases that have come up a few times, and lessens
> concerns about BTF size for other users, while it all works with
> existing tooling. I have a basic proof-of-concept but it will take time
> to hammer into shape.
>
> Since variable-related size increases are pretty modest, should we
> proceed with the BTF variable support anyway? We can modularize BTF
> separately later on for those concerned about BTF size.
Alan, it seems a consensus has been reached that we should include
global variables (excluding special kernel-made ones like
__SCK_ and __tracepoint_) in vmlinux BTF.
Please go ahead and propose a patch. Thanks!
>
> [1]
> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
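Once pahole emits such variables into vmlinux BTF, their presence could be checked with bpftool, for example (assuming a BTF-enabled kernel and that `__cpu_possible_mask` ends up among the emitted variables):

```shell
# Look for the VAR entry of a kernel global in vmlinux BTF.
bpftool btf dump file /sys/kernel/btf/vmlinux | grep "VAR '__cpu_possible_mask'"
```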
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
2023-08-03 15:22 ` Yonghong Song
@ 2023-08-03 16:10 ` Alan Maguire
0 siblings, 0 replies; 18+ messages in thread
From: Alan Maguire @ 2023-08-03 16:10 UTC (permalink / raw)
To: yonghong.song, Alexei Starovoitov, David Vernet
Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
Stephen Brennan
On 03/08/2023 16:22, Yonghong Song wrote:
>
>
> On 8/3/23 1:21 AM, Alan Maguire wrote:
>> On 02/08/2023 17:33, Alexei Starovoitov wrote:
>>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>>>> I agree that this is the correct way to generalize this. The only thing
>>>> that we'll have to figure out is how to generalize treating const
>>>> struct
>>>> cpumask * objects as kptrs. In sched_ext [0] we export
>>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>>>> return trusted global cpumask kptrs that can then be "released" in
>>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>>>> exists only to appease the verifier that the trusted cpumask kptrs
>>>> aren't being leaked and are having their references "dropped".
>>>
>>> why is it KF_ACQUIRE ?
>>> I think it can just return a trusted pointer without acquire.
>>>
>>>> [0]:
>>>> https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>
>>>> I'd imagine that we have 2 ways forward if we want to enable progs to
>>>> fetch other global cpumasks with static lifetimes (e.g.
>>>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>>>
>>>> 1. The most straightforward thing to do would be to add a new kfunc in
>>>> kernel/bpf/cpumask.c that's a drop-in replacement for
>>>> scx_bpf_put_idle_cpumask():
>>>>
>>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>>>> {}
>>>>
>>>> 2. Another would be to implement something resembling what Yonghong
>>>> suggested in [1], where progs can link against global allocated
>>>> kptrs
>>>> like:
>>>>
>>>> const struct cpumask *__cpu_possible_mask __ksym;
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>>>
>>>> In my opinion (1) is more straightforward, (2) is a better UX.
>>>
>>> 1 = adding few kfuncs.
>>> 2 = teaching pahole to emit certain global vars.
>>>
>>> nm vmlinux|grep -w D|grep -v __SCK_|grep -v __tracepoint_|wc -l
>>> 1998
>>>
>>> imo BTF increase trade off is acceptable.
>>
>> Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
>>
>> What was gating that work in my mind was the previous discussion around
>> splitting aspects of BTF into a "vmlinux-extra". Experiments with this
>> seemed to show it's hard to support, and worse, tooling would have to
>> learn about its existence. We would have to come up with a CONFIG
>> convention specifying what ends up in -extra versus core vmlinux BTF,
>> decide what to do about modules, etc. It all feels like over-complication.
>>
>> I think a better path would be to support BTF in a vmlinux BTF module
>> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
>> separately loadable, but puts vmlinux in the same place for tools -
>> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
>> for embedded use cases that have come up a few times, and lessens
>> concerns about BTF size for other users, while it all works with
>> existing tooling. I have a basic proof-of-concept but it will take time
>> to hammer into shape.
>>
>> Since variable-related size increases are pretty modest, should we
>> proceed with the BTF variable support anyway? We can modularize BTF
>> separately later on for those concerned about BTF size.
>
> Alan, it seems a consensus has been reached that we should include
> global variables (excluding special kernel-made ones like
> __SCK_ and __tracepoint_) in vmlinux BTF.
> Please go ahead and propose a patch. Thanks!
>
Sounds good! Stephen and I will hopefully have something ready soon;
the changes are mostly in dwarves plus a kernel selftest. And
good idea on the filtering; this will eliminate a lot of unneeded
variables and cut down on size overheads. Thanks!
Alan
>>
>> [1]
>> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
end of thread, other threads:[~2023-08-03 16:11 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 2/3] cgroup, psi: Init root cgroup psi to psi_system Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add selftest for for_each_cpu Yafang Shao
2023-08-01 17:53 ` [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yonghong Song
2023-08-02 2:33 ` Yafang Shao
2023-08-02 2:45 ` Alexei Starovoitov
2023-08-02 2:57 ` Yafang Shao
2023-08-02 3:29 ` David Vernet
2023-08-02 6:54 ` Yonghong Song
2023-08-02 15:46 ` David Vernet
2023-08-02 16:23 ` Alexei Starovoitov
2023-08-02 16:33 ` Alexei Starovoitov
2023-08-02 17:06 ` David Vernet
2023-08-02 18:13 ` Alexei Starovoitov
2023-08-03 8:21 ` Alan Maguire
2023-08-03 15:22 ` Yonghong Song
2023-08-03 16:10 ` Alan Maguire