* [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data
@ 2025-01-13 15:24 Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Leon Hwang @ 2025-01-13 15:24 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, yonghong.song, song, eddyz87, leon.hwang,
kernel-patches-bot
This patch set introduces global per-CPU data, similar to commit
6316f78306c1 ("Merge branch 'support-global-data'"), to reduce restrictions
in C for BPF programs.
With this enhancement, it becomes possible to define and use global per-CPU
variables, much like the DEFINE_PER_CPU() macro in the kernel[0].
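For illustration, a minimal sketch of what defining and using such a
variable could look like in BPF C (the variable and program names here
are made up for this example):

/* Illustrative sketch only; names are hypothetical. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* One instance of this counter exists for each possible CPU. */
__u64 pkt_count SEC(".data..percpu");

SEC("tc")
int count_packets(struct __sk_buff *skb)
{
	/* Touches this CPU's copy directly, with no
	 * bpf_get_smp_processor_id() call to index an array.
	 */
	pkt_count++;
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";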
The idea stems from the bpflbr project[1], which itself was inspired by
retsnoop[2]. During testing of bpflbr on the v6.6 kernel, two LBR
(Last Branch Record) entries were observed related to the
bpf_get_smp_processor_id() helper.
Since commit 1ae6921009e5 ("bpf: inline bpf_get_smp_processor_id() helper"),
the bpf_get_smp_processor_id() helper has been inlined on x86_64, reducing
the overhead and consequently minimizing these two LBR records.
However, the introduction of global per-CPU data offers a more robust
solution. By leveraging the percpu_array map and percpu instructions,
global per-CPU data can be implemented intrinsically.
This feature also facilitates sharing per-CPU information between tail
callers and callees or between freplace callers and callees through a
shared global per-CPU variable. Previously, this was achieved using a
1-entry percpu map, which this patch set aims to improve upon.
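For comparison, a rough sketch of that 1-entry percpu map workaround
(a minimal sketch, not taken from any particular project; it assumes
the usual <linux/bpf.h> and <bpf/bpf_helpers.h> includes):

/* Pre-existing workaround: a 1-entry percpu_array map shared between
 * tail callers and callees.
 */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} shared_state SEC(".maps");

static __always_inline __u64 *get_state(void)
{
	__u32 key = 0;

	/* Every access pays for a map lookup; a global per-CPU
	 * variable turns this into a direct memory access.
	 */
	return bpf_map_lookup_elem(&shared_state, &key);
}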
Links:
[0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
[1] https://github.com/Asphaltt/bpflbr
[2] https://github.com/anakryiko/retsnoop
Leon Hwang (2):
bpf: Introduce global percpu data
selftests/bpf: Add a case to test global percpu data
kernel/bpf/arraymap.c | 39 +++++-
kernel/bpf/verifier.c | 45 +++++++
tools/lib/bpf/libbpf.c | 112 ++++++++++++++----
.../bpf/prog_tests/global_data_init.c | 83 ++++++++++++-
.../bpf/progs/test_global_percpu_data.c | 21 ++++
5 files changed, 274 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_global_percpu_data.c
--
2.47.1
* [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data
2025-01-13 15:24 [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data Leon Hwang
@ 2025-01-13 15:24 ` Leon Hwang
2025-01-14 23:10 ` Andrii Nakryiko
2025-01-13 15:24 ` [RFC PATCH bpf-next 2/2] selftests/bpf: Add a case to test " Leon Hwang
2025-01-13 16:58 ` [RFC PATCH bpf-next 0/2] bpf: Introduce " Daniel Xu
2 siblings, 1 reply; 9+ messages in thread
From: Leon Hwang @ 2025-01-13 15:24 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, yonghong.song, song, eddyz87, leon.hwang,
kernel-patches-bot
This patch introduces global per-CPU data, inspired by commit
6316f78306c1 ("Merge branch 'support-global-data'"). It enables the
definition of global per-CPU variables in BPF, similar to the
DEFINE_PER_CPU() macro in the kernel[0].
For example, in BPF, a global per-CPU variable can be defined like
this:
int percpu_data SEC(".data..percpu");
With this patch, tools like retsnoop[1] and bpflbr[2] can simplify their
BPF code for handling LBRs. The code can be updated from
static struct perf_branch_entry lbrs[1][MAX_LBR_ENTRIES] SEC(".data.lbrs");
to
static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");
This eliminates the need to retrieve the CPU ID using the
bpf_get_smp_processor_id() helper.
Additionally, by reusing global per-CPU variables, sharing information
between tail callers and callees or freplace callers and callees becomes
simpler compared to using percpu_array maps.
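A sketch of how the LBR capture site could then look (the helper usage
is illustrative, not copied from either project; it assumes
MAX_LBR_ENTRIES is defined and the usual vmlinux.h/bpf_helpers.h
includes):

/* Each CPU snapshots into its own copy of lbrs, so no CPU-ID indexing
 * is needed.
 */
static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");

static __always_inline void capture_lbrs(void)
{
	bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);
}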
Links:
[0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
[1] https://github.com/anakryiko/retsnoop
[2] https://github.com/Asphaltt/bpflbr
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/arraymap.c | 39 +++++++++++++-
kernel/bpf/verifier.c | 45 +++++++++++++++++
tools/lib/bpf/libbpf.c | 112 ++++++++++++++++++++++++++++++++---------
3 files changed, 171 insertions(+), 25 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index eb28c0f219ee4..f8c60d8331975 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -249,6 +249,40 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
return this_cpu_ptr(array->pptrs[index & array->index_mask]);
}
+static int percpu_array_map_direct_value_addr(const struct bpf_map *map,
+ u64 *imm, u32 off)
+{
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+ if (map->max_entries != 1)
+ return -EOPNOTSUPP;
+ if (off >= map->value_size)
+ return -EINVAL;
+ if (!bpf_jit_supports_percpu_insn())
+ return -EOPNOTSUPP;
+
+ *imm = (u64) array->pptrs[0];
+ return 0;
+}
+
+static int percpu_array_map_direct_value_meta(const struct bpf_map *map,
+ u64 imm, u32 *off)
+{
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+ u64 base = (u64) array->pptrs[0];
+ u64 range = array->elem_size;
+
+ if (map->max_entries != 1)
+ return -EOPNOTSUPP;
+ if (imm < base || imm >= base + range)
+ return -ENOENT;
+ if (!bpf_jit_supports_percpu_insn())
+ return -EOPNOTSUPP;
+
+ *off = imm - base;
+ return 0;
+}
+
/* emit BPF instructions equivalent to C code of percpu_array_map_lookup_elem() */
static int percpu_array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
{
@@ -534,7 +568,8 @@ static int array_map_check_btf(const struct bpf_map *map,
/* One exception for keyless BTF: .bss/.data/.rodata map */
if (btf_type_is_void(key_type)) {
- if (map->map_type != BPF_MAP_TYPE_ARRAY ||
+ if ((map->map_type != BPF_MAP_TYPE_ARRAY &&
+ map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) ||
map->max_entries != 1)
return -EINVAL;
@@ -815,6 +850,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
.map_get_next_key = array_map_get_next_key,
.map_lookup_elem = percpu_array_map_lookup_elem,
.map_gen_lookup = percpu_array_map_gen_lookup,
+ .map_direct_value_addr = percpu_array_map_direct_value_addr,
+ .map_direct_value_meta = percpu_array_map_direct_value_meta,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
.map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b8ca227c78af1..94ce02a48ddc1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6809,6 +6809,8 @@ static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val,
u64 addr;
int err;
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
+ return -EINVAL;
err = map->ops->map_direct_value_addr(map, &addr, off);
if (err)
return err;
@@ -7324,6 +7326,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
/* if map is read-only, track its contents as scalars */
if (tnum_is_const(reg->var_off) &&
bpf_map_is_rdonly(map) &&
+ map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY &&
map->ops->map_direct_value_addr) {
int map_off = off + reg->var_off.value;
u64 val = 0;
@@ -9140,6 +9143,11 @@ static int check_reg_const_str(struct bpf_verifier_env *env,
return -EACCES;
}
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ verbose(env, "percpu_array map does not support direct string value access\n");
+ return -EINVAL;
+ }
+
err = check_map_access(env, regno, reg->off,
map->value_size - reg->off, false,
ACCESS_HELPER);
@@ -10751,6 +10759,11 @@ static int check_bpf_snprintf_call(struct bpf_verifier_env *env,
return -EINVAL;
num_args = data_len_reg->var_off.value / 8;
+ if (fmt_map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ verbose(env, "percpu_array map does not support snprintf\n");
+ return -EINVAL;
+ }
+
/* fmt being ARG_PTR_TO_CONST_STR guarantees that var_off is const
* and map_direct_value_addr is set.
*/
@@ -21304,6 +21317,38 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
goto next_insn;
}
+#ifdef CONFIG_SMP
+ if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
+ (insn->src_reg == BPF_PSEUDO_MAP_VALUE ||
+ insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE)) {
+ struct bpf_map *map;
+
+ aux = &env->insn_aux_data[i + delta];
+ map = env->used_maps[aux->map_index];
+ if (map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY)
+ goto next_insn;
+
+ /* Reuse the original ld_imm64 insn. And add one
+ * mov64_percpu_reg insn.
+ */
+
+ insn_buf[0] = insn[1];
+ insn_buf[1] = BPF_MOV64_PERCPU_REG(insn->dst_reg, insn->dst_reg);
+ cnt = 2;
+
+ i++;
+ new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+ if (!new_prog)
+ return -ENOMEM;
+
+ delta += cnt - 1;
+ env->prog = prog = new_prog;
+ insn = new_prog->insnsi + i + delta;
+
+ goto next_insn;
+ }
+#endif
+
if (insn->code != (BPF_JMP | BPF_CALL))
goto next_insn;
if (insn->src_reg == BPF_PSEUDO_CALL)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6c262d0152f81..881174f4f90a4 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -516,6 +516,7 @@ struct bpf_struct_ops {
};
#define DATA_SEC ".data"
+#define PERCPU_DATA_SEC ".data..percpu"
#define BSS_SEC ".bss"
#define RODATA_SEC ".rodata"
#define KCONFIG_SEC ".kconfig"
@@ -562,6 +563,8 @@ struct bpf_map {
__u32 btf_value_type_id;
__u32 btf_vmlinux_value_type_id;
enum libbpf_map_type libbpf_type;
+ int num_cpus;
+ void *data;
void *mmaped;
struct bpf_struct_ops *st_ops;
struct bpf_map *inner_map;
@@ -1923,11 +1926,35 @@ static bool map_is_mmapable(struct bpf_object *obj, struct bpf_map *map)
return false;
}
+static bool map_is_percpu_data(struct bpf_map *map)
+{
+ return str_has_pfx(map->real_name, PERCPU_DATA_SEC);
+}
+
+static void map_copy_data(struct bpf_map *map, const void *data)
+{
+ bool is_percpu_data = map_is_percpu_data(map);
+ size_t data_sz = map->def.value_size;
+ size_t elem_sz = roundup(data_sz, 8);
+ int i;
+
+ if (!data)
+ return;
+
+ if (!is_percpu_data)
+ memcpy(map->mmaped, data, data_sz);
+ else
+ for (i = 0; i < map->num_cpus; i++)
+ memcpy(map->data + i*elem_sz, data, data_sz);
+}
+
static int
bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
const char *real_name, int sec_idx, void *data, size_t data_sz)
{
+ bool is_percpu_data = str_has_pfx(real_name, PERCPU_DATA_SEC);
struct bpf_map_def *def;
+ const char *data_desc;
struct bpf_map *map;
size_t mmap_sz;
int err;
@@ -1948,7 +1975,8 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
}
def = &map->def;
- def->type = BPF_MAP_TYPE_ARRAY;
+ def->type = is_percpu_data ? BPF_MAP_TYPE_PERCPU_ARRAY
+ : BPF_MAP_TYPE_ARRAY;
def->key_size = sizeof(int);
def->value_size = data_sz;
def->max_entries = 1;
@@ -1958,29 +1986,57 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
/* failures are fine because of maps like .rodata.str1.1 */
(void) map_fill_btf_type_info(obj, map);
- if (map_is_mmapable(obj, map))
- def->map_flags |= BPF_F_MMAPABLE;
+ data_desc = is_percpu_data ? "percpu " : "";
+ pr_debug("map '%s' (global %sdata): at sec_idx %d, offset %zu, flags %x.\n",
+ map->name, data_desc, map->sec_idx, map->sec_offset,
+ def->map_flags);
- pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
- map->name, map->sec_idx, map->sec_offset, def->map_flags);
+ if (is_percpu_data) {
+ map->num_cpus = libbpf_num_possible_cpus();
+ if (map->num_cpus < 0) {
+ err = errno;
+ pr_warn("failed to get possible cpus\n");
+ goto free_name;
+ }
- mmap_sz = bpf_map_mmap_sz(map);
- map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_ANONYMOUS, -1, 0);
- if (map->mmaped == MAP_FAILED) {
- err = -errno;
- map->mmaped = NULL;
- pr_warn("failed to alloc map '%s' content buffer: %s\n", map->name, errstr(err));
- zfree(&map->real_name);
- zfree(&map->name);
- return err;
- }
+ map->data = calloc(map->num_cpus, roundup(data_sz, 8));
+ if (!map->data) {
+ err = -ENOMEM;
+ pr_warn("failed to alloc percpu map '%s' content buffer: %s\n",
+ map->name, errstr(err));
+ goto free_name;
+ }
- if (data)
- memcpy(map->mmaped, data, data_sz);
+ if (data)
+ map_copy_data(map, data);
+ else
+ memset(map->data, 0, map->num_cpus*roundup(data_sz, 8));
+ } else {
+ if (map_is_mmapable(obj, map))
+ def->map_flags |= BPF_F_MMAPABLE;
+
+ mmap_sz = bpf_map_mmap_sz(map);
+ map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (map->mmaped == MAP_FAILED) {
+ err = -errno;
+ map->mmaped = NULL;
+ pr_warn("failed to alloc map '%s' content buffer: %s\n",
+ map->name, errstr(err));
+ goto free_name;
+ }
+
+ if (data)
+ memcpy(map->mmaped, data, data_sz);
+ }
pr_debug("map %td is \"%s\"\n", map - obj->maps, map->name);
return 0;
+
+free_name:
+ zfree(&map->real_name);
+ zfree(&map->name);
+ return err;
}
static int bpf_object__init_global_data_maps(struct bpf_object *obj)
@@ -5127,16 +5183,21 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)
enum libbpf_map_type map_type = map->libbpf_type;
int err, zero = 0;
size_t mmap_sz;
+ size_t data_sz;
+ void *data;
+ data_sz = map_is_percpu_data(map) ? roundup(map->def.value_size, 8)*map->num_cpus
+ : map->def.value_size;
+ data = map_is_percpu_data(map) ? map->data : map->mmaped;
if (obj->gen_loader) {
bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,
- map->mmaped, map->def.value_size);
+ data, data_sz);
if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG)
bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);
return 0;
}
- err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);
+ err = bpf_map_update_elem(map->fd, &zero, data, 0);
if (err) {
err = -errno;
pr_warn("map '%s': failed to set initial contents: %s\n",
@@ -9041,6 +9102,8 @@ static void bpf_map__destroy(struct bpf_map *map)
if (map->mmaped && map->mmaped != map->obj->arena_data)
munmap(map->mmaped, bpf_map_mmap_sz(map));
map->mmaped = NULL;
+ if (map->data)
+ zfree(&map->data);
if (map->st_ops) {
zfree(&map->st_ops->data);
@@ -10348,7 +10411,8 @@ int bpf_map__set_initial_value(struct bpf_map *map,
if (map->obj->loaded || map->reused)
return libbpf_err(-EBUSY);
- if (!map->mmaped || map->libbpf_type == LIBBPF_MAP_KCONFIG)
+ if ((!map->mmaped && !map->data) ||
+ map->libbpf_type == LIBBPF_MAP_KCONFIG)
return libbpf_err(-EINVAL);
if (map->def.type == BPF_MAP_TYPE_ARENA)
@@ -10358,7 +10422,7 @@ int bpf_map__set_initial_value(struct bpf_map *map,
if (size != actual_sz)
return libbpf_err(-EINVAL);
- memcpy(map->mmaped, data, size);
+ map_copy_data(map, data);
return 0;
}
@@ -10370,7 +10434,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
return map->st_ops->data;
}
- if (!map->mmaped)
+ if ((!map->mmaped && !map->data))
return NULL;
if (map->def.type == BPF_MAP_TYPE_ARENA)
@@ -10378,7 +10442,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
else
*psize = map->def.value_size;
- return map->mmaped;
+ return map->def.type == BPF_MAP_TYPE_PERCPU_ARRAY ? map->data : map->mmaped;
}
bool bpf_map__is_internal(const struct bpf_map *map)
--
2.47.1
* [RFC PATCH bpf-next 2/2] selftests/bpf: Add a case to test global percpu data
2025-01-13 15:24 [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
@ 2025-01-13 15:24 ` Leon Hwang
2025-01-13 16:58 ` [RFC PATCH bpf-next 0/2] bpf: Introduce " Daniel Xu
2 siblings, 0 replies; 9+ messages in thread
From: Leon Hwang @ 2025-01-13 15:24 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, yonghong.song, song, eddyz87, leon.hwang,
kernel-patches-bot
If the arch, like s390x, does not support the percpu insn, this case
skips testing global percpu data by checking for -EOPNOTSUPP when
loading the prog.
cd tools/testing/selftests/bpf; ./test_progs -t global_percpu_data
124 global_percpu_data_init:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
.../bpf/prog_tests/global_data_init.c | 83 ++++++++++++++++++-
.../bpf/progs/test_global_percpu_data.c | 21 +++++
2 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_global_percpu_data.c
diff --git a/tools/testing/selftests/bpf/prog_tests/global_data_init.c b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
index 8466332d7406f..634ff531bf5a4 100644
--- a/tools/testing/selftests/bpf/prog_tests/global_data_init.c
+++ b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <test_progs.h>
+#include "test_global_percpu_data.skel.h"
void test_global_data_init(void)
{
@@ -8,7 +9,7 @@ void test_global_data_init(void)
__u8 *buff = NULL, *newval = NULL;
struct bpf_object *obj;
struct bpf_map *map;
- __u32 duration = 0;
+ __u32 duration = 0;
size_t sz;
obj = bpf_object__open_file(file, NULL);
@@ -60,3 +61,83 @@ void test_global_data_init(void)
free(newval);
bpf_object__close(obj);
}
+
+void test_global_percpu_data_init(void)
+{
+ struct test_global_percpu_data *skel = NULL;
+ u64 *percpu_data = NULL;
+ struct bpf_map *map;
+ size_t init_data_sz;
+ char buff[128] = {};
+ int init_value = 2;
+ int key, value_sz;
+ int prog_fd, err;
+ int *init_data;
+ int num_cpus;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = buff,
+ .data_size_in = sizeof(buff),
+ .repeat = 1,
+ );
+
+ num_cpus = libbpf_num_possible_cpus();
+ if (!ASSERT_GT(num_cpus, 0, "libbpf_num_possible_cpus"))
+ return;
+
+ percpu_data = calloc(num_cpus, sizeof(*percpu_data));
+ if (!ASSERT_FALSE(percpu_data == NULL, "calloc percpu_data"))
+ return;
+
+ value_sz = sizeof(*percpu_data) * num_cpus;
+ memset(percpu_data, 0, value_sz);
+
+ skel = test_global_percpu_data__open();
+ if (!ASSERT_OK_PTR(skel, "test_global_percpu_data__open"))
+ goto out;
+
+ map = skel->maps.data__percpu;
+ err = bpf_map__set_initial_value(map, &init_value,
+ sizeof(init_value));
+ if (!ASSERT_OK(err, "bpf_map__set_initial_value"))
+ goto out;
+
+ init_data = bpf_map__initial_value(map, &init_data_sz);
+ if (!ASSERT_OK_PTR(init_data, "bpf_map__initial_value"))
+ goto out;
+
+ ASSERT_EQ(*init_data, init_value, "initial_value");
+ ASSERT_EQ(init_data_sz, sizeof(init_value), "initial_value size");
+
+ err = test_global_percpu_data__load(skel);
+ if (err == -EOPNOTSUPP)
+ goto out;
+ if (!ASSERT_OK(err, "test_global_percpu_data__load"))
+ goto out;
+
+ ASSERT_EQ(bpf_map__type(map), BPF_MAP_TYPE_PERCPU_ARRAY,
+ "bpf_map__type");
+
+ prog_fd = bpf_program__fd(skel->progs.update_percpu_data);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "update_percpu_data");
+ ASSERT_EQ(topts.retval, 0, "update_percpu_data retval");
+
+ key = 0;
+ err = bpf_map__lookup_elem(map, &key, sizeof(key), percpu_data,
+ value_sz, 0);
+ if (!ASSERT_OK(err, "bpf_map__lookup_elem"))
+ goto out;
+
+ if (!ASSERT_LT(skel->bss->curr_cpu, num_cpus, "curr_cpu"))
+ goto out;
+ ASSERT_EQ(percpu_data[skel->bss->curr_cpu], 1, "percpu_data");
+ if (num_cpus > 1)
+ ASSERT_EQ(percpu_data[(skel->bss->curr_cpu+1)%num_cpus],
+ init_value, "init_value");
+
+out:
+ test_global_percpu_data__destroy(skel);
+ if (percpu_data)
+ free(percpu_data);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_global_percpu_data.c b/tools/testing/selftests/bpf/progs/test_global_percpu_data.c
new file mode 100644
index 0000000000000..2f01051b782c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_global_percpu_data.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Leon Hwang */
+
+#include <linux/bpf.h>
+#include <linux/pkt_cls.h>
+
+#include <bpf/bpf_helpers.h>
+
+int percpu_data SEC(".data..percpu");
+int curr_cpu;
+
+SEC("tc")
+int update_percpu_data(struct __sk_buff *skb)
+{
+ curr_cpu = bpf_get_smp_processor_id();
+ percpu_data = 1;
+
+ return TC_ACT_OK;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data
2025-01-13 15:24 [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 2/2] selftests/bpf: Add a case to test " Leon Hwang
@ 2025-01-13 16:58 ` Daniel Xu
2025-01-14 6:35 ` Leon Hwang
2 siblings, 1 reply; 9+ messages in thread
From: Daniel Xu @ 2025-01-13 16:58 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
Hi Leon,
On Mon, Jan 13, 2025 at 11:24:35PM +0800, Leon Hwang wrote:
> This patch set introduces global per-CPU data, similar to commit
> 6316f78306c1 ("Merge branch 'support-global-data'"), to reduce restrictions
> in C for BPF programs.
>
> With this enhancement, it becomes possible to define and use global per-CPU
> variables, much like the DEFINE_PER_CPU() macro in the kernel[0].
>
> The idea stems from the bpflbr project[1], which itself was inspired by
> retsnoop[2]. During testing of bpflbr on the v6.6 kernel, two LBR
> (Last Branch Record) entries were observed related to the
> bpf_get_smp_processor_id() helper.
>
> Since commit 1ae6921009e5 ("bpf: inline bpf_get_smp_processor_id() helper"),
> the bpf_get_smp_processor_id() helper has been inlined on x86_64, reducing
> the overhead and consequently minimizing these two LBR records.
>
> However, the introduction of global per-CPU data offers a more robust
> solution. By leveraging the percpu_array map and percpu instructions,
> global per-CPU data can be implemented intrinsically.
>
> This feature also facilitates sharing per-CPU information between tail
> callers and callees or between freplace callers and callees through a
> shared global per-CPU variable. Previously, this was achieved using a
> 1-entry percpu map, which this patch set aims to improve upon.
I think this would be great to have. bpftrace would've liked to use this
for its recent big string support, but instead had to simulate a percpu
global through regular globals.
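That workaround looks roughly like the following (an illustrative
sketch, not bpftrace's actual code; MAX_CPUS is an assumed upper
bound):

/* Simulate a per-CPU global with a regular global: one slot per
 * possible CPU, indexed manually.
 */
#define MAX_CPUS 128

char scratch[MAX_CPUS][4096] SEC(".data.scratch");

static __always_inline char *get_scratch(void)
{
	__u32 cpu = bpf_get_smp_processor_id();

	if (cpu >= MAX_CPUS)
		return NULL;
	return scratch[cpu];
}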
Thanks,
Daniel
* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data
2025-01-13 16:58 ` [RFC PATCH bpf-next 0/2] bpf: Introduce " Daniel Xu
@ 2025-01-14 6:35 ` Leon Hwang
0 siblings, 0 replies; 9+ messages in thread
From: Leon Hwang @ 2025-01-14 6:35 UTC (permalink / raw)
To: Daniel Xu
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
Hi Daniel Xu,
On 14/1/25 00:58, Daniel Xu wrote:
> Hi Leon,
>
> On Mon, Jan 13, 2025 at 11:24:35PM +0800, Leon Hwang wrote:
>> This patch set introduces global per-CPU data, similar to commit
>> 6316f78306c1 ("Merge branch 'support-global-data'"), to reduce restrictions
>> in C for BPF programs.
>>
>> With this enhancement, it becomes possible to define and use global per-CPU
>> variables, much like the DEFINE_PER_CPU() macro in the kernel[0].
>>
>> The idea stems from the bpflbr project[1], which itself was inspired by
>> retsnoop[2]. During testing of bpflbr on the v6.6 kernel, two LBR
>> (Last Branch Record) entries were observed related to the
>> bpf_get_smp_processor_id() helper.
>>
>> Since commit 1ae6921009e5 ("bpf: inline bpf_get_smp_processor_id() helper"),
>> the bpf_get_smp_processor_id() helper has been inlined on x86_64, reducing
>> the overhead and consequently minimizing these two LBR records.
>>
>> However, the introduction of global per-CPU data offers a more robust
>> solution. By leveraging the percpu_array map and percpu instructions,
>> global per-CPU data can be implemented intrinsically.
>>
>> This feature also facilitates sharing per-CPU information between tail
>> callers and callees or between freplace callers and callees through a
>> shared global per-CPU variable. Previously, this was achieved using a
>> 1-entry percpu map, which this patch set aims to improve upon.
>
> I think this would be great to have. bpftrace would've liked to use this
> for its recent big string support, but instead had to simulate a percpu
> global through regular globals.
>
Thank you for the feedback! I'm glad to hear that this feature could
help simplify bpftrace, especially for use cases like the recent big
string support. It's great to know this feature has the potential to address
more real-world challenges.
Thanks,
Leon
* Re: [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data
2025-01-13 15:24 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
@ 2025-01-14 23:10 ` Andrii Nakryiko
2025-01-16 7:22 ` Leon Hwang
0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2025-01-14 23:10 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
On Mon, Jan 13, 2025 at 7:25 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> This patch introduces global per-CPU data, inspired by commit
> 6316f78306c1 ("Merge branch 'support-global-data'"). It enables the
> definition of global per-CPU variables in BPF, similar to the
> DEFINE_PER_CPU() macro in the kernel[0].
>
> For example, in BPF, a global per-CPU variable can be defined like
> this:
>
> int percpu_data SEC(".data..percpu");
>
> With this patch, tools like retsnoop[1] and bpflbr[2] can simplify their
> BPF code for handling LBRs. The code can be updated from
>
> static struct perf_branch_entry lbrs[1][MAX_LBR_ENTRIES] SEC(".data.lbrs");
>
> to
>
> static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");
>
> This eliminates the need to retrieve the CPU ID using the
> bpf_get_smp_processor_id() helper.
>
> Additionally, by reusing global per-CPU variables, sharing information
> between tail callers and callees or freplace callers and callees becomes
> simpler compared to using percpu_array maps.
>
> Links:
> [0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
> [1] https://github.com/anakryiko/retsnoop
> [2] https://github.com/Asphaltt/bpflbr
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> kernel/bpf/arraymap.c | 39 +++++++++++++-
> kernel/bpf/verifier.c | 45 +++++++++++++++++
> tools/lib/bpf/libbpf.c | 112 ++++++++++++++++++++++++++++++++---------
> 3 files changed, 171 insertions(+), 25 deletions(-)
>
So I think the feature overall makes sense, but we need to think
through at least libbpf's side of things some more. Unlike .data,
per-cpu .data section is not mmapable, and so that has implication on
BPF skeleton and we should make sure all that makes sense on BPF
skeleton side. In that sense, per-cpu global data is more akin to
struct_ops initialization image, which can be accessed by user before
skeleton is loaded to initialize the image.
There are a few things to consider. What's the BPF skeleton interface?
Do we expose it as single struct and use that struct as initial image
for each CPU (which means user won't be able to initialize different
CPU data differently, at least not through BPF skeleton facilities)?
Or do we expose this as an array of structs and let user set each CPU
data independently?
I feel like keeping it simple and having one image for all CPUs would
cover most cases. And users can still access the underlying
PERCPU_ARRAY map if they need more control.
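To make the options concrete, here is a rough sketch of what the
generated skeleton could expose under the single-image option (all
struct and field names are hypothetical):

/* Hypothetical skeleton layout; names are made up. */
struct test_skel {
	struct bpf_object *obj;
	struct {
		struct bpf_map *data__percpu;
	} maps;
	struct test_skel__percpu {
		int percpu_data;
	} *percpu;	/* single image, copied to every CPU at load */
};

/* The per-CPU alternative would instead expose an array with
 * libbpf_num_possible_cpus() elements:
 *
 *	struct test_skel__percpu *percpu;	// indexed by CPU
 */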
But either way, we need tests for skeleton, making sure we NULL-out
this per-cpu global data, but take it into account before the load.
Also, this huge calloc for possible CPUs, I'd like to avoid it
altogether for the (probably very common) zero-initialized case.
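E.g., something like the following check could gate the allocation (a
sketch only; the helper name is made up):

/* Skip allocating the per-CPU image when the section data is absent
 * or all zeros; freshly created percpu_array values are zero-filled
 * anyway.
 */
static bool sec_data_is_zero(const void *data, size_t sz)
{
	const char *p = data;
	size_t i;

	for (i = 0; i < sz; i++) {
		if (p[i])
			return false;
	}
	return true;
}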
So in short, needs a bit of iteration to figure out all the
interfacing issues, but makes sense overall. See some more low-level
remarks below.
pw-bot: cr
[...]
> @@ -815,6 +850,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
> .map_get_next_key = array_map_get_next_key,
> .map_lookup_elem = percpu_array_map_lookup_elem,
> .map_gen_lookup = percpu_array_map_gen_lookup,
> + .map_direct_value_addr = percpu_array_map_direct_value_addr,
> + .map_direct_value_meta = percpu_array_map_direct_value_meta,
> .map_update_elem = array_map_update_elem,
> .map_delete_elem = array_map_delete_elem,
> .map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b8ca227c78af1..94ce02a48ddc1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6809,6 +6809,8 @@ static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val,
> u64 addr;
> int err;
>
> + if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
> + return -EINVAL;
I'd invert the condition and reject anything but BPF_MAP_TYPE_ARRAY. I
don't see how any other possible map type can support this magic that
we do with ARRAY. So to be on the safe side, let's just reject
everything that's non-ARRAY (here and elsewhere)?
> err = map->ops->map_direct_value_addr(map, &addr, off);
> if (err)
> return err;
[...]
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6c262d0152f81..881174f4f90a4 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
definitely let's split out libbpf changes from kernel-side changes
> @@ -516,6 +516,7 @@ struct bpf_struct_ops {
> };
>
> #define DATA_SEC ".data"
> +#define PERCPU_DATA_SEC ".data..percpu"
I don't like this prefix, even if that's what we have in the kernel.
Something like just ".percpu" or ".percpu_data" or ".data_percpu" is
better, IMO.
> #define BSS_SEC ".bss"
> #define RODATA_SEC ".rodata"
> #define KCONFIG_SEC ".kconfig"
> @@ -562,6 +563,8 @@ struct bpf_map {
> __u32 btf_value_type_id;
> __u32 btf_vmlinux_value_type_id;
> enum libbpf_map_type libbpf_type;
> + int num_cpus;
> + void *data;
> void *mmaped;
> struct bpf_struct_ops *st_ops;
> struct bpf_map *inner_map;
> @@ -1923,11 +1926,35 @@ static bool map_is_mmapable(struct bpf_object *obj, struct bpf_map *map)
> return false;
> }
>
> +static bool map_is_percpu_data(struct bpf_map *map)
> +{
> + return str_has_pfx(map->real_name, PERCPU_DATA_SEC);
> +}
we have enum libbpf_map_type which is used to distinguish BSS vs
RODATA vs DATA vs others, let's stick to that
> +
> +static void map_copy_data(struct bpf_map *map, const void *data)
> +{
> + bool is_percpu_data = map_is_percpu_data(map);
> + size_t data_sz = map->def.value_size;
> + size_t elem_sz = roundup(data_sz, 8);
> + int i;
> +
> + if (!data)
> + return;
> +
> + if (!is_percpu_data)
> + memcpy(map->mmaped, data, data_sz);
> + else
> + for (i = 0; i < map->num_cpus; i++)
> + memcpy(map->data + i*elem_sz, data, data_sz);
> +}
> +
> static int
> bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> const char *real_name, int sec_idx, void *data, size_t data_sz)
> {
> + bool is_percpu_data = str_has_pfx(real_name, PERCPU_DATA_SEC);
> struct bpf_map_def *def;
> + const char *data_desc;
> struct bpf_map *map;
> size_t mmap_sz;
> int err;
> @@ -1948,7 +1975,8 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> }
>
> def = &map->def;
> - def->type = BPF_MAP_TYPE_ARRAY;
> + def->type = is_percpu_data ? BPF_MAP_TYPE_PERCPU_ARRAY
> + : BPF_MAP_TYPE_ARRAY;
> def->key_size = sizeof(int);
> def->value_size = data_sz;
> def->max_entries = 1;
> @@ -1958,29 +1986,57 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> /* failures are fine because of maps like .rodata.str1.1 */
> (void) map_fill_btf_type_info(obj, map);
>
> - if (map_is_mmapable(obj, map))
> - def->map_flags |= BPF_F_MMAPABLE;
> + data_desc = is_percpu_data ? "percpu " : "";
nit: just inline the logic in that single place you use it
> + pr_debug("map '%s' (global %sdata): at sec_idx %d, offset %zu, flags %x.\n",
> + map->name, data_desc, map->sec_idx, map->sec_offset,
> + def->map_flags);
>
> - pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
> - map->name, map->sec_idx, map->sec_offset, def->map_flags);
> + if (is_percpu_data) {
> + map->num_cpus = libbpf_num_possible_cpus();
> + if (map->num_cpus < 0) {
> + err = errno;
map->num_cpus should have actual error value already, avoid using
errno unnecessarily
> + pr_warn("failed to get possible cpus\n");
> + goto free_name;
> + }
>
> - mmap_sz = bpf_map_mmap_sz(map);
> - map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
> - MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> - if (map->mmaped == MAP_FAILED) {
> - err = -errno;
> - map->mmaped = NULL;
> - pr_warn("failed to alloc map '%s' content buffer: %s\n", map->name, errstr(err));
> - zfree(&map->real_name);
> - zfree(&map->name);
> - return err;
> - }
> + map->data = calloc(map->num_cpus, roundup(data_sz, 8));
> + if (!map->data) {
> + err = -ENOMEM;
> + pr_warn("failed to alloc percpu map '%s' content buffer: %s\n",
> + map->name, errstr(err));
please stick to the established reporting style (i.e., for map operations we
always use "map '%s': " prefix)
> + goto free_name;
> + }
[...]
> @@ -10370,7 +10434,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
> return map->st_ops->data;
> }
>
> - if (!map->mmaped)
> + if ((!map->mmaped && !map->data))
nit: why unnecessary extra () ?
> return NULL;
>
> if (map->def.type == BPF_MAP_TYPE_ARENA)
> @@ -10378,7 +10442,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
> else
> *psize = map->def.value_size;
>
> - return map->mmaped;
> + return map->def.type == BPF_MAP_TYPE_PERCPU_ARRAY ? map->data : map->mmaped;
> }
>
> bool bpf_map__is_internal(const struct bpf_map *map)
> --
> 2.47.1
>
* Re: [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data
2025-01-14 23:10 ` Andrii Nakryiko
@ 2025-01-16 7:22 ` Leon Hwang
2025-01-16 23:37 ` Andrii Nakryiko
0 siblings, 1 reply; 9+ messages in thread
From: Leon Hwang @ 2025-01-16 7:22 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
Hi,
On 15/1/25 07:10, Andrii Nakryiko wrote:
> On Mon, Jan 13, 2025 at 7:25 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> This patch introduces global per-CPU data, inspired by commit
>> 6316f78306c1 ("Merge branch 'support-global-data'"). It enables the
>> definition of global per-CPU variables in BPF, similar to the
>> DEFINE_PER_CPU() macro in the kernel[0].
>>
>> For example, in BPF, a global per-CPU variable can be defined like
>> this:
>>
>> int percpu_data SEC(".data..percpu");
>>
>> With this patch, tools like retsnoop[1] and bpflbr[2] can simplify their
>> BPF code for handling LBRs. The code can be updated from
>>
>> static struct perf_branch_entry lbrs[1][MAX_LBR_ENTRIES] SEC(".data.lbrs");
>>
>> to
>>
>> static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");
>>
>> This eliminates the need to retrieve the CPU ID using the
>> bpf_get_smp_processor_id() helper.
>>
>> Additionally, by reusing global per-CPU variables, sharing information
>> between tail callers and callees or freplace callers and callees becomes
>> simpler compared to using percpu_array maps.
>>
>> Links:
>> [0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
>> [1] https://github.com/anakryiko/retsnoop
>> [2] https://github.com/Asphaltt/bpflbr
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> kernel/bpf/arraymap.c | 39 +++++++++++++-
>> kernel/bpf/verifier.c | 45 +++++++++++++++++
>> tools/lib/bpf/libbpf.c | 112 ++++++++++++++++++++++++++++++++---------
>> 3 files changed, 171 insertions(+), 25 deletions(-)
>>
>
> So I think the feature overall makes sense, but we need to think
> through at least libbpf's side of things some more. Unlike .data,
> per-cpu .data section is not mmapable, and so that has implication on
> BPF skeleton and we should make sure all that makes sense on BPF
> skeleton side. In that sense, per-cpu global data is more akin to
> struct_ops initialization image, which can be accessed by user before
> skeleton is loaded to initialize the image.
>
> There are a few things to consider. What's the BPF skeleton interface?
> Do we expose it as single struct and use that struct as initial image
> for each CPU (which means user won't be able to initialize different
> CPU data differently, at least not through BPF skeleton facilities)?
> Or do we expose this as an array of structs and let user set each CPU
> data independently?
>
> I feel like keeping it simple and having one image for all CPUs would
> cover most cases. And users can still access the underlying
> PERCPU_ARRAY map if they need more control.
Agree. It is necessary to keep it simple.
>
> But either way, we need tests for skeleton, making sure we NULL-out
> this per-cpu global data, but take it into account before the load.
>
> Also, this huge calloc for possible CPUs, I'd like to avoid it
> altogether for the (probably very common) zero-initialized case.
Ack.
>
> So in short, needs a bit of iteration to figure out all the
> interfacing issues, but makes sense overall. See some more low-level
> remarks below.
>
It is challenging to figure them out. I'll do my best to address them.
> pw-bot: cr
>
>
> [...]
>
>> @@ -815,6 +850,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
>> .map_get_next_key = array_map_get_next_key,
>> .map_lookup_elem = percpu_array_map_lookup_elem,
>> .map_gen_lookup = percpu_array_map_gen_lookup,
>> + .map_direct_value_addr = percpu_array_map_direct_value_addr,
>> + .map_direct_value_meta = percpu_array_map_direct_value_meta,
>> .map_update_elem = array_map_update_elem,
>> .map_delete_elem = array_map_delete_elem,
>> .map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index b8ca227c78af1..94ce02a48ddc1 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -6809,6 +6809,8 @@ static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val,
>> u64 addr;
>> int err;
>>
>> + if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
>> + return -EINVAL;
>
> I'd invert the condition and reject anything but BPF_MAP_TYPE_ARRAY. I
> don't see how any other possible map type can support this magic that
> we do with ARRAY. So to be on the safe side, let's just reject
> everything that's non-ARRAY (here and elsewhere)?
Ack.
>
>> err = map->ops->map_direct_value_addr(map, &addr, off);
>> if (err)
>> return err;
>
> [...]
>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 6c262d0152f81..881174f4f90a4 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>
> definitely let's split out libbpf changes from kernel-side changes
Ack.
>
>> @@ -516,6 +516,7 @@ struct bpf_struct_ops {
>> };
>>
>> #define DATA_SEC ".data"
>> +#define PERCPU_DATA_SEC ".data..percpu"
>
> I don't like this prefix, even if that's what we have in the kernel.
> Something like just ".percpu" or ".percpu_data" or ".data_percpu" is
> better, IMO.
I tested ".percpu". It is OK to use it. But we have to update "bpftool
gen" too, which relies on these section names.
Is it better to keep ".data" prefix, like ".data.percpu", ".data_percpu"?
Can keeping ".data" prefix reduce some works on bpftool, go-ebpf and
akin bpf loaders?
>
>> #define BSS_SEC ".bss"
>> #define RODATA_SEC ".rodata"
>> #define KCONFIG_SEC ".kconfig"
>> @@ -562,6 +563,8 @@ struct bpf_map {
>> __u32 btf_value_type_id;
>> __u32 btf_vmlinux_value_type_id;
>> enum libbpf_map_type libbpf_type;
>> + int num_cpus;
>> + void *data;
>> void *mmaped;
>> struct bpf_struct_ops *st_ops;
>> struct bpf_map *inner_map;
>> @@ -1923,11 +1926,35 @@ static bool map_is_mmapable(struct bpf_object *obj, struct bpf_map *map)
>> return false;
>> }
>>
>> +static bool map_is_percpu_data(struct bpf_map *map)
>> +{
>> + return str_has_pfx(map->real_name, PERCPU_DATA_SEC);
>> +}
>
> we have enum libbpf_map_type which is used to distinguish BSS vs
> RODATA vs DATA vs others, let's stick to that
>
Ack.
>> +
>> +static void map_copy_data(struct bpf_map *map, const void *data)
>> +{
>> + bool is_percpu_data = map_is_percpu_data(map);
>> + size_t data_sz = map->def.value_size;
>> + size_t elem_sz = roundup(data_sz, 8);
>> + int i;
>> +
>> + if (!data)
>> + return;
>> +
>> + if (!is_percpu_data)
>> + memcpy(map->mmaped, data, data_sz);
>> + else
>> + for (i = 0; i < map->num_cpus; i++)
>> + memcpy(map->data + i*elem_sz, data, data_sz);
>> +}
>> +
>> static int
>> bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>> const char *real_name, int sec_idx, void *data, size_t data_sz)
>> {
>> + bool is_percpu_data = str_has_pfx(real_name, PERCPU_DATA_SEC);
>> struct bpf_map_def *def;
>> + const char *data_desc;
>> struct bpf_map *map;
>> size_t mmap_sz;
>> int err;
>> @@ -1948,7 +1975,8 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>> }
>>
>> def = &map->def;
>> - def->type = BPF_MAP_TYPE_ARRAY;
>> + def->type = is_percpu_data ? BPF_MAP_TYPE_PERCPU_ARRAY
>> + : BPF_MAP_TYPE_ARRAY;
>> def->key_size = sizeof(int);
>> def->value_size = data_sz;
>> def->max_entries = 1;
>> @@ -1958,29 +1986,57 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>> /* failures are fine because of maps like .rodata.str1.1 */
>> (void) map_fill_btf_type_info(obj, map);
>>
>> - if (map_is_mmapable(obj, map))
>> - def->map_flags |= BPF_F_MMAPABLE;
>> + data_desc = is_percpu_data ? "percpu " : "";
>
> nit: just inline the logic in that single place you use it
>
Ack.
>> + pr_debug("map '%s' (global %sdata): at sec_idx %d, offset %zu, flags %x.\n",
>> + map->name, data_desc, map->sec_idx, map->sec_offset,
>> + def->map_flags);
>>
>> - pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
>> - map->name, map->sec_idx, map->sec_offset, def->map_flags);
>> + if (is_percpu_data) {
>> + map->num_cpus = libbpf_num_possible_cpus();
>> + if (map->num_cpus < 0) {
>> + err = errno;
>
> map->num_cpus should have actual error value already, avoid using
> errno unnecessarily
>
Ack.
>> + pr_warn("failed to get possible cpus\n");
>> + goto free_name;
>> + }
>>
>> - mmap_sz = bpf_map_mmap_sz(map);
>> - map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
>> - MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>> - if (map->mmaped == MAP_FAILED) {
>> - err = -errno;
>> - map->mmaped = NULL;
>> - pr_warn("failed to alloc map '%s' content buffer: %s\n", map->name, errstr(err));
>> - zfree(&map->real_name);
>> - zfree(&map->name);
>> - return err;
>> - }
>> + map->data = calloc(map->num_cpus, roundup(data_sz, 8));
>> + if (!map->data) {
>> + err = -ENOMEM;
>> + pr_warn("failed to alloc percpu map '%s' content buffer: %s\n",
>> + map->name, errstr(err));
>
> please stick to the established reporting style (i.e., for map operations we
> always use "map '%s': " prefix)
>
Ack.
>> + goto free_name;
>> + }
>
> [...]
>
>> @@ -10370,7 +10434,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
>> return map->st_ops->data;
>> }
>>
>> - if (!map->mmaped)
>> + if ((!map->mmaped && !map->data))
>
> nit: why unnecessary extra () ?
>
My bad. I'll remove it.
>
>> return NULL;
>>
>> if (map->def.type == BPF_MAP_TYPE_ARENA)
>> @@ -10378,7 +10442,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
>> else
>> *psize = map->def.value_size;
>>
>> - return map->mmaped;
>> + return map->def.type == BPF_MAP_TYPE_PERCPU_ARRAY ? map->data : map->mmaped;
>> }
>>
>> bool bpf_map__is_internal(const struct bpf_map *map)
>> --
>> 2.47.1
>>
Thanks,
Leon
* Re: [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data
2025-01-16 7:22 ` Leon Hwang
@ 2025-01-16 23:37 ` Andrii Nakryiko
2025-01-17 6:24 ` Leon Hwang
0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2025-01-16 23:37 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
On Wed, Jan 15, 2025 at 11:22 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> Hi,
>
> On 15/1/25 07:10, Andrii Nakryiko wrote:
> > On Mon, Jan 13, 2025 at 7:25 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> This patch introduces global per-CPU data, inspired by commit
> >> 6316f78306c1 ("Merge branch 'support-global-data'"). It enables the
> >> definition of global per-CPU variables in BPF, similar to the
> >> DEFINE_PER_CPU() macro in the kernel[0].
> >>
> >> For example, in BPF, a global per-CPU variable can be defined like
> >> this:
> >>
> >> int percpu_data SEC(".data..percpu");
> >>
> >> With this patch, tools like retsnoop[1] and bpflbr[2] can simplify their
> >> BPF code for handling LBRs. The code can be updated from
> >>
> >> static struct perf_branch_entry lbrs[1][MAX_LBR_ENTRIES] SEC(".data.lbrs");
> >>
> >> to
> >>
> >> static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");
> >>
> >> This eliminates the need to retrieve the CPU ID using the
> >> bpf_get_smp_processor_id() helper.
> >>
> >> Additionally, by reusing global per-CPU variables, sharing information
> >> between tail callers and callees or freplace callers and callees becomes
> >> simpler compared to using percpu_array maps.
> >>
> >> Links:
> >> [0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
> >> [1] https://github.com/anakryiko/retsnoop
> >> [2] https://github.com/Asphaltt/bpflbr
> >>
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> kernel/bpf/arraymap.c | 39 +++++++++++++-
> >> kernel/bpf/verifier.c | 45 +++++++++++++++++
> >> tools/lib/bpf/libbpf.c | 112 ++++++++++++++++++++++++++++++++---------
> >> 3 files changed, 171 insertions(+), 25 deletions(-)
> >>
> >
> > So I think the feature overall makes sense, but we need to think
> > through at least libbpf's side of things some more. Unlike .data,
> > per-cpu .data section is not mmapable, and so that has implication on
> > BPF skeleton and we should make sure all that makes sense on BPF
> > skeleton side. In that sense, per-cpu global data is more akin to
> > struct_ops initialization image, which can be accessed by user before
> > skeleton is loaded to initialize the image.
> >
> > There are a few things to consider. What's the BPF skeleton interface?
> > Do we expose it as single struct and use that struct as initial image
> > for each CPU (which means user won't be able to initialize different
> > CPU data differently, at least not through BPF skeleton facilities)?
> > Or do we expose this as an array of structs and let user set each CPU
> > data independently?
> >
> > I feel like keeping it simple and having one image for all CPUs would
> > cover most cases. And users can still access the underlying
> > PERCPU_ARRAY map if they need more control.
>
> Agree. It is necessary to keep it simple.
>
> >
> > But either way, we need tests for skeleton, making sure we NULL-out
> > this per-cpu global data, but take it into account before the load.
> >
> > Also, this huge calloc for possible CPUs, I'd like to avoid it
> > altogether for the (probably very common) zero-initialized case.
>
> Ack.
>
> >
> > So in short, needs a bit of iteration to figure out all the
> > interfacing issues, but makes sense overall. See some more low-level
> > remarks below.
> >
>
> It is challenging to figure them out. I'll do my best to address them.
>
> > pw-bot: cr
> >
> >
[...]
> >
> >> @@ -516,6 +516,7 @@ struct bpf_struct_ops {
> >> };
> >>
> >> #define DATA_SEC ".data"
> >> +#define PERCPU_DATA_SEC ".data..percpu"
> >
> > I don't like this prefix, even if that's what we have in the kernel.
> > Something like just ".percpu" or ".percpu_data" or ".data_percpu" is
> > better, IMO.
>
> I tested ".percpu". It is OK to use it. But we have to update "bpftool
> gen" too, which relies on these section names.
>
> Is it better to keep ".data" prefix, like ".data.percpu", ".data_percpu"?
> Can keeping ".data" prefix reduce some works on bpftool, go-ebpf and
> akin bpf loaders?
It's literally two lines of code in gen.c, and that should actually be
a common array of known prefixes. Even if someone uses this new
.percpu section with old bpftool nothing will break, they just won't
have structure representing the initial per-CPU image. They will still
have the generic map pointer in the skeleton. So I think this is
acceptable.
I'd definitely go with a simple and less error-prone ".percpu" prefix.
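Something along these lines, perhaps (a sketch; the exact prefix set
and helper name are illustrative and would need to match what bpftool
special-cases today):

static const char *skel_datasec_prefixes[] = {
	".data",
	".rodata",
	".bss",
	".kconfig",
	".percpu",
};

static bool is_skel_datasec(const char *sec_name)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(skel_datasec_prefixes); i++) {
		if (str_has_pfx(sec_name, skel_datasec_prefixes[i]))
			return true;
	}
	return false;
}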
[...]
* Re: [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data
2025-01-16 23:37 ` Andrii Nakryiko
@ 2025-01-17 6:24 ` Leon Hwang
0 siblings, 0 replies; 9+ messages in thread
From: Leon Hwang @ 2025-01-17 6:24 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, ast, daniel, andrii, yonghong.song, song, eddyz87,
kernel-patches-bot
On 17/1/25 07:37, Andrii Nakryiko wrote:
> On Wed, Jan 15, 2025 at 11:22 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> Hi,
>>
>> On 15/1/25 07:10, Andrii Nakryiko wrote:
>>> On Mon, Jan 13, 2025 at 7:25 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>>>
[...]
>>>
>>> So I think the feature overall makes sense, but we need to think
>>> through at least libbpf's side of things some more. Unlike .data,
>>> per-cpu .data section is not mmapable, and so that has implication on
>>> BPF skeleton and we should make sure all that makes sense on BPF
>>> skeleton side. In that sense, per-cpu global data is more akin to
>>> struct_ops initialization image, which can be accessed by user before
>>> skeleton is loaded to initialize the image.
>>>
>>> There are a few things to consider. What's the BPF skeleton interface?
>>> Do we expose it as single struct and use that struct as initial image
>>> for each CPU (which means user won't be able to initialize different
>>> CPU data differently, at least not through BPF skeleton facilities)?
>>> Or do we expose this as an array of structs and let user set each CPU
>>> data independently?
>>>
>>> I feel like keeping it simple and having one image for all CPUs would
>>> cover most cases. And users can still access the underlying
>>> PERCPU_ARRAY map if they need more control.
>>
>> Agree. It is necessary to keep it simple.
>>
>>>
>>> But either way, we need tests for skeleton, making sure we NULL-out
>>> this per-cpu global data, but take it into account before the load.
>>>
>>> Also, this huge calloc for possible CPUs, I'd like to avoid it
>>> altogether for the (probably very common) zero-initialized case.
>>
>> Ack.
>>
>>>
>>> So in short, needs a bit of iteration to figure out all the
>>> interfacing issues, but makes sense overall. See some more low-level
>>> remarks below.
>>>
>>
>> It is challenging to figure them out. I'll do my best to address them.
>>
>>> pw-bot: cr
>>>
>>>
>
> [...]
>
>>>
>>>> @@ -516,6 +516,7 @@ struct bpf_struct_ops {
>>>> };
>>>>
>>>> #define DATA_SEC ".data"
>>>> +#define PERCPU_DATA_SEC ".data..percpu"
>>>
>>> I don't like this prefix, even if that's what we have in the kernel.
>>> Something like just ".percpu" or ".percpu_data" or ".data_percpu" is
>>> better, IMO.
>>
>> I tested ".percpu". It is OK to use it. But we have to update "bpftool
>> gen" too, which relies on these section names.
>>
>> Is it better to keep ".data" prefix, like ".data.percpu", ".data_percpu"?
>> Can keeping ".data" prefix reduce some works on bpftool, go-ebpf and
>> akin bpf loaders?
>
> It's literally two lines of code in gen.c, and that should actually be
> a common array of known prefixes. Even if someone uses this new
> .percpu section with old bpftool nothing will break, they just won't
> have structure representing the initial per-CPU image. They will still
> have the generic map pointer in the skeleton. So I think this is
> acceptable.
>
> I'd definitely go with a simple and less error-prone ".percpu" prefix.
>
Being simple and less error-prone is indeed important. Let's proceed
with the ".percpu" prefix as suggested.
Thanks,
Leon
end of thread
Thread overview: 9+ messages
2025-01-13 15:24 [RFC PATCH bpf-next 0/2] bpf: Introduce global percpu data Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
2025-01-14 23:10 ` Andrii Nakryiko
2025-01-16 7:22 ` Leon Hwang
2025-01-16 23:37 ` Andrii Nakryiko
2025-01-17 6:24 ` Leon Hwang
2025-01-13 15:24 ` [RFC PATCH bpf-next 2/2] selftests/bpf: Add a case to test " Leon Hwang
2025-01-13 16:58 ` [RFC PATCH bpf-next 0/2] bpf: Introduce " Daniel Xu
2025-01-14 6:35 ` Leon Hwang