[PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu

BPF List
 help / color / mirror / Atom feed

* [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma
@ 2023-12-15  0:11 Yonghong Song
  2023-12-15  0:11 ` [PATCH bpf-next v2 1/6] bpf: Refactor to have a memalloc cache destroying function Yonghong Song
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:11 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Currently when a bpf program intends to allocate memory for percpu kptr,
the verifier will call bpf_mem_alloc_init() to prefill all supported
unit sizes and this caused memory consumption very big for large number
of cpus. For example, for 128-cpu system, the total memory consumption
with initial prefill is ~175MB. Things will become worse for systems
with even more cpus.

Patch 1 is a preparatory patch.
Patch 2 avoids unnecessary extra percpu memory allocation.
Patch 3 addresses memory consumption issue by avoiding to prefill
with all unit sizes, i.e. only prefilling with user specified size.
Patch 4 further reduces memory consumption by limiting the
number of prefill entries for percpu memory allocation.
Patch 5 rejects percpu memory allocation with bpf_global_percpu_ma
when allocation size is greater than 512 bytes.
Patch 6 fixed one test due to Patch 5 and added one test to
show the verification failure log message.

Changelogs:
  v1 -> v2:
    . Avoid unnecessary extra percpu memory allocation.
    . Add a separate function to do bpf_global_percpu_ma initialization
    . promote.
    . Promote function static 'sizes' array to file static.
    . Add comments to explain to refill only one item for percpu alloc.

Yonghong Song (6):
  bpf: Refactor to have a memalloc cache destroying function
  bpf: Avoid unnecessary extra percpu memory allocation
  bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  bpf: Refill only one percpu element in memalloc
  bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
  selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma

 include/linux/bpf.h                           |   2 +-
 include/linux/bpf_mem_alloc.h                 |   7 ++
 kernel/bpf/core.c                             |   8 +-
 kernel/bpf/memalloc.c                         | 100 +++++++++++++++---
 kernel/bpf/verifier.c                         |  36 ++++---
 .../selftests/bpf/progs/percpu_alloc_fail.c   |  18 ++++
 .../testing/selftests/bpf/progs/test_bpf_ma.c |   9 --
 7 files changed, 138 insertions(+), 42 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 1/6] bpf: Refactor to have a memalloc cache destroying function
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
@ 2023-12-15  0:11 ` Yonghong Song
  2023-12-15  0:12 ` [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:11 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

The function, named as bpf_mem_alloc_destroy_cache(), will be used
in the subsequent patch.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 6a51cfe4c2d6..75068167e745 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -618,6 +618,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	free_all(llist_del_all(&c->waiting_for_gp), percpu);
 }
 
+static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
+{
+	WRITE_ONCE(c->draining, true);
+	irq_work_sync(&c->refill_work);
+	drain_mem_cache(c);
+}
+
 static void check_mem_cache(struct bpf_mem_cache *c)
 {
 	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
@@ -723,9 +730,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
-			WRITE_ONCE(c->draining, true);
-			irq_work_sync(&c->refill_work);
-			drain_mem_cache(c);
+			bpf_mem_alloc_destroy_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
@@ -740,9 +745,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
 				c = &cc->cache[i];
-				WRITE_ONCE(c->draining, true);
-				irq_work_sync(&c->refill_work);
-				drain_mem_cache(c);
+				bpf_mem_alloc_destroy_cache(c);
 				rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
  2023-12-15  0:11 ` [PATCH bpf-next v2 1/6] bpf: Refactor to have a memalloc cache destroying function Yonghong Song
@ 2023-12-15  0:12 ` Yonghong Song
  2023-12-15  3:40   ` Hou Tao
  2023-12-15  0:12 ` [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:12 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.

Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.

Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 75068167e745..472158f1fb08 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -874,7 +874,9 @@ void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
 	if (!size)
 		return ZERO_SIZE_PTR;
 
-	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
+	if (!ma->percpu)
+		size += LLIST_NODE_SZ;
+	idx = bpf_mem_cache_idx(size);
 	if (idx < 0)
 		return NULL;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation
  2023-12-15  0:12 ` [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
@ 2023-12-15  3:40   ` Hou Tao
  0 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-12-15  3:40 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau



On 12/15/2023 8:12 AM, Yonghong Song wrote:
> Currently, for percpu memory allocation, say if the user
> requests allocation size to be 32 bytes, the actually
> calculated size will be 40 bytes and it further rounds
> to 64 bytes, and eventually 64 bytes are allocated,
> wasting 32-byte memory.
>
> Change bpf_mem_alloc() to calculate the cache index
> based on the user-provided allocation size so unnecessary
> extra memory can be avoided.
>
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>

Acked-by: Hou Tao <houtao1@huawei.com>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
  2023-12-15  0:11 ` [PATCH bpf-next v2 1/6] bpf: Refactor to have a memalloc cache destroying function Yonghong Song
  2023-12-15  0:12 ` [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
@ 2023-12-15  0:12 ` Yonghong Song
  2023-12-15  2:45   ` Yonghong Song
  2023-12-15  3:19   ` Hou Tao
  2023-12-15  0:12 ` [PATCH bpf-next v2 4/6] bpf: Refill only one percpu element in memalloc Yonghong Song
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:12 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus and the memory consumption is in the order to quadratic.
For example, let us say, 4 cpus, unit size 16 bytes, so each
cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
Then let us say, 8 cpus with the same unit size, each cpu
has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the number of memory consumption
will be 4 times. So for a system with large number of cpus, the
memory consumption goes up quickly with quadratic order.
For example, for 4KB percpu allocation, 128 cpus. The total memory
consumption will 4KB * 128 * 128 = 64MB. Things will become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.)

In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
done in boot time, so for system with large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
  * 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.

Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.

To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As typically users will not use large
percpu data structure, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h           |  2 +-
 include/linux/bpf_mem_alloc.h |  7 ++++
 kernel/bpf/core.c             |  8 +++--
 kernel/bpf/memalloc.c         | 68 ++++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c         | 28 ++++++---------
 5 files changed, 91 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c87c608a3689..f1f16449fbc4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -60,7 +60,7 @@ extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
 extern struct kobject *btf_kobj;
 extern struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
-extern bool bpf_global_ma_set;
+extern bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);
 typedef int (*bpf_iter_init_seq_priv_t)(void *private_data,
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index bb1223b21308..43e635c67150 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -21,8 +21,15 @@ struct bpf_mem_alloc {
  * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
  * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
  * the returned object is given by the size argument of bpf_mem_alloc().
+ * If percpu equals true, error will be returned in order to avoid
+ * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
+ * should be used to do on-demand per-cpu allocation for each size.
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+/* Initialize a non-fix-size percpu memory allocator */
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
+/* The percpu allocation with a specific unit size. */
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c34513d645c4..4a9177770f93 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -64,8 +64,8 @@
 #define OFF	insn->off
 #define IMM	insn->imm
 
-struct bpf_mem_alloc bpf_global_ma;
-bool bpf_global_ma_set;
+struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
+bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 /* No hurry in this branch
  *
@@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
 
 	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
 	bpf_global_ma_set = !ret;
-	return ret;
+	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
+	bpf_global_percpu_ma_set = !ret;
+	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
 }
 late_initcall(bpf_global_ma_init);
 #endif
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 472158f1fb08..aea4cd07c7b6 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -121,6 +121,8 @@ struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
-	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	int cpu, i, err, unit_size, percpu_size = 0;
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 
+	if (percpu && size == 0)
+		return -EINVAL;
+
 	/* room for llist_node and per-cpu pointer */
 	if (percpu)
 		percpu_size = LLIST_NODE_SZ + sizeof(void *);
@@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
 	drain_mem_cache(c);
 }
 
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches __percpu *pcc;
+
+	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
+	if (!pcc)
+		return -ENOMEM;
+
+	ma->caches = pcc;
+	ma->percpu = true;
+	return 0;
+}
+
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	int cpu, i, err, unit_size, percpu_size = 0;
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct obj_cgroup *objcg = NULL;
+	struct bpf_mem_cache *c;
+
+	/* room for llist_node and per-cpu pointer */
+	percpu_size = LLIST_NODE_SZ + sizeof(void *);
+
+	i = bpf_mem_cache_idx(size);
+	if (i < 0)
+		return -EINVAL;
+
+	err = 0;
+	pcc = ma->caches;
+	unit_size = sizes[i];
+
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		c = &cc->cache[i];
+		if (cpu == 0 && c->unit_size)
+			goto out;
+
+		c->unit_size = unit_size;
+		c->objcg = objcg;
+		c->percpu_size = percpu_size;
+		c->tgt = c;
+
+		init_refill_work(c);
+		prefill_mem_cache(c, cpu);
+
+		if (cpu == 0) {
+			err = check_obj_size(c, i);
+			if (err) {
+				bpf_mem_alloc_destroy_cache(c);
+				goto out;
+			}
+		}
+	}
+
+out:
+	return err;
+}
+
 static void check_mem_cache(struct bpf_mem_cache *c)
 {
 	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1863826a4ac3..ce62ee0cc8f6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -42,9 +42,6 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
 #undef BPF_LINK_TYPE
 };
 
-struct bpf_mem_alloc bpf_global_percpu_ma;
-static bool bpf_global_percpu_ma_set;
-
 /* bpf_check() is a static code analyzer that walks eBPF program
  * instruction by instruction and updates register/stack state.
  * All paths of conditional branches are analyzed until 'bpf_exit' insn.
@@ -12062,20 +12059,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 				if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
 					return -ENOMEM;
 
-				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
-					if (!bpf_global_percpu_ma_set) {
-						mutex_lock(&bpf_percpu_ma_lock);
-						if (!bpf_global_percpu_ma_set) {
-							err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
-							if (!err)
-								bpf_global_percpu_ma_set = true;
-						}
-						mutex_unlock(&bpf_percpu_ma_lock);
-						if (err)
-							return err;
-					}
-				}
-
 				if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
 					verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
 					return -EINVAL;
@@ -12096,6 +12079,17 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 					return -EINVAL;
 				}
 
+				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
+					if (!bpf_global_percpu_ma_set)
+						return -ENOMEM;
+
+					mutex_lock(&bpf_percpu_ma_lock);
+					err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
+					mutex_unlock(&bpf_percpu_ma_lock);
+					if (err)
+						return err;
+				}
+
 				struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
 				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 					if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  0:12 ` [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
@ 2023-12-15  2:45   ` Yonghong Song
  2023-12-15  3:19   ` Hou Tao
  1 sibling, 0 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  2:45 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/14/23 4:12 PM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation.
> Such allocation will allocate percpu memory for all buckets on all
> cpus and the memory consumption is in the order to quadratic.
> For example, let us say, 4 cpus, unit size 16 bytes, so each
> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
> Then let us say, 8 cpus with the same unit size, each cpu
> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
> So if the number of cpus doubles, the number of memory consumption
> will be 4 times. So for a system with large number of cpus, the
> memory consumption goes up quickly with quadratic order.
> For example, for 4KB percpu allocation, 128 cpus. The total memory
> consumption will 4KB * 128 * 128 = 64MB. Things will become
> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>
> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
> done in boot time, so for system with large number of cpus, the initial
> percpu memory consumption is very visible. For example, for 128 cpu
> system, the total percpu memory allocation will be at least
> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>    * 128 * 128 = ~138MB.
> which is pretty big. It will be even bigger for larger number of cpus.
>
> Note that the current prefill also allocates 4 entries if the unit size
> is less than 256. So on top of 138MB memory consumption, this will
> add more consumption with
> 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
> Next patch will try to reduce this memory consumption.
>
> Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory
> at init stage") moved the non-fix-size percpu memory allocation
> to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
> is called by bpf program, the memory allocator will try to fill in
> the cache with all sizes, causing the same amount of percpu memory
> consumption as in the boot stage.
>
> To reduce the initial percpu memory consumption for non-fix-size
> percpu memory allocation, instead of filling the cache with all
> supported allocation sizes, this patch intends to fill the cache
> only for the requested size. As typically users will not use large
> percpu data structure, this can save memory significantly.
> For example, the allocation size is 64 bytes with 128 cpus.
> Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
> much less than previous 138MB.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>   include/linux/bpf.h           |  2 +-
>   include/linux/bpf_mem_alloc.h |  7 ++++
>   kernel/bpf/core.c             |  8 +++--
>   kernel/bpf/memalloc.c         | 68 ++++++++++++++++++++++++++++++++++-
>   kernel/bpf/verifier.c         | 28 ++++++---------
>   5 files changed, 91 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c87c608a3689..f1f16449fbc4 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -60,7 +60,7 @@ extern struct idr btf_idr;
>   extern spinlock_t btf_idr_lock;
>   extern struct kobject *btf_kobj;
>   extern struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
> -extern bool bpf_global_ma_set;
> +extern bool bpf_global_ma_set, bpf_global_percpu_ma_set;
>   
>   typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);
>   typedef int (*bpf_iter_init_seq_priv_t)(void *private_data,
> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> index bb1223b21308..43e635c67150 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -21,8 +21,15 @@ struct bpf_mem_alloc {
>    * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
>    * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
>    * the returned object is given by the size argument of bpf_mem_alloc().
> + * If percpu equals true, error will be returned in order to avoid
> + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
> + * should be used to do on-demand per-cpu allocation for each size.
>    */
>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
> +/* Initialize a non-fix-size percpu memory allocator */
> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
> +/* The percpu allocation with a specific unit size. */
> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
>   void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
>   
>   /* kmalloc/kfree equivalent: */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index c34513d645c4..4a9177770f93 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -64,8 +64,8 @@
>   #define OFF	insn->off
>   #define IMM	insn->imm
>   
> -struct bpf_mem_alloc bpf_global_ma;
> -bool bpf_global_ma_set;
> +struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
> +bool bpf_global_ma_set, bpf_global_percpu_ma_set;
>   
>   /* No hurry in this branch
>    *
> @@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
>   
>   	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
>   	bpf_global_ma_set = !ret;
> -	return ret;
> +	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
> +	bpf_global_percpu_ma_set = !ret;
> +	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
>   }
>   late_initcall(bpf_global_ma_init);
>   #endif
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 472158f1fb08..aea4cd07c7b6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -121,6 +121,8 @@ struct bpf_mem_caches {
>   	struct bpf_mem_cache cache[NUM_CACHES];
>   };
>   
> +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> +
>   static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>   {
>   	struct llist_node *entry, *next;
> @@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
>    */
>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>   {
> -	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
>   	int cpu, i, err, unit_size, percpu_size = 0;
>   	struct bpf_mem_caches *cc, __percpu *pcc;
>   	struct bpf_mem_cache *c, __percpu *pc;
>   	struct obj_cgroup *objcg = NULL;
>   
> +	if (percpu && size == 0)
> +		return -EINVAL;
> +
>   	/* room for llist_node and per-cpu pointer */
>   	if (percpu)
>   		percpu_size = LLIST_NODE_SZ + sizeof(void *);
> @@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
>   	drain_mem_cache(c);
>   }
>   
> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
> +{
> +	struct bpf_mem_caches __percpu *pcc;
> +
> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
> +	if (!pcc)
> +		return -ENOMEM;
> +
> +	ma->caches = pcc;
> +	ma->percpu = true;
> +	return 0;
> +}
> +
> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
> +{
> +	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};

Sorry, a oversight here. The above should be removed. Will fix in the next revision.

> +	int cpu, i, err, unit_size, percpu_size = 0;
> +	struct bpf_mem_caches *cc, __percpu *pcc;
> +	struct obj_cgroup *objcg = NULL;
> +	struct bpf_mem_cache *c;
> +
> +	/* room for llist_node and per-cpu pointer */
> +	percpu_size = LLIST_NODE_SZ + sizeof(void *);
> +
> +	i = bpf_mem_cache_idx(size);
> +	if (i < 0)
> +		return -EINVAL;
> +
> +	err = 0;
> +	pcc = ma->caches;
> +	unit_size = sizes[i];
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +	objcg = get_obj_cgroup_from_current();
> +#endif
> +	for_each_possible_cpu(cpu) {
> +		cc = per_cpu_ptr(pcc, cpu);
> +		c = &cc->cache[i];
> +		if (cpu == 0 && c->unit_size)
> +			goto out;
> +
> +		c->unit_size = unit_size;
> +		c->objcg = objcg;
> +		c->percpu_size = percpu_size;
> +		c->tgt = c;
> +
> +		init_refill_work(c);
> +		prefill_mem_cache(c, cpu);
> +
> +		if (cpu == 0) {
> +			err = check_obj_size(c, i);
> +			if (err) {
> +				bpf_mem_alloc_destroy_cache(c);
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return err;
> +}
> +
>   static void check_mem_cache(struct bpf_mem_cache *c)
>   {
>   	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
[...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  0:12 ` [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
  2023-12-15  2:45   ` Yonghong Song
@ 2023-12-15  3:19   ` Hou Tao
  2023-12-15  6:50     ` Yonghong Song
  1 sibling, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-12-15  3:19 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau



On 12/15/2023 8:12 AM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation.
> Such allocation will allocate percpu memory for all buckets on all
> cpus and the memory consumption is in the order to quadratic.
> For example, let us say, 4 cpus, unit size 16 bytes, so each
> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
> Then let us say, 8 cpus with the same unit size, each cpu
> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
> So if the number of cpus doubles, the number of memory consumption
> will be 4 times. So for a system with large number of cpus, the
> memory consumption goes up quickly with quadratic order.
> For example, for 4KB percpu allocation, 128 cpus. The total memory
> consumption will 4KB * 128 * 128 = 64MB. Things will become
> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>
> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
> done in boot time, so for system with large number of cpus, the initial
> percpu memory consumption is very visible. For example, for 128 cpu
> system, the total percpu memory allocation will be at least
> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>   * 128 * 128 = ~138MB.
> which is pretty big. It will be even bigger for larger number of cpus.
>
SNIP
> index bb1223b21308..43e635c67150 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -21,8 +21,15 @@ struct bpf_mem_alloc {
>   * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
>   * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
>   * the returned object is given by the size argument of bpf_mem_alloc().
> + * If percpu equals true, error will be returned in order to avoid
> + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
> + * should be used to do on-demand per-cpu allocation for each size.
>   */
>  int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
> +/* Initialize a non-fix-size percpu memory allocator */
> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
> +/* The percpu allocation with a specific unit size. */
> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
>  void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
>  
>  /* kmalloc/kfree equivalent: */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index c34513d645c4..4a9177770f93 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -64,8 +64,8 @@
>  #define OFF	insn->off
>  #define IMM	insn->imm
>  
> -struct bpf_mem_alloc bpf_global_ma;
> -bool bpf_global_ma_set;
> +struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
> +bool bpf_global_ma_set, bpf_global_percpu_ma_set;
>  
>  /* No hurry in this branch
>   *
> @@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
>  
>  	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
>  	bpf_global_ma_set = !ret;
> -	return ret;
> +	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
> +	bpf_global_percpu_ma_set = !ret;
> +	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
>  }
>  late_initcall(bpf_global_ma_init);
>  #endif
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 472158f1fb08..aea4cd07c7b6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -121,6 +121,8 @@ struct bpf_mem_caches {
>  	struct bpf_mem_cache cache[NUM_CACHES];
>  };
>  
> +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};

Is it better to make it being const ?
> +
>  static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>  {
>  	struct llist_node *entry, *next;
> @@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
>   */
>  int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>  {
> -	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
>  	int cpu, i, err, unit_size, percpu_size = 0;
>  	struct bpf_mem_caches *cc, __percpu *pcc;
>  	struct bpf_mem_cache *c, __percpu *pc;
>  	struct obj_cgroup *objcg = NULL;
>  
> +	if (percpu && size == 0)
> +		return -EINVAL;
> +
>  	/* room for llist_node and per-cpu pointer */
>  	if (percpu)
>  		percpu_size = LLIST_NODE_SZ + sizeof(void *);
> @@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
>  	drain_mem_cache(c);
>  }
>  
> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
> +{
> +	struct bpf_mem_caches __percpu *pcc;
> +
> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
> +	if (!pcc)
> +		return -ENOMEM;

__GFP_ZERO is not needed. __alloc_percpu_gfp() will zero the returned
area by default.
> +
> +	ma->caches = pcc;
> +	ma->percpu = true;
> +	return 0;
> +}
> +
> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
> +{
> +	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> +	int cpu, i, err, unit_size, percpu_size = 0;
> +	struct bpf_mem_caches *cc, __percpu *pcc;
> +	struct obj_cgroup *objcg = NULL;
> +	struct bpf_mem_cache *c;
> +
> +	/* room for llist_node and per-cpu pointer */
> +	percpu_size = LLIST_NODE_SZ + sizeof(void *);
> +
> +	i = bpf_mem_cache_idx(size);
> +	if (i < 0)
> +		return -EINVAL;
> +
> +	err = 0;
> +	pcc = ma->caches;
> +	unit_size = sizes[i];
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +	objcg = get_obj_cgroup_from_current();
> +#endif
> +	for_each_possible_cpu(cpu) {
> +		cc = per_cpu_ptr(pcc, cpu);
> +		c = &cc->cache[i];
> +		if (cpu == 0 && c->unit_size)
> +			goto out;
> +
> +		c->unit_size = unit_size;
> +		c->objcg = objcg;
> +		c->percpu_size = percpu_size;
> +		c->tgt = c;
> +
> +		init_refill_work(c);
> +		prefill_mem_cache(c, cpu);
> +
> +		if (cpu == 0) {
> +			err = check_obj_size(c, i);
> +			if (err) {
> +				bpf_mem_alloc_destroy_cache(c);

It seems drain_mem_cache() will be enough. Have you considered setting
low_watermark as 0 to prevent potential refill in unit_alloc() if the
initialization of the current unit fails ?
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return err;
> +}
> +
>  static void check_mem_cache(struct bpf_mem_cache *c)
>  {
>  	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
>
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  3:19   ` Hou Tao
@ 2023-12-15  6:50     ` Yonghong Song
  2023-12-15  7:27       ` Yonghong Song
  0 siblings, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  6:50 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/14/23 7:19 PM, Hou Tao wrote:
>
> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
>> added support for non-fix-size percpu memory allocation.
>> Such allocation will allocate percpu memory for all buckets on all
>> cpus and the memory consumption is in the order to quadratic.
>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
>> Then let us say, 8 cpus with the same unit size, each cpu
>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
>> So if the number of cpus doubles, the number of memory consumption
>> will be 4 times. So for a system with large number of cpus, the
>> memory consumption goes up quickly with quadratic order.
>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>
>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>> done in boot time, so for system with large number of cpus, the initial
>> percpu memory consumption is very visible. For example, for 128 cpu
>> system, the total percpu memory allocation will be at least
>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>    * 128 * 128 = ~138MB.
>> which is pretty big. It will be even bigger for larger number of cpus.
>>
> SNIP
>> index bb1223b21308..43e635c67150 100644
>> --- a/include/linux/bpf_mem_alloc.h
>> +++ b/include/linux/bpf_mem_alloc.h
>> @@ -21,8 +21,15 @@ struct bpf_mem_alloc {
>>    * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
>>    * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
>>    * the returned object is given by the size argument of bpf_mem_alloc().
>> + * If percpu equals true, error will be returned in order to avoid
>> + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
>> + * should be used to do on-demand per-cpu allocation for each size.
>>    */
>>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
>> +/* Initialize a non-fix-size percpu memory allocator */
>> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
>> +/* The percpu allocation with a specific unit size. */
>> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
>>   void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
>>   
>>   /* kmalloc/kfree equivalent: */
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index c34513d645c4..4a9177770f93 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -64,8 +64,8 @@
>>   #define OFF	insn->off
>>   #define IMM	insn->imm
>>   
>> -struct bpf_mem_alloc bpf_global_ma;
>> -bool bpf_global_ma_set;
>> +struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
>> +bool bpf_global_ma_set, bpf_global_percpu_ma_set;
>>   
>>   /* No hurry in this branch
>>    *
>> @@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
>>   
>>   	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
>>   	bpf_global_ma_set = !ret;
>> -	return ret;
>> +	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>> +	bpf_global_percpu_ma_set = !ret;
>> +	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
>>   }
>>   late_initcall(bpf_global_ma_init);
>>   #endif
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 472158f1fb08..aea4cd07c7b6 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -121,6 +121,8 @@ struct bpf_mem_caches {
>>   	struct bpf_mem_cache cache[NUM_CACHES];
>>   };
>>   
>> +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> Is it better to make it being const ?

Right. We can make it as const.

>> +
>>   static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>>   {
>>   	struct llist_node *entry, *next;
>> @@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
>>    */
>>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>>   {
>> -	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
>>   	int cpu, i, err, unit_size, percpu_size = 0;
>>   	struct bpf_mem_caches *cc, __percpu *pcc;
>>   	struct bpf_mem_cache *c, __percpu *pc;
>>   	struct obj_cgroup *objcg = NULL;
>>   
>> +	if (percpu && size == 0)
>> +		return -EINVAL;
>> +
>>   	/* room for llist_node and per-cpu pointer */
>>   	if (percpu)
>>   		percpu_size = LLIST_NODE_SZ + sizeof(void *);
>> @@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
>>   	drain_mem_cache(c);
>>   }
>>   
>> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
>> +{
>> +	struct bpf_mem_caches __percpu *pcc;
>> +
>> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
>> +	if (!pcc)
>> +		return -ENOMEM;
> __GFP_ZERO is not needed. __alloc_percpu_gfp() will zero the returned
> area by default.

Thanks. Checked the comments in __alloc_percpu_gfp() and indeed, the returned
buffer has been zeroed.

>> +
>> +	ma->caches = pcc;
>> +	ma->percpu = true;
>> +	return 0;
>> +}
>> +
>> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
>> +{
>> +	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
>> +	int cpu, i, err, unit_size, percpu_size = 0;
>> +	struct bpf_mem_caches *cc, __percpu *pcc;
>> +	struct obj_cgroup *objcg = NULL;
>> +	struct bpf_mem_cache *c;
>> +
>> +	/* room for llist_node and per-cpu pointer */
>> +	percpu_size = LLIST_NODE_SZ + sizeof(void *);
>> +
>> +	i = bpf_mem_cache_idx(size);
>> +	if (i < 0)
>> +		return -EINVAL;
>> +
>> +	err = 0;
>> +	pcc = ma->caches;
>> +	unit_size = sizes[i];
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	objcg = get_obj_cgroup_from_current();
>> +#endif
>> +	for_each_possible_cpu(cpu) {
>> +		cc = per_cpu_ptr(pcc, cpu);
>> +		c = &cc->cache[i];
>> +		if (cpu == 0 && c->unit_size)
>> +			goto out;
>> +
>> +		c->unit_size = unit_size;
>> +		c->objcg = objcg;
>> +		c->percpu_size = percpu_size;
>> +		c->tgt = c;
>> +
>> +		init_refill_work(c);
>> +		prefill_mem_cache(c, cpu);
>> +
>> +		if (cpu == 0) {
>> +			err = check_obj_size(c, i);
>> +			if (err) {
>> +				bpf_mem_alloc_destroy_cache(c);
> It seems drain_mem_cache() will be enough. Have you considered setting

At prefill stage, looks like the following is enough:
     free_all(__llist_del_all(&c->free_llist), percpu);
But I agree that drain_mem_cache() is simpler and is
easier for future potential code change.

> low_watermark as 0 to prevent potential refill in unit_alloc() if the
> initialization of the current unit fails ?

I think it does make sense. For non-fix-size non-percpu prefill,
if check_obj_size() failed, the prefill will fail, which include
all buckets.

In this case, if it fails for a particular bucket, we should
make sure that bucket always return NULL ptr, so setting the
low_watermark to 0 does make sense.

>> +				goto out;
>> +			}
>> +		}
>> +	}
>> +
>> +out:
>> +	return err;
>> +}
>> +
>>   static void check_mem_cache(struct bpf_mem_cache *c)
>>   {
>>   	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
>>
> .
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  6:50     ` Yonghong Song
@ 2023-12-15  7:27       ` Yonghong Song
  2023-12-15  7:40         ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  7:27 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/14/23 10:50 PM, Yonghong Song wrote:
>
> On 12/14/23 7:19 PM, Hou Tao wrote:
>>
>> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem 
>>> allocation")
>>> added support for non-fix-size percpu memory allocation.
>>> Such allocation will allocate percpu memory for all buckets on all
>>> cpus and the memory consumption is in the order to quadratic.
>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 
>>> bytes.
>>> Then let us say, 8 cpus with the same unit size, each cpu
>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 
>>> bytes.
>>> So if the number of cpus doubles, the number of memory consumption
>>> will be 4 times. So for a system with large number of cpus, the
>>> memory consumption goes up quickly with quadratic order.
>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>>
>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>> done in boot time, so for system with large number of cpus, the initial
>>> percpu memory consumption is very visible. For example, for 128 cpu
>>> system, the total percpu memory allocation will be at least
>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>>    * 128 * 128 = ~138MB.
>>> which is pretty big. It will be even bigger for larger number of cpus.
>>>
>> SNIP
>>> index bb1223b21308..43e635c67150 100644
>>> --- a/include/linux/bpf_mem_alloc.h
>>> +++ b/include/linux/bpf_mem_alloc.h
>>> @@ -21,8 +21,15 @@ struct bpf_mem_alloc {
>>>    * 'size = 0' is for bpf_mem_alloc which manages many fixed-size 
>>> objects.
>>>    * Alloc and free are done with bpf_mem_{alloc,free}() and the 
>>> size of
>>>    * the returned object is given by the size argument of 
>>> bpf_mem_alloc().
>>> + * If percpu equals true, error will be returned in order to avoid
>>> + * large memory consumption and the below 
>>> bpf_mem_alloc_percpu_unit_init()
>>> + * should be used to do on-demand per-cpu allocation for each size.
>>>    */
>>>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool 
>>> percpu);
>>> +/* Initialize a non-fix-size percpu memory allocator */
>>> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
>>> +/* The percpu allocation with a specific unit size. */
>>> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int 
>>> size);
>>>   void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
>>>     /* kmalloc/kfree equivalent: */
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index c34513d645c4..4a9177770f93 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -64,8 +64,8 @@
>>>   #define OFF    insn->off
>>>   #define IMM    insn->imm
>>>   -struct bpf_mem_alloc bpf_global_ma;
>>> -bool bpf_global_ma_set;
>>> +struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
>>> +bool bpf_global_ma_set, bpf_global_percpu_ma_set;
>>>     /* No hurry in this branch
>>>    *
>>> @@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
>>>         ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
>>>       bpf_global_ma_set = !ret;
>>> -    return ret;
>>> +    ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>>> +    bpf_global_percpu_ma_set = !ret;
>>> +    return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
>>>   }
>>>   late_initcall(bpf_global_ma_init);
>>>   #endif
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 472158f1fb08..aea4cd07c7b6 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -121,6 +121,8 @@ struct bpf_mem_caches {
>>>       struct bpf_mem_cache cache[NUM_CACHES];
>>>   };
>>>   +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 
>>> 512, 1024, 2048, 4096};
>> Is it better to make it being const ?
>
> Right. We can make it as const.
>
>>> +
>>>   static struct llist_node notrace *__llist_del_first(struct 
>>> llist_head *head)
>>>   {
>>>       struct llist_node *entry, *next;
>>> @@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache 
>>> *c, unsigned int idx)
>>>    */
>>>   int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool 
>>> percpu)
>>>   {
>>> -    static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 
>>> 512, 1024, 2048, 4096};
>>>       int cpu, i, err, unit_size, percpu_size = 0;
>>>       struct bpf_mem_caches *cc, __percpu *pcc;
>>>       struct bpf_mem_cache *c, __percpu *pc;
>>>       struct obj_cgroup *objcg = NULL;
>>>   +    if (percpu && size == 0)
>>> +        return -EINVAL;
>>> +
>>>       /* room for llist_node and per-cpu pointer */
>>>       if (percpu)
>>>           percpu_size = LLIST_NODE_SZ + sizeof(void *);
>>> @@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct 
>>> bpf_mem_cache *c)
>>>       drain_mem_cache(c);
>>>   }
>>>   +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
>>> +{
>>> +    struct bpf_mem_caches __percpu *pcc;
>>> +
>>> +    pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, 
>>> GFP_KERNEL | __GFP_ZERO);
>>> +    if (!pcc)
>>> +        return -ENOMEM;
>> __GFP_ZERO is not needed. __alloc_percpu_gfp() will zero the returned
>> area by default.
>
> Thanks. Checked the comments in __alloc_percpu_gfp() and indeed, the 
> returned
> buffer has been zeroed.
>
>>> +
>>> +    ma->caches = pcc;
>>> +    ma->percpu = true;
>>> +    return 0;
>>> +}
>>> +
>>> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
>>> +{
>>> +    static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 
>>> 512, 1024, 2048, 4096};
>>> +    int cpu, i, err, unit_size, percpu_size = 0;
>>> +    struct bpf_mem_caches *cc, __percpu *pcc;
>>> +    struct obj_cgroup *objcg = NULL;
>>> +    struct bpf_mem_cache *c;
>>> +
>>> +    /* room for llist_node and per-cpu pointer */
>>> +    percpu_size = LLIST_NODE_SZ + sizeof(void *);
>>> +
>>> +    i = bpf_mem_cache_idx(size);
>>> +    if (i < 0)
>>> +        return -EINVAL;
>>> +
>>> +    err = 0;
>>> +    pcc = ma->caches;
>>> +    unit_size = sizes[i];
>>> +
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +    objcg = get_obj_cgroup_from_current();
>>> +#endif
>>> +    for_each_possible_cpu(cpu) {
>>> +        cc = per_cpu_ptr(pcc, cpu);
>>> +        c = &cc->cache[i];
>>> +        if (cpu == 0 && c->unit_size)
>>> +            goto out;
>>> +
>>> +        c->unit_size = unit_size;
>>> +        c->objcg = objcg;
>>> +        c->percpu_size = percpu_size;
>>> +        c->tgt = c;
>>> +
>>> +        init_refill_work(c);
>>> +        prefill_mem_cache(c, cpu);
>>> +
>>> +        if (cpu == 0) {
>>> +            err = check_obj_size(c, i);
>>> +            if (err) {
>>> +                bpf_mem_alloc_destroy_cache(c);
>> It seems drain_mem_cache() will be enough. Have you considered setting
>
> At prefill stage, looks like the following is enough:
>     free_all(__llist_del_all(&c->free_llist), percpu);
> But I agree that drain_mem_cache() is simpler and is
> easier for future potential code change.
>
>> low_watermark as 0 to prevent potential refill in unit_alloc() if the
>> initialization of the current unit fails ?
>
> I think it does make sense. For non-fix-size non-percpu prefill,
> if check_obj_size() failed, the prefill will fail, which include
> all buckets.
>
> In this case, if it fails for a particular bucket, we should
> make sure that bucket always return NULL ptr, so setting the
> low_watermark to 0 does make sense.

Thinking again. If the initialization of the current unit
failed, the verification will fail and the corresponding
bpf program will not be able to do memory alloc, so we
should be fine.

But it is totally possible that some prog later may
call bpf_mem_alloc_percpu_unit_init() again with the
same size/bucket. So we should simply reset bpf_mem_cache
to 0 during the previous failed bpf_mem_alloc_percpu_unit_init()
call. Is it possible that check_obj_size() may initially
returns an error but sometime later something in
the kernel changed and the check_obj_size() with the
same size could return true?


>
>>> +                goto out;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +out:
>>> +    return err;
>>> +}
>>> +
>>>   static void check_mem_cache(struct bpf_mem_cache *c)
>>>   {
>>> WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
>>>
>> .
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  7:27       ` Yonghong Song
@ 2023-12-15  7:40         ` Hou Tao
  2023-12-15 14:20           ` Yonghong Song
  0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-12-15  7:40 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/15/2023 3:27 PM, Yonghong Song wrote:
>
> On 12/14/23 10:50 PM, Yonghong Song wrote:
>>
>> On 12/14/23 7:19 PM, Hou Tao wrote:
>>>
>>> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>> allocation")
>>>> added support for non-fix-size percpu memory allocation.
>>>> Such allocation will allocate percpu memory for all buckets on all
>>>> cpus and the memory consumption is in the order to quadratic.
>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256
>>>> bytes.
>>>> Then let us say, 8 cpus with the same unit size, each cpu
>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024
>>>> bytes.
>>>> So if the number of cpus doubles, the number of memory consumption
>>>> will be 4 times. So for a system with large number of cpus, the
>>>> memory consumption goes up quickly with quadratic order.
>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)

SNIP
>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>> +    objcg = get_obj_cgroup_from_current();
>>>> +#endif
>>>> +    for_each_possible_cpu(cpu) {
>>>> +        cc = per_cpu_ptr(pcc, cpu);
>>>> +        c = &cc->cache[i];
>>>> +        if (cpu == 0 && c->unit_size)
>>>> +            goto out;
>>>> +
>>>> +        c->unit_size = unit_size;
>>>> +        c->objcg = objcg;
>>>> +        c->percpu_size = percpu_size;
>>>> +        c->tgt = c;
>>>> +
>>>> +        init_refill_work(c);
>>>> +        prefill_mem_cache(c, cpu);
>>>> +
>>>> +        if (cpu == 0) {
>>>> +            err = check_obj_size(c, i);
>>>> +            if (err) {
>>>> +                bpf_mem_alloc_destroy_cache(c);
>>> It seems drain_mem_cache() will be enough. Have you considered setting
>>
>> At prefill stage, looks like the following is enough:
>>     free_all(__llist_del_all(&c->free_llist), percpu);
>> But I agree that drain_mem_cache() is simpler and is
>> easier for future potential code change.
>>
>>> low_watermark as 0 to prevent potential refill in unit_alloc() if the
>>> initialization of the current unit fails ?
>>
>> I think it does make sense. For non-fix-size non-percpu prefill,
>> if check_obj_size() failed, the prefill will fail, which include
>> all buckets.
>>
>> In this case, if it fails for a particular bucket, we should
>> make sure that bucket always return NULL ptr, so setting the
>> low_watermark to 0 does make sense.
>
> Thinking again. If the initialization of the current unit
> failed, the verification will fail and the corresponding
> bpf program will not be able to do memory alloc, so we
> should be fine.
>
> But it is totally possible that some prog later may
> call bpf_mem_alloc_percpu_unit_init() again with the
> same size/bucket. So we should simply reset bpf_mem_cache
> to 0 during the previous failed bpf_mem_alloc_percpu_unit_init()
> call. Is it possible that check_obj_size() may initially
> returns an error but sometime later something in
> the kernel changed and the check_obj_size() with the
> same size could return true?

Resetting bpf_mem_cache as 0 is much simpler and easier to understand
than resetting low_watermark as 0. For per-cpu allocation, the return
value of pcpu_alloc_size() is stable and I don't think it will change
like ksize() does(), so it is not possible that the previous
check_obj_size() failed, but the new check_obj_size() for the same
unit_size succeeds.

>
>
>>
>>>> +                goto out;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +out:
>>>> +    return err;
>>>> +}
>>>> +
>>>>   static void check_mem_cache(struct bpf_mem_cache *c)
>>>>   {
>>>> WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
>>>>
>>> .
>>>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-15  7:40         ` Hou Tao
@ 2023-12-15 14:20           ` Yonghong Song
  0 siblings, 0 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15 14:20 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/14/23 11:40 PM, Hou Tao wrote:
> Hi,
>
> On 12/15/2023 3:27 PM, Yonghong Song wrote:
>> On 12/14/23 10:50 PM, Yonghong Song wrote:
>>> On 12/14/23 7:19 PM, Hou Tao wrote:
>>>> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>>> allocation")
>>>>> added support for non-fix-size percpu memory allocation.
>>>>> Such allocation will allocate percpu memory for all buckets on all
>>>>> cpus and the memory consumption is in the order to quadratic.
>>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256
>>>>> bytes.
>>>>> Then let us say, 8 cpus with the same unit size, each cpu
>>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024
>>>>> bytes.
>>>>> So if the number of cpus doubles, the number of memory consumption
>>>>> will be 4 times. So for a system with large number of cpus, the
>>>>> memory consumption goes up quickly with quadratic order.
>>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
> SNIP
>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>> +    objcg = get_obj_cgroup_from_current();
>>>>> +#endif
>>>>> +    for_each_possible_cpu(cpu) {
>>>>> +        cc = per_cpu_ptr(pcc, cpu);
>>>>> +        c = &cc->cache[i];
>>>>> +        if (cpu == 0 && c->unit_size)
>>>>> +            goto out;
>>>>> +
>>>>> +        c->unit_size = unit_size;
>>>>> +        c->objcg = objcg;
>>>>> +        c->percpu_size = percpu_size;
>>>>> +        c->tgt = c;
>>>>> +
>>>>> +        init_refill_work(c);
>>>>> +        prefill_mem_cache(c, cpu);
>>>>> +
>>>>> +        if (cpu == 0) {
>>>>> +            err = check_obj_size(c, i);
>>>>> +            if (err) {
>>>>> +                bpf_mem_alloc_destroy_cache(c);
>>>> It seems drain_mem_cache() will be enough. Have you considered setting
>>> At prefill stage, looks like the following is enough:
>>>      free_all(__llist_del_all(&c->free_llist), percpu);
>>> But I agree that drain_mem_cache() is simpler and is
>>> easier for future potential code change.
>>>
>>>> low_watermark as 0 to prevent potential refill in unit_alloc() if the
>>>> initialization of the current unit fails ?
>>> I think it does make sense. For non-fix-size non-percpu prefill,
>>> if check_obj_size() failed, the prefill will fail, which include
>>> all buckets.
>>>
>>> In this case, if it fails for a particular bucket, we should
>>> make sure that bucket always return NULL ptr, so setting the
>>> low_watermark to 0 does make sense.
>> Thinking again. If the initialization of the current unit
>> failed, the verification will fail and the corresponding
>> bpf program will not be able to do memory alloc, so we
>> should be fine.
>>
>> But it is totally possible that some prog later may
>> call bpf_mem_alloc_percpu_unit_init() again with the
>> same size/bucket. So we should simply reset bpf_mem_cache
>> to 0 during the previous failed bpf_mem_alloc_percpu_unit_init()
>> call. Is it possible that check_obj_size() may initially
>> returns an error but sometime later something in
>> the kernel changed and the check_obj_size() with the
>> same size could return true?
> Resetting bpf_mem_cache as 0 is much simpler and easier to understand
> than resetting low_watermark as 0. For per-cpu allocation, the return
> value of pcpu_alloc_size() is stable and I don't think it will change
> like ksize() does(), so it is not possible that the previous
> check_obj_size() failed, but the new check_obj_size() for the same
> unit_size succeeds.

Thanks for clarification. Let me just do resetting bpf_mem_cache to 0 then.

>
>>
>>>>> +                goto out;
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +out:
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>>    static void check_mem_cache(struct bpf_mem_cache *c)
>>>>>    {
>>>>> WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
>>>>>
>>>> .
>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 4/6] bpf: Refill only one percpu element in memalloc
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (2 preceding siblings ...)
  2023-12-15  0:12 ` [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
@ 2023-12-15  0:12 ` Yonghong Song
  2023-12-15  0:12 ` [PATCH bpf-next v2 5/6] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
  2023-12-15  0:12 ` [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
  5 siblings, 0 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:12 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Typically for percpu map element or data structure, once allocated,
most operations are lookup or in-place update. Deletion are really
rare. Currently, for percpu data strcture, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index aea4cd07c7b6..7b5bd1294b77 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -485,11 +485,16 @@ static void init_refill_work(struct bpf_mem_cache *c)
 
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
-	/* To avoid consuming memory assume that 1st run of bpf
-	 * prog won't be doing more than 4 map_update_elem from
-	 * irq disabled region
+	int cnt = 1;
+
+	/* To avoid consuming memory, for non-percpu allocation, assume that
+	 * 1st run of bpf prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region if unit size is less than or equal to 256.
+	 * For all other cases, let us just do one allocation.
 	 */
-	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
+	if (!c->percpu_size && c->unit_size <= 256)
+		cnt = 4;
+	alloc_bulk(c, cnt, cpu_to_node(cpu), false);
 }
 
 static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 5/6] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (3 preceding siblings ...)
  2023-12-15  0:12 ` [PATCH bpf-next v2 4/6] bpf: Refill only one percpu element in memalloc Yonghong Song
@ 2023-12-15  0:12 ` Yonghong Song
  2023-12-15  0:12 ` [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
  5 siblings, 0 replies; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:12 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB
memory.

We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/verifier.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ce62ee0cc8f6..039d699a425d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -192,6 +192,8 @@ struct bpf_verifier_stack_elem {
 					  POISON_POINTER_DELTA))
 #define BPF_MAP_PTR(X)		((struct bpf_map *)((X) & ~BPF_MAP_PTR_UNPRIV))
 
+#define BPF_GLOBAL_PERCPU_MA_MAX_SIZE  512
+
 static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
 static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
 static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
@@ -12083,6 +12085,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 					if (!bpf_global_percpu_ma_set)
 						return -ENOMEM;
 
+					if (ret_t->size > BPF_GLOBAL_PERCPU_MA_MAX_SIZE) {
+						verbose(env, "bpf_percpu_obj_new type size (%d) is greater than %d\n",
+							ret_t->size, BPF_GLOBAL_PERCPU_MA_MAX_SIZE);
+						return -EINVAL;
+					}
+
 					mutex_lock(&bpf_percpu_ma_lock);
 					err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
 					mutex_unlock(&bpf_percpu_ma_lock);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (4 preceding siblings ...)
  2023-12-15  0:12 ` [PATCH bpf-next v2 5/6] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
@ 2023-12-15  0:12 ` Yonghong Song
  2023-12-15  3:33   ` Hou Tao
  5 siblings, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  0:12 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

In the previous patch, the maximum data size for bpf_global_percpu_ma
is 512 bytes. This breaks selftest test_bpf_ma. Let us adjust it
accordingly. Also added a selftest to capture the verification failure
when the allocation size is greater than 512.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 .../selftests/bpf/progs/percpu_alloc_fail.c    | 18 ++++++++++++++++++
 .../testing/selftests/bpf/progs/test_bpf_ma.c  |  9 ---------
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
index 1a891d30f1fe..f2b8eb2ff76f 100644
--- a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
+++ b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
@@ -17,6 +17,10 @@ struct val_with_rb_root_t {
 	struct bpf_spin_lock lock;
 };
 
+struct val_600b_t {
+	char b[600];
+};
+
 struct elem {
 	long sum;
 	struct val_t __percpu_kptr *pc;
@@ -161,4 +165,18 @@ int BPF_PROG(test_array_map_7)
 	return 0;
 }
 
+SEC("?fentry.s/bpf_fentry_test1")
+__failure __msg("bpf_percpu_obj_new type size (600) is greater than 512")
+int BPF_PROG(test_array_map_8)
+{
+	struct val_600b_t __percpu_kptr *p;
+
+	p = bpf_percpu_obj_new(struct val_600b_t);
+	if (!p)
+		return 0;
+
+	bpf_percpu_obj_drop(p);
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
index b685a4aba6bd..68cba55eb828 100644
--- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c
+++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
@@ -188,9 +188,6 @@ DEFINE_ARRAY_WITH_PERCPU_KPTR(128);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(192);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(256);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(512);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(1024);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(2048);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(4096);
 
 SEC("?fentry/" SYS_PREFIX "sys_nanosleep")
 int test_batch_alloc_free(void *ctx)
@@ -259,9 +256,6 @@ int test_batch_percpu_alloc_free(void *ctx)
 	CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6);
 	CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7);
 	CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8);
-	CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 9);
-	CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 10);
-	CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 11);
 
 	return 0;
 }
@@ -283,9 +277,6 @@ int test_percpu_free_through_map_free(void *ctx)
 	CALL_BATCH_PERCPU_ALLOC(192, 128, 6);
 	CALL_BATCH_PERCPU_ALLOC(256, 128, 7);
 	CALL_BATCH_PERCPU_ALLOC(512, 64, 8);
-	CALL_BATCH_PERCPU_ALLOC(1024, 32, 9);
-	CALL_BATCH_PERCPU_ALLOC(2048, 16, 10);
-	CALL_BATCH_PERCPU_ALLOC(4096, 8, 11);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  2023-12-15  0:12 ` [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
@ 2023-12-15  3:33   ` Hou Tao
  2023-12-15  7:38     ` Yonghong Song
  0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-12-15  3:33 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/15/2023 8:12 AM, Yonghong Song wrote:
> In the previous patch, the maximum data size for bpf_global_percpu_ma
> is 512 bytes. This breaks selftest test_bpf_ma. Let us adjust it
> accordingly. Also added a selftest to capture the verification failure
> when the allocation size is greater than 512.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>  .../selftests/bpf/progs/percpu_alloc_fail.c    | 18 ++++++++++++++++++
>  .../testing/selftests/bpf/progs/test_bpf_ma.c  |  9 ---------
>  2 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
> index 1a891d30f1fe..f2b8eb2ff76f 100644
> --- a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
> +++ b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
> @@ -17,6 +17,10 @@ struct val_with_rb_root_t {
>  	struct bpf_spin_lock lock;
>  };
>  
> +struct val_600b_t {
> +	char b[600];
> +};
> +
>  struct elem {
>  	long sum;
>  	struct val_t __percpu_kptr *pc;
> @@ -161,4 +165,18 @@ int BPF_PROG(test_array_map_7)
>  	return 0;
>  }
>  
> +SEC("?fentry.s/bpf_fentry_test1")
> +__failure __msg("bpf_percpu_obj_new type size (600) is greater than 512")
> +int BPF_PROG(test_array_map_8)
> +{
> +	struct val_600b_t __percpu_kptr *p;
> +
> +	p = bpf_percpu_obj_new(struct val_600b_t);
> +	if (!p)
> +		return 0;
> +
> +	bpf_percpu_obj_drop(p);
> +	return 0;
> +}
> +
>  char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
> index b685a4aba6bd..68cba55eb828 100644
> --- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c
> +++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
> @@ -188,9 +188,6 @@ DEFINE_ARRAY_WITH_PERCPU_KPTR(128);
>  DEFINE_ARRAY_WITH_PERCPU_KPTR(192);
>  DEFINE_ARRAY_WITH_PERCPU_KPTR(256);
>  DEFINE_ARRAY_WITH_PERCPU_KPTR(512);
> -DEFINE_ARRAY_WITH_PERCPU_KPTR(1024);
> -DEFINE_ARRAY_WITH_PERCPU_KPTR(2048);
> -DEFINE_ARRAY_WITH_PERCPU_KPTR(4096);

Considering the update in patch "bpf: Avoid unnecessary extra percpu
memory allocation", the definition of DEFINE_ARRAY_WITH_PERCPU_KPTR()
needs update as well, because for 512-sized per-cpu kptr, the tests only
allocate for (512 - sizeof(void *)) bytes. And we could do
DEFINE_ARRAY_WITH_PERCPU_KPTR(8) test after the update. I could do that
after the patch-set is landed if you don't have time to do that.

A bit of off-topic, but it is still relevant. I have a question about
how to forcibly generate BTF info for struct definition in the test ?
Currently, I have to include  bin_data_xx in the definition of
map_value, but I don't want to increase the size of map_value. I had
tried to use BTF_TYPE_EMIT() in prog just like in linux kernel, but it
didn't work.
>  
>  SEC("?fentry/" SYS_PREFIX "sys_nanosleep")
>  int test_batch_alloc_free(void *ctx)
> @@ -259,9 +256,6 @@ int test_batch_percpu_alloc_free(void *ctx)
>  	CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6);
>  	CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7);
>  	CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8);
> -	CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 9);
> -	CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 10);
> -	CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 11);
>  
>  	return 0;
>  }
> @@ -283,9 +277,6 @@ int test_percpu_free_through_map_free(void *ctx)
>  	CALL_BATCH_PERCPU_ALLOC(192, 128, 6);
>  	CALL_BATCH_PERCPU_ALLOC(256, 128, 7);
>  	CALL_BATCH_PERCPU_ALLOC(512, 64, 8);
> -	CALL_BATCH_PERCPU_ALLOC(1024, 32, 9);
> -	CALL_BATCH_PERCPU_ALLOC(2048, 16, 10);
> -	CALL_BATCH_PERCPU_ALLOC(4096, 8, 11);
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  2023-12-15  3:33   ` Hou Tao
@ 2023-12-15  7:38     ` Yonghong Song
  2023-12-15  7:51       ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-12-15  7:38 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/14/23 7:33 PM, Hou Tao wrote:
> Hi,
>
> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>> In the previous patch, the maximum data size for bpf_global_percpu_ma
>> is 512 bytes. This breaks selftest test_bpf_ma. Let us adjust it
>> accordingly. Also added a selftest to capture the verification failure
>> when the allocation size is greater than 512.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   .../selftests/bpf/progs/percpu_alloc_fail.c    | 18 ++++++++++++++++++
>>   .../testing/selftests/bpf/progs/test_bpf_ma.c  |  9 ---------
>>   2 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
>> index 1a891d30f1fe..f2b8eb2ff76f 100644
>> --- a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
>> +++ b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
>> @@ -17,6 +17,10 @@ struct val_with_rb_root_t {
>>   	struct bpf_spin_lock lock;
>>   };
>>   
>> +struct val_600b_t {
>> +	char b[600];
>> +};
>> +
>>   struct elem {
>>   	long sum;
>>   	struct val_t __percpu_kptr *pc;
>> @@ -161,4 +165,18 @@ int BPF_PROG(test_array_map_7)
>>   	return 0;
>>   }
>>   
>> +SEC("?fentry.s/bpf_fentry_test1")
>> +__failure __msg("bpf_percpu_obj_new type size (600) is greater than 512")
>> +int BPF_PROG(test_array_map_8)
>> +{
>> +	struct val_600b_t __percpu_kptr *p;
>> +
>> +	p = bpf_percpu_obj_new(struct val_600b_t);
>> +	if (!p)
>> +		return 0;
>> +
>> +	bpf_percpu_obj_drop(p);
>> +	return 0;
>> +}
>> +
>>   char _license[] SEC("license") = "GPL";
>> diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
>> index b685a4aba6bd..68cba55eb828 100644
>> --- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c
>> +++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
>> @@ -188,9 +188,6 @@ DEFINE_ARRAY_WITH_PERCPU_KPTR(128);
>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(192);
>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(256);
>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(512);
>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(1024);
>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(2048);
>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(4096);
> Considering the update in patch "bpf: Avoid unnecessary extra percpu
> memory allocation", the definition of DEFINE_ARRAY_WITH_PERCPU_KPTR()
> needs update as well, because for 512-sized per-cpu kptr, the tests only
> allocate for (512 - sizeof(void *)) bytes. And we could do
> DEFINE_ARRAY_WITH_PERCPU_KPTR(8) test after the update. I could do that
> after the patch-set is landed if you don't have time to do that.
>
> A bit of off-topic, but it is still relevant. I have a question about
> how to forcibly generate BTF info for struct definition in the test ?
> Currently, I have to include  bin_data_xx in the definition of
> map_value, but I don't want to increase the size of map_value. I had
> tried to use BTF_TYPE_EMIT() in prog just like in linux kernel, but it
> didn't work.

Since you mentioned the btf generation issue, I did some investigation.
To workaround btf generation issue, we can use the method in
prog_tests/local_kptr_stash.c:

====
/* This is necessary so that LLVM generates BTF for node_data struct
  * If it's not included, a fwd reference for node_data will be generated but
  * no struct. Example BTF of "node" field in map_value when not included:
  *
  * [10] PTR '(anon)' type_id=35
  * [34] FWD 'node_data' fwd_kind=struct
  * [35] TYPE_TAG 'kptr_ref' type_id=34
  *
  * (with no node_data struct defined)
  * Had to do the same w/ bpf_kfunc_call_test_release below
  */
struct node_data *just_here_because_btf_bug;
struct refcounted_node *just_here_because_btf_bug2;
====

I have hacked the test_bpf_ma.c files and something like below
should work to generate btf types:

         struct bin_data_##_size { \
                 char data[_size - sizeof(void *)]; \
         }; \
+       /* See Commit 5d8d6634ccc, force btf generation for type bin_data_##_size */    \
+       struct bin_data_##_size *__bin_data_##_size; \
         struct map_value_##_size { \
                 struct bin_data_##_size __kptr * data; \
-               /* To emit BTF info for bin_data_xx */ \
-               struct bin_data_##_size not_used; \
         }; \
         struct { \
                 __uint(type, BPF_MAP_TYPE_ARRAY); \
@@ -40,8 +43,12 @@ int pid = 0;
         } array_##_size SEC(".maps")
  
  #define DEFINE_ARRAY_WITH_PERCPU_KPTR(_size) \
+       struct percpu_bin_data_##_size { \
+               char data[_size]; \
+       }; \
+       struct percpu_bin_data_##_size *__percpu_bin_data_##_size; \
         struct map_value_percpu_##_size { \
-               struct bin_data_##_size __percpu_kptr * data; \
+               struct percpu_bin_data_##_size __percpu_kptr * data; \
         }; \
         struct { \
                 __uint(type, BPF_MAP_TYPE_ARRAY); \

I have a prototype to ensure the type (for percpu kptr) removing these
'- sizeof(void *)' and enabling DEFINE_ARRAY_WITH_PERCPU_KPTR().
Once we resolved the check_obj_size() issue, I can then post v3.

>>   
>>   SEC("?fentry/" SYS_PREFIX "sys_nanosleep")
>>   int test_batch_alloc_free(void *ctx)
>> @@ -259,9 +256,6 @@ int test_batch_percpu_alloc_free(void *ctx)
>>   	CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6);
>>   	CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7);
>>   	CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8);
>> -	CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 9);
>> -	CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 10);
>> -	CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 11);
>>   
>>   	return 0;
>>   }
>> @@ -283,9 +277,6 @@ int test_percpu_free_through_map_free(void *ctx)
>>   	CALL_BATCH_PERCPU_ALLOC(192, 128, 6);
>>   	CALL_BATCH_PERCPU_ALLOC(256, 128, 7);
>>   	CALL_BATCH_PERCPU_ALLOC(512, 64, 8);
>> -	CALL_BATCH_PERCPU_ALLOC(1024, 32, 9);
>> -	CALL_BATCH_PERCPU_ALLOC(2048, 16, 10);
>> -	CALL_BATCH_PERCPU_ALLOC(4096, 8, 11);
>>   
>>   	return 0;
>>   }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  2023-12-15  7:38     ` Yonghong Song
@ 2023-12-15  7:51       ` Hou Tao
  0 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-12-15  7:51 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/15/2023 3:38 PM, Yonghong Song wrote:
>
> On 12/14/23 7:33 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/15/2023 8:12 AM, Yonghong Song wrote:
>>> In the previous patch, the maximum data size for bpf_global_percpu_ma
>>> is 512 bytes. This breaks selftest test_bpf_ma. Let us adjust it
>>> accordingly. Also added a selftest to capture the verification failure
>>> when the allocation size is greater than 512.
>>>
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>>

SNIP
>>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(192);
>>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(256);
>>>   DEFINE_ARRAY_WITH_PERCPU_KPTR(512);
>>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(1024);
>>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(2048);
>>> -DEFINE_ARRAY_WITH_PERCPU_KPTR(4096);
>> Considering the update in patch "bpf: Avoid unnecessary extra percpu
>> memory allocation", the definition of DEFINE_ARRAY_WITH_PERCPU_KPTR()
>> needs update as well, because for 512-sized per-cpu kptr, the tests only
>> allocate for (512 - sizeof(void *)) bytes. And we could do
>> DEFINE_ARRAY_WITH_PERCPU_KPTR(8) test after the update. I could do that
>> after the patch-set is landed if you don't have time to do that.
>>
>> A bit of off-topic, but it is still relevant. I have a question about
>> how to forcibly generate BTF info for struct definition in the test ?
>> Currently, I have to include  bin_data_xx in the definition of
>> map_value, but I don't want to increase the size of map_value. I had
>> tried to use BTF_TYPE_EMIT() in prog just like in linux kernel, but it
>> didn't work.
>
> Since you mentioned the btf generation issue, I did some investigation.
> To workaround btf generation issue, we can use the method in
> prog_tests/local_kptr_stash.c:
>
> ====
> /* This is necessary so that LLVM generates BTF for node_data struct
>  * If it's not included, a fwd reference for node_data will be
> generated but
>  * no struct. Example BTF of "node" field in map_value when not included:
>  *
>  * [10] PTR '(anon)' type_id=35
>  * [34] FWD 'node_data' fwd_kind=struct
>  * [35] TYPE_TAG 'kptr_ref' type_id=34
>  *
>  * (with no node_data struct defined)
>  * Had to do the same w/ bpf_kfunc_call_test_release below
>  */
> struct node_data *just_here_because_btf_bug;
> struct refcounted_node *just_here_because_btf_bug2;
> ====

Totally missed it. Thanks for pointing it out to me.
>
> I have hacked the test_bpf_ma.c files and something like below
> should work to generate btf types:
>
>         struct bin_data_##_size { \
>                 char data[_size - sizeof(void *)]; \
>         }; \
> +       /* See Commit 5d8d6634ccc, force btf generation for type
> bin_data_##_size */    \
> +       struct bin_data_##_size *__bin_data_##_size; \
>         struct map_value_##_size { \
>                 struct bin_data_##_size __kptr * data; \
> -               /* To emit BTF info for bin_data_xx */ \
> -               struct bin_data_##_size not_used; \
>         }; \
>         struct { \
>                 __uint(type, BPF_MAP_TYPE_ARRAY); \
> @@ -40,8 +43,12 @@ int pid = 0;
>         } array_##_size SEC(".maps")
>  
>  #define DEFINE_ARRAY_WITH_PERCPU_KPTR(_size) \
> +       struct percpu_bin_data_##_size { \
> +               char data[_size]; \
> +       }; \
> +       struct percpu_bin_data_##_size *__percpu_bin_data_##_size; \
>         struct map_value_percpu_##_size { \
> -               struct bin_data_##_size __percpu_kptr * data; \
> +               struct percpu_bin_data_##_size __percpu_kptr * data; \
>         }; \
>         struct { \
>                 __uint(type, BPF_MAP_TYPE_ARRAY); \
>
> I have a prototype to ensure the type (for percpu kptr) removing these
> '- sizeof(void *)' and enabling DEFINE_ARRAY_WITH_PERCPU_KPTR().
> Once we resolved the check_obj_size() issue, I can then post v3.

Thanks for the update and it looks fine to me. Looking forwards to v3.
>
>>>     SEC("?fentry/" SYS_PREFIX "sys_nanosleep")
>>>   int test_batch_alloc_free(void *ctx)
>>> @@ -259,9 +256,6 @@ int test_batch_percpu_alloc_free(void *ctx)
>>>       CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6);
>>>       CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7);
>>>       CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8);
>>> -    CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 9);
>>> -    CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 10);
>>> -    CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 11);
>>>         return 0;
>>>   }
>>> @@ -283,9 +277,6 @@ int test_percpu_free_through_map_free(void *ctx)
>>>       CALL_BATCH_PERCPU_ALLOC(192, 128, 6);
>>>       CALL_BATCH_PERCPU_ALLOC(256, 128, 7);
>>>       CALL_BATCH_PERCPU_ALLOC(512, 64, 8);
>>> -    CALL_BATCH_PERCPU_ALLOC(1024, 32, 9);
>>> -    CALL_BATCH_PERCPU_ALLOC(2048, 16, 10);
>>> -    CALL_BATCH_PERCPU_ALLOC(4096, 8, 11);
>>>         return 0;
>>>   }


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-12-15 14:20 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-15  0:11 [PATCH bpf-next v2 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
2023-12-15  0:11 ` [PATCH bpf-next v2 1/6] bpf: Refactor to have a memalloc cache destroying function Yonghong Song
2023-12-15  0:12 ` [PATCH bpf-next v2 2/6] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
2023-12-15  3:40   ` Hou Tao
2023-12-15  0:12 ` [PATCH bpf-next v2 3/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
2023-12-15  2:45   ` Yonghong Song
2023-12-15  3:19   ` Hou Tao
2023-12-15  6:50     ` Yonghong Song
2023-12-15  7:27       ` Yonghong Song
2023-12-15  7:40         ` Hou Tao
2023-12-15 14:20           ` Yonghong Song
2023-12-15  0:12 ` [PATCH bpf-next v2 4/6] bpf: Refill only one percpu element in memalloc Yonghong Song
2023-12-15  0:12 ` [PATCH bpf-next v2 5/6] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
2023-12-15  0:12 ` [PATCH bpf-next v2 6/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
2023-12-15  3:33   ` Hou Tao
2023-12-15  7:38     ` Yonghong Song
2023-12-15  7:51       ` Hou Tao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox