[PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma

* [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma
@ 2023-12-21  4:59 Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 1/8] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  4:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Currently, when a bpf program intends to allocate memory for a percpu kptr,
the verifier calls bpf_mem_alloc_init() to prefill all supported
unit sizes, which makes memory consumption very large on systems with a
large number of cpus. For example, on a 128-cpu system, the total memory
consumption with the initial prefill is ~175MB. It gets worse on systems
with even more cpus.

Patch 1 avoids unnecessary extra percpu memory allocation.
Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
associated with root cgroup and objcg can be passed to later
bpf_mem_alloc_percpu_unit_init().
Patch 3 addresses the memory consumption issue by not prefilling
all unit sizes, i.e. prefilling only the user-specified size.
Patch 4 further reduces memory consumption by limiting the
number of prefill entries for percpu memory allocation.
Patch 5 has much smaller low/high watermarks for percpu allocation
to reduce memory consumption.
Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
when allocation size is greater than 512 bytes.
Patch 7 fixes the test_bpf_ma test, which breaks due to Patch 6.
Patch 8 adds one test to show the verification failure log message.

Changelogs:
  v4 -> v5:
    . Do not do bpf_global_percpu_ma initialization at init stage, instead
      doing initialization when the verifier knows it is going to be used
      by bpf prog.
    . Using much smaller low/high watermarks for percpu allocation.
  v3 -> v4:
    . Add objcg to bpf_mem_alloc during init stage.
    . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init().
    . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init().
  v2 -> v3:
    . Clear the bpf_mem_cache if prefill fails.
    . Change test_bpf_ma percpu allocation tests to use bucket_size
      as allocation size instead of bucket_size - 8.
    . Remove __GFP_ZERO flag from __alloc_percpu_gfp() call.
  v1 -> v2:
    . Avoid unnecessary extra percpu memory allocation.
    . Add a separate function to do bpf_global_percpu_ma initialization.
    . Promote function static 'sizes' array to file static.
    . Add comments to explain to refill only one item for percpu alloc.

Yonghong Song (8):
  bpf: Avoid unnecessary extra percpu memory allocation
  bpf: Add objcg to bpf_mem_alloc
  bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  bpf: Refill only one percpu element in memalloc
  bpf: Use smaller low/high marks for percpu allocation
  bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
  selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  selftests/bpf: Add a selftest with > 512-byte percpu allocation size

 include/linux/bpf_mem_alloc.h                 |  8 ++
 kernel/bpf/memalloc.c                         | 98 ++++++++++++++++---
 kernel/bpf/verifier.c                         | 42 +++++---
 .../selftests/bpf/prog_tests/test_bpf_ma.c    | 20 ++--
 .../selftests/bpf/progs/percpu_alloc_fail.c   | 18 ++++
 .../testing/selftests/bpf/progs/test_bpf_ma.c | 66 ++++++-------
 6 files changed, 186 insertions(+), 66 deletions(-)

-- 
2.34.1


* [PATCH bpf-next v5 1/8] bpf: Avoid unnecessary extra percpu memory allocation
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 2/8] bpf: Add objcg to bpf_mem_alloc Yonghong Song
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

Currently, for percpu memory allocation, if the user requests an
allocation size of 32 bytes, the calculated size is 40 bytes, which is
further rounded up to 64 bytes. Eventually 64 bytes are allocated,
wasting 32 bytes of memory.

Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.
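
The effect on bucket selection can be sketched in plain userspace C.
This is only an illustration, not the kernel code: bucket_for(),
bucket_before(), bucket_after() and NODE_SZ are hypothetical names, the
linear search stands in for bpf_mem_cache_idx(), and NODE_SZ assumes an
8-byte llist_node (one pointer on 64-bit).

```c
#include <stddef.h>

/* Bucket sizes in ascending order (same values as the allocator's size classes). */
static const size_t bucket_sizes[] = {
	16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096
};
#define NUM_BUCKETS (sizeof(bucket_sizes) / sizeof(bucket_sizes[0]))

/* Assumption: sizeof(struct llist_node) == 8 on a 64-bit build. */
#define NODE_SZ 8

/* Hypothetical helper: smallest bucket that fits 'size', or 0 if none does. */
static size_t bucket_for(size_t size)
{
	for (size_t i = 0; i < NUM_BUCKETS; i++)
		if (size <= bucket_sizes[i])
			return bucket_sizes[i];
	return 0;
}

/* Before the change: the llist_node overhead is always added. */
static size_t bucket_before(size_t size)
{
	return bucket_for(size + NODE_SZ);
}

/* After the change: the overhead is added only for non-percpu allocation. */
static size_t bucket_after(size_t size, int percpu)
{
	return bucket_for(percpu ? size : size + NODE_SZ);
}
```

With these definitions, a 32-byte percpu request maps to the 64-byte
bucket before the change and to the 32-byte bucket after it, while a
non-percpu request is unaffected.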

Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index aa0fbf000a12..288ec4a967d0 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -833,7 +833,9 @@ void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
 	if (!size)
 		return NULL;
 
-	idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ);
+	if (!ma->percpu)
+		size += LLIST_NODE_SZ;
+	idx = bpf_mem_cache_idx(size);
 	if (idx < 0)
 		return NULL;
 
-- 
2.34.1



* [PATCH bpf-next v5 2/8] bpf: Add objcg to bpf_mem_alloc
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 1/8] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
share the same objcg. This patch makes that property explicit.
The next patch will use this property to save and restore objcg
for the percpu unit allocator.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf_mem_alloc.h |  1 +
 kernel/bpf/memalloc.c         | 11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index bb1223b21308..acef8c808599 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -11,6 +11,7 @@ struct bpf_mem_caches;
 struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
+	struct obj_cgroup *objcg;
 	bool percpu;
 	struct work_struct work;
 };
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 288ec4a967d0..4a21050f0359 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -523,6 +523,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 		if (memcg_bpf_enabled())
 			objcg = get_obj_cgroup_from_current();
 #endif
+		ma->objcg = objcg;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(pc, cpu);
 			c->unit_size = unit_size;
@@ -542,6 +543,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 #ifdef CONFIG_MEMCG_KMEM
 	objcg = get_obj_cgroup_from_current();
 #endif
+	ma->objcg = objcg;
 	for_each_possible_cpu(cpu) {
 		cc = per_cpu_ptr(pcc, cpu);
 		for (i = 0; i < NUM_CACHES; i++) {
@@ -691,9 +693,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
-		/* objcg is the same across cpus */
-		if (c->objcg)
-			obj_cgroup_put(c->objcg);
+		if (ma->objcg)
+			obj_cgroup_put(ma->objcg);
 		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 	if (ma->caches) {
@@ -709,8 +710,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
 		}
-		if (c->objcg)
-			obj_cgroup_put(c->objcg);
+		if (ma->objcg)
+			obj_cgroup_put(ma->objcg);
 		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 }
-- 
2.34.1



* [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 1/8] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 2/8] bpf: Add objcg to bpf_mem_alloc Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  6:26   ` Hou Tao
  2023-12-21  5:00 ` [PATCH bpf-next v5 4/8] bpf: Refill only one percpu element in memalloc Yonghong Song
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation allocates percpu memory for all buckets on all
cpus, so memory consumption grows quadratically with the number of cpus.
For example, with 4 cpus and a unit size of 16 bytes, each
cpu holds 16 * 4 = 64 bytes, and the total across 4 cpus is 64 * 4 = 256 bytes.
With 8 cpus and the same unit size, each cpu
holds 16 * 8 = 128 bytes, and the total across 8 cpus is 128 * 8 = 1024 bytes.
So if the number of cpus doubles, memory consumption
quadruples, and on a system with a large number of cpus it
goes up quickly.
For example, for a 4KB percpu allocation on 128 cpus, the total memory
consumption will be 4KB * 128 * 128 = 64MB. Things get
worse if the number of cpus is bigger (e.g., 512, 1024, etc.).
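
The quadratic growth can be checked with a few lines of userspace C.
This is a rough model only: percpu_prefill_bytes() is a hypothetical
helper that assumes one prefilled object per cpu-local cache and ignores
per-object bookkeeping overhead.

```c
#include <stdint.h>

/* Rough model: every cpu-local cache holds one percpu object, and each
 * percpu object itself has one copy per cpu, hence the ncpus * ncpus factor.
 */
static uint64_t percpu_prefill_bytes(uint64_t unit_size, uint64_t ncpus)
{
	return unit_size * ncpus * ncpus;
}
```

Doubling ncpus from 4 to 8 with a 16-byte unit takes the result from
256 to 1024 bytes, i.e. a factor of 4.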

In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
done at boot time, so on a system with a large number of cpus, the initial
percpu memory consumption is very visible. For example, on a 128-cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 192 + 256 + 512 + 1024 + 2048 + 4096)
  * 128 * 128 = ~138MB,
which is pretty big. It will be even bigger for a larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size
is at most 256. So on top of the 138MB memory consumption, this
adds another
3 * (16 + 32 + 64 + 96 + 128 + 192 + 256) * 128 * 128 = ~38MB.
The next patch will try to reduce this memory consumption.
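
The two totals can be reproduced with the userspace sketch below. It is
an estimate under stated assumptions, not allocator code:
prefill_base_bytes() and prefill_extra_bytes() are hypothetical names,
and per-object overhead (llist_node, percpu pointer) is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Size classes of the allocator, in ascending order. */
static const uint64_t unit_sizes[] = {
	16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096
};
#define NUM_UNITS (sizeof(unit_sizes) / sizeof(unit_sizes[0]))

/* One prefilled object per size class per cpu; each object spans all cpus. */
static uint64_t prefill_base_bytes(uint64_t ncpus)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < NUM_UNITS; i++)
		sum += unit_sizes[i];
	return sum * ncpus * ncpus;
}

/* Three additional objects for every size class of at most 256 bytes. */
static uint64_t prefill_extra_bytes(uint64_t ncpus)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < NUM_UNITS; i++)
		if (unit_sizes[i] <= 256)
			sum += unit_sizes[i];
	return 3 * sum * ncpus * ncpus;
}
```

For 128 cpus the first function gives 138,674,176 bytes and the second
38,535,168 bytes, matching the ~138MB and ~38MB figures.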

Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to the bpf verification stage. Once a particular bpf_percpu_obj_new()
is called by a bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.

To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch fills the cache
only for the requested size. Since users typically do not use large
percpu data structures, this can save memory significantly.
For example, with an allocation size of 64 bytes and 128 cpus,
the total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than the previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf_mem_alloc.h |  7 ++++
 kernel/bpf/memalloc.c         | 62 ++++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c         | 34 +++++++++++--------
 3 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index acef8c808599..d1403204379e 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -22,8 +22,15 @@ struct bpf_mem_alloc {
  * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
  * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
  * the returned object is given by the size argument of bpf_mem_alloc().
+ * If percpu equals true, error will be returned in order to avoid
+ * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
+ * should be used to do on-demand per-cpu allocation for each size.
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+/* Initialize a non-fix-size percpu memory allocator */
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
+/* The percpu allocation with a specific unit size. */
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 4a21050f0359..a9c87ef4b89a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -121,6 +121,8 @@ struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -499,12 +501,14 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
-	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 	int cpu, i, unit_size, percpu_size = 0;
 
+	if (percpu && size == 0)
+		return -EINVAL;
+
 	/* room for llist_node and per-cpu pointer */
 	if (percpu)
 		percpu_size = LLIST_NODE_SZ + sizeof(void *);
@@ -524,6 +528,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			objcg = get_obj_cgroup_from_current();
 #endif
 		ma->objcg = objcg;
+
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(pc, cpu);
 			c->unit_size = unit_size;
@@ -562,6 +567,61 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 	return 0;
 }
 
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches __percpu *pcc;
+
+	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+
+	ma->caches = pcc;
+	ma->percpu = true;
+
+#ifdef CONFIG_MEMCG_KMEM
+	ma->objcg = get_obj_cgroup_from_current();
+#else
+	ma->objcg = NULL;
+#endif
+	return 0;
+}
+
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
+{
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	int cpu, i, unit_size, percpu_size;
+	struct obj_cgroup *objcg;
+	struct bpf_mem_cache *c;
+
+	i = bpf_mem_cache_idx(size);
+	if (i < 0)
+		return -EINVAL;
+
+	/* room for llist_node and per-cpu pointer */
+	percpu_size = LLIST_NODE_SZ + sizeof(void *);
+
+	unit_size = sizes[i];
+	objcg = ma->objcg;
+	pcc = ma->caches;
+
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		c = &cc->cache[i];
+		if (cpu == 0 && c->unit_size)
+			break;
+
+		c->unit_size = unit_size;
+		c->objcg = objcg;
+		c->percpu_size = percpu_size;
+		c->tgt = c;
+
+		init_refill_work(c);
+		prefill_mem_cache(c, cpu);
+	}
+
+	return 0;
+}
+
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	bool percpu = !!c->percpu_size;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f13008d27f35..08f9a49cc11c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 				if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
 					return -ENOMEM;
 
-				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
-					if (!bpf_global_percpu_ma_set) {
-						mutex_lock(&bpf_percpu_ma_lock);
-						if (!bpf_global_percpu_ma_set) {
-							err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
-							if (!err)
-								bpf_global_percpu_ma_set = true;
-						}
-						mutex_unlock(&bpf_percpu_ma_lock);
-						if (err)
-							return err;
-					}
-				}
-
 				if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
 					verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
 					return -EINVAL;
@@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 					return -EINVAL;
 				}
 
+				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
+					if (!bpf_global_percpu_ma_set) {
+						mutex_lock(&bpf_percpu_ma_lock);
+						if (!bpf_global_percpu_ma_set) {
+							err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
+							if (!err)
+								bpf_global_percpu_ma_set = true;
+						}
+						mutex_unlock(&bpf_percpu_ma_lock);
+						if (err)
+							return err;
+					}
+
+					mutex_lock(&bpf_percpu_ma_lock);
+					err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
+					mutex_unlock(&bpf_percpu_ma_lock);
+					if (err)
+						return err;
+				}
+
 				struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
 				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 					if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
-- 
2.34.1



* [PATCH bpf-next v5 4/8] bpf: Refill only one percpu element in memalloc
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (2 preceding siblings ...)
  2023-12-21  5:00 ` [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 5/8] bpf: Use smaller low/high marks for percpu allocation Yonghong Song
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

Typically, for a percpu map element or data structure, once allocated,
most operations are lookups or in-place updates. Deletions are really
rare. Currently, for a percpu data structure, 4 elements will be
refilled if the size is <= 256. Let us refill just one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.
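
The prefill decision and the resulting saving can be modelled in
userspace as follows. This is a sketch only: prefill_count_old(),
prefill_count_new() and prefill_saving_bytes() are hypothetical helpers,
and the saving ignores per-object overhead.

```c
#include <stdint.h>

/* Old behaviour: 4 entries for small units regardless of percpu. */
static int prefill_count_old(int unit_size)
{
	return unit_size <= 256 ? 4 : 1;
}

/* New behaviour: 4 entries only for small non-percpu units, else 1. */
static int prefill_count_new(int unit_size, int percpu)
{
	if (!percpu && unit_size <= 256)
		return 4;
	return 1;
}

/* Bytes saved for one percpu size class across all cpus. */
static uint64_t prefill_saving_bytes(int unit_size, uint64_t ncpus)
{
	int saved = prefill_count_old(unit_size) - prefill_count_new(unit_size, 1);

	return (uint64_t)saved * (uint64_t)unit_size * ncpus * ncpus;
}
```

Non-percpu caches keep their existing prefill count, so only percpu
users see a behaviour change.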

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index a9c87ef4b89a..99fa201d350b 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -485,11 +485,16 @@ static void init_refill_work(struct bpf_mem_cache *c)
 
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
-	/* To avoid consuming memory assume that 1st run of bpf
-	 * prog won't be doing more than 4 map_update_elem from
-	 * irq disabled region
+	int cnt = 1;
+
+	/* To avoid consuming memory, for non-percpu allocation, assume that
+	 * 1st run of bpf prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region if unit size is less than or equal to 256.
+	 * For all other cases, let us just do one allocation.
 	 */
-	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
+	if (!c->percpu_size && c->unit_size <= 256)
+		cnt = 4;
+	alloc_bulk(c, cnt, cpu_to_node(cpu), false);
 }
 
 /* When size != 0 bpf_mem_cache for each cpu.
-- 
2.34.1



* [PATCH bpf-next v5 5/8] bpf: Use smaller low/high marks for percpu allocation
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (3 preceding siblings ...)
  2023-12-21  5:00 ` [PATCH bpf-next v5 4/8] bpf: Refill only one percpu element in memalloc Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming a 128-cpu system, the total memory consumption
across all cpus will be 12KB * 128 = 1.5MB.

This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.

In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.
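
How the marks translate into a batch size can be sketched in userspace
C. Note the assumptions: the formula for large non-percpu units and the
batch computation (3/4 of the gap between the marks, at least 1) are
written from memory of memalloc.c and are not part of this diff;
struct marks and refill_marks() are hypothetical names.

```c
/* Hypothetical container for the three values derived per cache. */
struct marks {
	int low;
	int high;
	int batch;
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

static struct marks refill_marks(int unit_size, int percpu)
{
	struct marks m;

	if (percpu) {
		/* Values introduced by this patch. */
		m.low = 1;
		m.high = 3;
	} else if (unit_size <= 256) {
		m.low = 32;
		m.high = 96;
	} else {
		/* Assumption: scale the marks down for larger units. */
		m.low = max_int(32 * 256 / unit_size, 1);
		m.high = max_int(96 * 256 / unit_size, 3);
	}
	/* Assumption: batch is 3/4 of the gap between marks, at least 1. */
	m.batch = max_int((m.high - m.low) / 4 * 3, 1);
	return m;
}
```

Under these assumptions a non-percpu 256-byte cache refills 48 objects
at a time, while any percpu cache refills a single object.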

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 99fa201d350b..984c83ecace9 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -464,11 +464,17 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
  * consume ~ 11 Kbyte per cpu.
  * Typical case will be between 11K and 116K closer to 11K.
  * bpf progs can and should share bpf_mem_cache when possible.
+ *
+ * Percpu allocation is typically rare. To avoid potential unnecessary large
+ * memory consumption, set low_mark = 1 and high_mark = 3, resulting in c->batch = 1.
  */
 static void init_refill_work(struct bpf_mem_cache *c)
 {
 	init_irq_work(&c->refill_work, bpf_mem_refill);
-	if (c->unit_size <= 256) {
+	if (c->percpu_size) {
+		c->low_watermark = 1;
+		c->high_watermark = 3;
+	} else if (c->unit_size <= 256) {
 		c->low_watermark = 32;
 		c->high_watermark = 96;
 	} else {
-- 
2.34.1


* [PATCH bpf-next v5 6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (4 preceding siblings ...)
  2023-12-21  5:00 ` [PATCH bpf-next v5 5/8] bpf: Use smaller low/high marks for percpu allocation Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size Yonghong Song
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But on a system with a large
number of cpus, a bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1K = 2GB.

We should discourage such usage. Let us limit the maximum data
size to 512 bytes for bpf_global_percpu_ma allocation.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/verifier.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 08f9a49cc11c..fd433b915c8e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -195,6 +195,8 @@ struct bpf_verifier_stack_elem {
 					  POISON_POINTER_DELTA))
 #define BPF_MAP_PTR(X)		((struct bpf_map *)((X) & ~BPF_MAP_PTR_UNPRIV))
 
+#define BPF_GLOBAL_PERCPU_MA_MAX_SIZE  512
+
 static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
 static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
 static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
@@ -12162,6 +12164,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 				}
 
 				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
+					if (ret_t->size > BPF_GLOBAL_PERCPU_MA_MAX_SIZE) {
+						verbose(env, "bpf_percpu_obj_new type size (%d) is greater than %d\n",
+							ret_t->size, BPF_GLOBAL_PERCPU_MA_MAX_SIZE);
+						return -EINVAL;
+					}
+
 					if (!bpf_global_percpu_ma_set) {
 						mutex_lock(&bpf_percpu_ma_lock);
 						if (!bpf_global_percpu_ma_set) {
-- 
2.34.1



* [PATCH bpf-next v5 7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (5 preceding siblings ...)
  2023-12-21  5:00 ` [PATCH bpf-next v5 6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  2023-12-21  5:00 ` [PATCH bpf-next v5 8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size Yonghong Song
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

The previous patch limits the maximum data size for bpf_global_percpu_ma
to 512 bytes. This breaks selftest test_bpf_ma. The test is adjusted
in two aspects:
  - Since the maximum allowed data size for bpf_global_percpu_ma is
    512, remove all tests beyond that, namely sizes 1024, 2048 and 4096.
  - Previously the percpu data size was bucket_size - 8 in order to
    avoid percpu allocation falling into the next bucket. This patch
    removes that data size adjustment thanks to Patch 1.

Also, a better way to generate the BTF type is used than adding
a member to the value struct.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 .../selftests/bpf/prog_tests/test_bpf_ma.c    | 20 ++++--
 .../testing/selftests/bpf/progs/test_bpf_ma.c | 66 +++++++++----------
 2 files changed, 46 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c
index d3491a84b3b9..ccae0b31ac6c 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c
@@ -14,7 +14,8 @@ static void do_bpf_ma_test(const char *name)
 	struct test_bpf_ma *skel;
 	struct bpf_program *prog;
 	struct btf *btf;
-	int i, err;
+	int i, err, id;
+	char tname[32];
 
 	skel = test_bpf_ma__open();
 	if (!ASSERT_OK_PTR(skel, "open"))
@@ -25,16 +26,21 @@ static void do_bpf_ma_test(const char *name)
 		goto out;
 
 	for (i = 0; i < ARRAY_SIZE(skel->rodata->data_sizes); i++) {
-		char name[32];
-		int id;
-
-		snprintf(name, sizeof(name), "bin_data_%u", skel->rodata->data_sizes[i]);
-		id = btf__find_by_name_kind(btf, name, BTF_KIND_STRUCT);
-		if (!ASSERT_GT(id, 0, "bin_data"))
+		snprintf(tname, sizeof(tname), "bin_data_%u", skel->rodata->data_sizes[i]);
+		id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
+		if (!ASSERT_GT(id, 0, tname))
 			goto out;
 		skel->rodata->data_btf_ids[i] = id;
 	}
 
+	for (i = 0; i < ARRAY_SIZE(skel->rodata->percpu_data_sizes); i++) {
+		snprintf(tname, sizeof(tname), "percpu_bin_data_%u", skel->rodata->percpu_data_sizes[i]);
+		id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
+		if (!ASSERT_GT(id, 0, tname))
+			goto out;
+		skel->rodata->percpu_data_btf_ids[i] = id;
+	}
+
 	prog = bpf_object__find_program_by_name(skel->obj, name);
 	if (!ASSERT_OK_PTR(prog, "invalid prog name"))
 		goto out;
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
index 069db9085e78..da8fcb51d13a 100644
--- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c
+++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
@@ -20,6 +20,9 @@ char _license[] SEC("license") = "GPL";
 const unsigned int data_sizes[] = {16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096};
 const volatile unsigned int data_btf_ids[ARRAY_SIZE(data_sizes)] = {};
 
+const unsigned int percpu_data_sizes[] = {8, 16, 32, 64, 96, 128, 192, 256, 512};
+const volatile unsigned int percpu_data_btf_ids[ARRAY_SIZE(data_sizes)] = {};
+
 int err = 0;
 int pid = 0;
 
@@ -27,10 +30,10 @@ int pid = 0;
 	struct bin_data_##_size { \
 		char data[_size - sizeof(void *)]; \
 	}; \
+	/* See Commit 5d8d6634ccc, force btf generation for type bin_data_##_size */	\
+	struct bin_data_##_size *__bin_data_##_size; \
 	struct map_value_##_size { \
 		struct bin_data_##_size __kptr * data; \
-		/* To emit BTF info for bin_data_xx */ \
-		struct bin_data_##_size not_used; \
 	}; \
 	struct { \
 		__uint(type, BPF_MAP_TYPE_ARRAY); \
@@ -40,8 +43,12 @@ int pid = 0;
 	} array_##_size SEC(".maps")
 
 #define DEFINE_ARRAY_WITH_PERCPU_KPTR(_size) \
+	struct percpu_bin_data_##_size { \
+		char data[_size]; \
+	}; \
+	struct percpu_bin_data_##_size *__percpu_bin_data_##_size; \
 	struct map_value_percpu_##_size { \
-		struct bin_data_##_size __percpu_kptr * data; \
+		struct percpu_bin_data_##_size __percpu_kptr * data; \
 	}; \
 	struct { \
 		__uint(type, BPF_MAP_TYPE_ARRAY); \
@@ -114,7 +121,7 @@ static __always_inline void batch_percpu_alloc(struct bpf_map *map, unsigned int
 			return;
 		}
 		/* per-cpu allocator may not be able to refill in time */
-		new = bpf_percpu_obj_new_impl(data_btf_ids[idx], NULL);
+		new = bpf_percpu_obj_new_impl(percpu_data_btf_ids[idx], NULL);
 		if (!new)
 			continue;
 
@@ -179,7 +186,7 @@ DEFINE_ARRAY_WITH_KPTR(1024);
 DEFINE_ARRAY_WITH_KPTR(2048);
 DEFINE_ARRAY_WITH_KPTR(4096);
 
-/* per-cpu kptr doesn't support bin_data_8 which is a zero-sized array */
+DEFINE_ARRAY_WITH_PERCPU_KPTR(8);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(16);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(32);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(64);
@@ -188,9 +195,6 @@ DEFINE_ARRAY_WITH_PERCPU_KPTR(128);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(192);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(256);
 DEFINE_ARRAY_WITH_PERCPU_KPTR(512);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(1024);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(2048);
-DEFINE_ARRAY_WITH_PERCPU_KPTR(4096);
 
 SEC("?fentry/" SYS_PREFIX "sys_nanosleep")
 int test_batch_alloc_free(void *ctx)
@@ -246,20 +250,18 @@ int test_batch_percpu_alloc_free(void *ctx)
 	if ((u32)bpf_get_current_pid_tgid() != pid)
 		return 0;
 
-	/* Alloc 128 16-bytes per-cpu objects in batch to trigger refilling,
-	 * then free 128 16-bytes per-cpu objects in batch to trigger freeing.
+	/* Alloc 128 8-bytes per-cpu objects in batch to trigger refilling,
+	 * then free 128 8-bytes per-cpu objects in batch to trigger freeing.
 	 */
-	CALL_BATCH_PERCPU_ALLOC_FREE(16, 128, 0);
-	CALL_BATCH_PERCPU_ALLOC_FREE(32, 128, 1);
-	CALL_BATCH_PERCPU_ALLOC_FREE(64, 128, 2);
-	CALL_BATCH_PERCPU_ALLOC_FREE(96, 128, 3);
-	CALL_BATCH_PERCPU_ALLOC_FREE(128, 128, 4);
-	CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 5);
-	CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 6);
-	CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 7);
-	CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 8);
-	CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 9);
-	CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 10);
+	CALL_BATCH_PERCPU_ALLOC_FREE(8, 128, 0);
+	CALL_BATCH_PERCPU_ALLOC_FREE(16, 128, 1);
+	CALL_BATCH_PERCPU_ALLOC_FREE(32, 128, 2);
+	CALL_BATCH_PERCPU_ALLOC_FREE(64, 128, 3);
+	CALL_BATCH_PERCPU_ALLOC_FREE(96, 128, 4);
+	CALL_BATCH_PERCPU_ALLOC_FREE(128, 128, 5);
+	CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6);
+	CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7);
+	CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8);
 
 	return 0;
 }
@@ -270,20 +272,18 @@ int test_percpu_free_through_map_free(void *ctx)
 	if ((u32)bpf_get_current_pid_tgid() != pid)
 		return 0;
 
-	/* Alloc 128 16-bytes per-cpu objects in batch to trigger refilling,
+	/* Alloc 128 8-bytes per-cpu objects in batch to trigger refilling,
 	 * then free these object through map free.
 	 */
-	CALL_BATCH_PERCPU_ALLOC(16, 128, 0);
-	CALL_BATCH_PERCPU_ALLOC(32, 128, 1);
-	CALL_BATCH_PERCPU_ALLOC(64, 128, 2);
-	CALL_BATCH_PERCPU_ALLOC(96, 128, 3);
-	CALL_BATCH_PERCPU_ALLOC(128, 128, 4);
-	CALL_BATCH_PERCPU_ALLOC(192, 128, 5);
-	CALL_BATCH_PERCPU_ALLOC(256, 128, 6);
-	CALL_BATCH_PERCPU_ALLOC(512, 64, 7);
-	CALL_BATCH_PERCPU_ALLOC(1024, 32, 8);
-	CALL_BATCH_PERCPU_ALLOC(2048, 16, 9);
-	CALL_BATCH_PERCPU_ALLOC(4096, 8, 10);
+	CALL_BATCH_PERCPU_ALLOC(8, 128, 0);
+	CALL_BATCH_PERCPU_ALLOC(16, 128, 1);
+	CALL_BATCH_PERCPU_ALLOC(32, 128, 2);
+	CALL_BATCH_PERCPU_ALLOC(64, 128, 3);
+	CALL_BATCH_PERCPU_ALLOC(96, 128, 4);
+	CALL_BATCH_PERCPU_ALLOC(128, 128, 5);
+	CALL_BATCH_PERCPU_ALLOC(192, 128, 6);
+	CALL_BATCH_PERCPU_ALLOC(256, 128, 7);
+	CALL_BATCH_PERCPU_ALLOC(512, 64, 8);
 
 	return 0;
 }
-- 
2.34.1



* [PATCH bpf-next v5 8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size
  2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
                   ` (6 preceding siblings ...)
  2023-12-21  5:00 ` [PATCH bpf-next v5 7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
@ 2023-12-21  5:00 ` Yonghong Song
  7 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  5:00 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Hou Tao

Add a selftest to capture the verification failure when the allocation
size is greater than 512.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 .../selftests/bpf/progs/percpu_alloc_fail.c    | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
index 1a891d30f1fe..f2b8eb2ff76f 100644
--- a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
+++ b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c
@@ -17,6 +17,10 @@ struct val_with_rb_root_t {
 	struct bpf_spin_lock lock;
 };
 
+struct val_600b_t {
+	char b[600];
+};
+
 struct elem {
 	long sum;
 	struct val_t __percpu_kptr *pc;
@@ -161,4 +165,18 @@ int BPF_PROG(test_array_map_7)
 	return 0;
 }
 
+SEC("?fentry.s/bpf_fentry_test1")
+__failure __msg("bpf_percpu_obj_new type size (600) is greater than 512")
+int BPF_PROG(test_array_map_8)
+{
+	struct val_600b_t __percpu_kptr *p;
+
+	p = bpf_percpu_obj_new(struct val_600b_t);
+	if (!p)
+		return 0;
+
+	bpf_percpu_obj_drop(p);
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.34.1
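
The behaviour this selftest expects can be sketched in userspace as follows. This is only an illustration: `PERCPU_MA_MAX_SIZE` and `check_percpu_obj_size()` are hypothetical names, not the verifier's own, and only the message format is taken from the `__msg()` string in the test above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical name for the 512-byte cap; the kernel's constant may differ. */
#define PERCPU_MA_MAX_SIZE 512

/*
 * Userspace stand-in for a verifier-side size check: reject a percpu
 * object type larger than the cap and write the failure message to log.
 * Returns 0 if the size is accepted, -EINVAL otherwise.
 */
static int check_percpu_obj_size(unsigned int type_size, char *log,
				 size_t log_len)
{
	assert(log && log_len > 0);
	log[0] = '\0';

	if (type_size > PERCPU_MA_MAX_SIZE) {
		snprintf(log, log_len,
			 "bpf_percpu_obj_new type size (%u) is greater than %d",
			 type_size, PERCPU_MA_MAX_SIZE);
		return -EINVAL;
	}
	return 0;
}
```

With a 600-byte type such as struct val_600b_t the helper returns -EINVAL and fills the log; a 512-byte type is accepted and the log stays empty.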



* Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  5:00 ` [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
@ 2023-12-21  6:26   ` Hou Tao
  2023-12-21  7:16     ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2023-12-21  6:26 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/21/2023 1:00 PM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation.
> Such allocation will allocate percpu memory for all buckets on all
> cpus and the memory consumption is in the order to quadratic.
> For example, let us say, 4 cpus, unit size 16 bytes, so each
> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
> Then let us say, 8 cpus with the same unit size, each cpu
> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
> So if the number of cpus doubles, the number of memory consumption
> will be 4 times. So for a system with large number of cpus, the
> memory consumption goes up quickly with quadratic order.
> For example, for 4KB percpu allocation, 128 cpus. The total memory
> consumption will 4KB * 128 * 128 = 64MB. Things will become
> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>
> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
> done in boot time, so for system with large number of cpus, the initial
> percpu memory consumption is very visible. For example, for 128 cpu
> system, the total percpu memory allocation will be at least
> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>   * 128 * 128 = ~138MB.
> which is pretty big. It will be even bigger for larger number of cpus.
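
The arithmetic quoted above can be checked with a small userspace sketch. It assumes, as the message does, one prefilled object per unit size in each per-cpu cache, so it gives a lower bound only. The helper names are made up for illustration, and the size list uses 192 as in the allocator's `sizes[]` array rather than the 196 written in the message; both give roughly 138MB for 128 cpus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unit sizes in ascending order (192 as in sizes[], not 196). */
static const uint32_t unit_sizes[] = {
	16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096,
};

/*
 * One percpu object of `size` bytes takes size * ncpus bytes, and each
 * of the ncpus per-cpu caches holds one, hence size * ncpus * ncpus.
 */
static uint64_t per_unit_bytes(uint32_t size, uint32_t ncpus)
{
	assert(ncpus > 0);
	return (uint64_t)size * ncpus * ncpus;
}

/* Lower bound on prefill cost when every unit size is prefilled once. */
static uint64_t prefill_lower_bound(uint32_t ncpus)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < sizeof(unit_sizes) / sizeof(unit_sizes[0]); i++)
		total += per_unit_bytes(unit_sizes[i], ncpus);
	return total;
}
```

per_unit_bytes(16, 4) is 256 and per_unit_bytes(16, 8) is 1024, matching the 4-cpu and 8-cpu examples; per_unit_bytes(4096, 128) is 64MB, and prefill_lower_bound(128) is 138,674,176 bytes.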

SNIP
> +
>  static void drain_mem_cache(struct bpf_mem_cache *c)
>  {
>  	bool percpu = !!c->percpu_size;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f13008d27f35..08f9a49cc11c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  				if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>  					return -ENOMEM;
>  
> -				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
> -					if (!bpf_global_percpu_ma_set) {
> -						mutex_lock(&bpf_percpu_ma_lock);
> -						if (!bpf_global_percpu_ma_set) {
> -							err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
> -							if (!err)
> -								bpf_global_percpu_ma_set = true;
> -						}
> -						mutex_unlock(&bpf_percpu_ma_lock);
> -						if (err)
> -							return err;
> -					}
> -				}
> -
>  				if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
>  					verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
>  					return -EINVAL;
> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  					return -EINVAL;
>  				}
>  
> +				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
> +					if (!bpf_global_percpu_ma_set) {
> +						mutex_lock(&bpf_percpu_ma_lock);
> +						if (!bpf_global_percpu_ma_set) {
> +							err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);

Because ma->objcg is assigned from get_obj_cgroup_from_current(), I
think the memory accounting will be incorrect, right? Maybe we should pass
objcg to bpf_mem_alloc_percpu_init() explicitly. For the root memcg, I think
the objcg is NULL.
> +							if (!err)
> +								bpf_global_percpu_ma_set = true;
> +						}
> +						mutex_unlock(&bpf_percpu_ma_lock);
> +						if (err)
> +							return err;
> +					}
> +
> +					mutex_lock(&bpf_percpu_ma_lock);
> +					err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
> +					mutex_unlock(&bpf_percpu_ma_lock);
> +					if (err)
> +						return err;
> +				}
> +
>  				struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
>  				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>  					if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {



* Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  6:26   ` Hou Tao
@ 2023-12-21  7:16     ` Yonghong Song
  2023-12-21  7:52       ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  7:16 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/20/23 10:26 PM, Hou Tao wrote:
> Hi,
>
> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
>> added support for non-fix-size percpu memory allocation.
>> Such allocation will allocate percpu memory for all buckets on all
>> cpus and the memory consumption is in the order to quadratic.
>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
>> Then let us say, 8 cpus with the same unit size, each cpu
>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
>> So if the number of cpus doubles, the number of memory consumption
>> will be 4 times. So for a system with large number of cpus, the
>> memory consumption goes up quickly with quadratic order.
>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>
>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>> done in boot time, so for system with large number of cpus, the initial
>> percpu memory consumption is very visible. For example, for 128 cpu
>> system, the total percpu memory allocation will be at least
>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>    * 128 * 128 = ~138MB.
>> which is pretty big. It will be even bigger for larger number of cpus.
> SNIP
>> +
>>   static void drain_mem_cache(struct bpf_mem_cache *c)
>>   {
>>   	bool percpu = !!c->percpu_size;
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f13008d27f35..08f9a49cc11c 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>   				if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>   					return -ENOMEM;
>>   
>> -				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>> -					if (!bpf_global_percpu_ma_set) {
>> -						mutex_lock(&bpf_percpu_ma_lock);
>> -						if (!bpf_global_percpu_ma_set) {
>> -							err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>> -							if (!err)
>> -								bpf_global_percpu_ma_set = true;
>> -						}
>> -						mutex_unlock(&bpf_percpu_ma_lock);
>> -						if (err)
>> -							return err;
>> -					}
>> -				}
>> -
>>   				if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
>>   					verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
>>   					return -EINVAL;
>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>   					return -EINVAL;
>>   				}
>>   
>> +				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>> +					if (!bpf_global_percpu_ma_set) {
>> +						mutex_lock(&bpf_percpu_ma_lock);
>> +						if (!bpf_global_percpu_ma_set) {
>> +							err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I
> think the memory account will be incorrect, right ? Maybe we should pass
> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think
> the objcg is NULL.

You are correct. Calling bpf_mem_alloc_percpu_init() at init stage
is exactly what gives objcg the proper root memcg. Sorry I missed it.

I remember tracing it a few days ago, and it is indeed NULL.
There are three ways to resolve this:
    1. Just do 'ma->objcg = NULL' unconditionally in bpf_mem_alloc_percpu_init().
    2. Remember the objcg at init stage, e.g., in the bpf_global_ma_init()
       init function (core.c), so that it can later be used in
       bpf_mem_alloc_percpu_init().
    3. Still call bpf_mem_alloc_percpu_init() at init stage to initialize
       ma->objcg properly, but delay __alloc_percpu_gfp() until the verifier
       finds a call to bpf_percpu_obj_new(). We could add a function
       bpf_mem_alloc_percpu_init_caches() to do the __alloc_percpu_gfp().

I prefer option 3, what do you think?
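
Option 3 amounts to a two-phase init. A rough userspace sketch, with all type and function names below as hypothetical stand-ins rather than kernel API, and calloc() standing in for the deferred __alloc_percpu_gfp():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct obj_cgroup and struct bpf_mem_alloc. */
struct objcg_stub { int id; };

struct percpu_ma_stub {
	struct objcg_stub *objcg;	/* recorded at init stage */
	void *caches;			/* allocated on first use */
	bool caches_ready;
};

/* Phase 1 (init stage): record the accounting context, allocate nothing. */
static void ma_init_stage(struct percpu_ma_stub *ma,
			  struct objcg_stub *init_objcg)
{
	assert(ma);
	ma->objcg = init_objcg;
	ma->caches = NULL;
	ma->caches_ready = false;
}

/* Phase 2 (first percpu allocation seen): allocate the caches once. */
static int ma_init_caches(struct percpu_ma_stub *ma, size_t bytes)
{
	assert(ma);
	if (ma->caches_ready)
		return 0;

	ma->caches = calloc(1, bytes);
	if (!ma->caches)
		return -ENOMEM;
	ma->caches_ready = true;
	return 0;
}
```

The objcg is fixed while the init-time context is still current, and the memory cost is paid only when a program actually needs percpu allocation. A real version would also need the locking shown in the verifier hunk.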

>> +							if (!err)
>> +								bpf_global_percpu_ma_set = true;
>> +						}
>> +						mutex_unlock(&bpf_percpu_ma_lock);
>> +						if (err)
>> +							return err;
>> +					}
>> +
>> +					mutex_lock(&bpf_percpu_ma_lock);
>> +					err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>> +					mutex_unlock(&bpf_percpu_ma_lock);
>> +					if (err)
>> +						return err;
>> +				}
>> +
>>   				struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
>>   				if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>   					if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
>


* Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  7:16     ` Yonghong Song
@ 2023-12-21  7:52       ` Yonghong Song
  2023-12-21  8:42         ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2023-12-21  7:52 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/20/23 11:16 PM, Yonghong Song wrote:
>
> On 12/20/23 10:26 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem 
>>> allocation")
>>> added support for non-fix-size percpu memory allocation.
>>> Such allocation will allocate percpu memory for all buckets on all
>>> cpus and the memory consumption is in the order to quadratic.
>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 
>>> bytes.
>>> Then let us say, 8 cpus with the same unit size, each cpu
>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 
>>> bytes.
>>> So if the number of cpus doubles, the number of memory consumption
>>> will be 4 times. So for a system with large number of cpus, the
>>> memory consumption goes up quickly with quadratic order.
>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>>
>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>> done in boot time, so for system with large number of cpus, the initial
>>> percpu memory consumption is very visible. For example, for 128 cpu
>>> system, the total percpu memory allocation will be at least
>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>>    * 128 * 128 = ~138MB.
>>> which is pretty big. It will be even bigger for larger number of cpus.
>> SNIP
>>> +
>>>   static void drain_mem_cache(struct bpf_mem_cache *c)
>>>   {
>>>       bool percpu = !!c->percpu_size;
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index f13008d27f35..08f9a49cc11c 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct 
>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>                   if (meta.func_id == 
>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>>                       return -ENOMEM;
>>>   -                if (meta.func_id == 
>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>> -                    if (!bpf_global_percpu_ma_set) {
>>> -                        mutex_lock(&bpf_percpu_ma_lock);
>>> -                        if (!bpf_global_percpu_ma_set) {
>>> -                            err = 
>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>>> -                            if (!err)
>>> -                                bpf_global_percpu_ma_set = true;
>>> -                        }
>>> - mutex_unlock(&bpf_percpu_ma_lock);
>>> -                        if (err)
>>> -                            return err;
>>> -                    }
>>> -                }
>>> -
>>>                   if (((u64)(u32)meta.arg_constant.value) != 
>>> meta.arg_constant.value) {
>>>                       verbose(env, "local type ID argument must be 
>>> in range [0, U32_MAX]\n");
>>>                       return -EINVAL;
>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct 
>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>                       return -EINVAL;
>>>                   }
>>>   +                if (meta.func_id == 
>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>> +                    if (!bpf_global_percpu_ma_set) {
>>> +                        mutex_lock(&bpf_percpu_ma_lock);
>>> +                        if (!bpf_global_percpu_ma_set) {
>>> +                            err = 
>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I
>> think the memory account will be incorrect, right ? Maybe we should pass
>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think
>> the objcg is NULL.
>
> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage
> is exactly the reason to have proper root memcg for objcg. Sorry I 
> missed it.
>
> I remembered I indeed traced it a few days ago and indeed it is NULL.
> There are three ways to resolve this:
>    1 Just do 'ma->objcg = NULL' unconditionally in 
> bpf_mem_alloc_percpu_init().
>    2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at 
> init stage,
>      e.g., in bpf_global_ma_init() init function (core.c), and later 
> it can
>      be used in bpf_mem_alloc_percpu_init().
>    3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize 
> ma->objcg
>      properly. But delay __alloc_percpu_gfp() later when verifier 
> found a call
>      to bpf_percpu_obj_new(). We could add a call 
> bpf_mem_alloc_percpu_init_caches()
>      to do __alloc_percpu_grp().
>
> I prefer option 3, what do you think?

Here is an option 4:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 984c83ecace9..f90989cc9cbc 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -122,6 +122,7 @@ struct bpf_mem_caches {
  };
  
  static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+static struct obj_cgroup *objcg_at_init __ro_after_init;
  
  static struct llist_node notrace *__llist_del_first(struct llist_head *head)
  {
@@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
         ma->percpu = true;
  
  #ifdef CONFIG_MEMCG_KMEM
-       ma->objcg = get_obj_cgroup_from_current();
+       ma->objcg = objcg_at_init;
  #else
         ma->objcg = NULL;
  #endif
@@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
  
         return !ret ? NULL : ret + LLIST_NODE_SZ;
  }
+
+static int __init find_objcg_at_init(void)
+{
+       objcg_at_init = get_obj_cgroup_from_current();
+       return 0;
+}
+late_initcall(find_objcg_at_init);

Does this seem better?

>
>>> +                            if (!err)
>>> +                                bpf_global_percpu_ma_set = true;
>>> +                        }
>>> + mutex_unlock(&bpf_percpu_ma_lock);
>>> +                        if (err)
>>> +                            return err;
>>> +                    }
>>> +
>>> +                    mutex_lock(&bpf_percpu_ma_lock);
>>> +                    err = 
>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>>> +                    mutex_unlock(&bpf_percpu_ma_lock);
>>> +                    if (err)
>>> +                        return err;
>>> +                }
>>> +
>>>                   struct_meta = btf_find_struct_meta(ret_btf, 
>>> ret_btf_id);
>>>                   if (meta.func_id == 
>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>                       if (!__btf_type_is_scalar_struct(env, ret_btf, 
>>> ret_t, 0)) {
>>
>


* Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  7:52       ` Yonghong Song
@ 2023-12-21  8:42         ` Hou Tao
  2023-12-21 16:53           ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2023-12-21  8:42 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/21/2023 3:52 PM, Yonghong Song wrote:
>
> On 12/20/23 11:16 PM, Yonghong Song wrote:
>>
>> On 12/20/23 10:26 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>> allocation")
>>>> added support for non-fix-size percpu memory allocation.
>>>> Such allocation will allocate percpu memory for all buckets on all
>>>> cpus and the memory consumption is in the order to quadratic.
>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256
>>>> bytes.
>>>> Then let us say, 8 cpus with the same unit size, each cpu
>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024
>>>> bytes.
>>>> So if the number of cpus doubles, the number of memory consumption
>>>> will be 4 times. So for a system with large number of cpus, the
>>>> memory consumption goes up quickly with quadratic order.
>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>>>
>>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>>> done in boot time, so for system with large number of cpus, the
>>>> initial
>>>> percpu memory consumption is very visible. For example, for 128 cpu
>>>> system, the total percpu memory allocation will be at least
>>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>>>    * 128 * 128 = ~138MB.
>>>> which is pretty big. It will be even bigger for larger number of cpus.
>>> SNIP
>>>> +
>>>>   static void drain_mem_cache(struct bpf_mem_cache *c)
>>>>   {
>>>>       bool percpu = !!c->percpu_size;
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index f13008d27f35..08f9a49cc11c 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct
>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>                   if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>>>                       return -ENOMEM;
>>>>   -                if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> -                    if (!bpf_global_percpu_ma_set) {
>>>> -                        mutex_lock(&bpf_percpu_ma_lock);
>>>> -                        if (!bpf_global_percpu_ma_set) {
>>>> -                            err =
>>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>>>> -                            if (!err)
>>>> -                                bpf_global_percpu_ma_set = true;
>>>> -                        }
>>>> - mutex_unlock(&bpf_percpu_ma_lock);
>>>> -                        if (err)
>>>> -                            return err;
>>>> -                    }
>>>> -                }
>>>> -
>>>>                   if (((u64)(u32)meta.arg_constant.value) !=
>>>> meta.arg_constant.value) {
>>>>                       verbose(env, "local type ID argument must be
>>>> in range [0, U32_MAX]\n");
>>>>                       return -EINVAL;
>>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct
>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>                       return -EINVAL;
>>>>                   }
>>>>   +                if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> +                    if (!bpf_global_percpu_ma_set) {
>>>> +                        mutex_lock(&bpf_percpu_ma_lock);
>>>> +                        if (!bpf_global_percpu_ma_set) {
>>>> +                            err =
>>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I
>>> think the memory account will be incorrect, right ? Maybe we should
>>> pass
>>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think
>>> the objcg is NULL.
>>
>> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage
>> is exactly the reason to have proper root memcg for objcg. Sorry I
>> missed it.
>>
>> I remembered I indeed traced it a few days ago and indeed it is NULL.
>> There are three ways to resolve this:
>>    1 Just do 'ma->objcg = NULL' unconditionally in
>> bpf_mem_alloc_percpu_init().
>>    2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at
>> init stage,
>>      e.g., in bpf_global_ma_init() init function (core.c), and later
>> it can
>>      be used in bpf_mem_alloc_percpu_init().
>>    3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize
>> ma->objcg
>>      properly. But delay __alloc_percpu_gfp() later when verifier
>> found a call
>>      to bpf_percpu_obj_new(). We could add a call
>> bpf_mem_alloc_percpu_init_caches()
>>      to do __alloc_percpu_grp().
>>
>> I prefer option 3, what do you think?
>
> The option 4 below:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 984c83ecace9..f90989cc9cbc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -122,6 +122,7 @@ struct bpf_mem_caches {
>  };
>  
>  static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256,
> 512, 1024, 2048, 4096};
> +static struct obj_cgroup *objcg_at_init __ro_after_init;
>  
>  static struct llist_node notrace *__llist_del_first(struct llist_head
> *head)
>  {
> @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc
> *ma)
>         ma->percpu = true;
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -       ma->objcg = get_obj_cgroup_from_current();
> +       ma->objcg = objcg_at_init;
>  #else
>         ma->objcg = NULL;
>  #endif
> @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct
> bpf_mem_alloc *ma, gfp_t flags)
>  
>         return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init find_objcg_at_init(void)
> +{
> +       objcg_at_init = get_obj_cgroup_from_current();
> +       return 0;
> +}
> +late_initcall(find_objcg_at_init);
>
> It seems this is better?

It seems that you are worried that the objcg of the root memcg may change
from NULL to something else one day, right? If that is the case, I think
both option 3 and option 4 are fine. But I still think passing the desired
objcg to bpf_mem_alloc_percpu_init() directly is better. If the objcg of the
root memcg is no longer NULL in the future, we can update the passed parameter
accordingly.
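
A minimal userspace sketch of the explicit-parameter variant, for illustration only. The struct and function names are hypothetical stand-ins, and the signature is an assumption about how the suggestion could look, not the posted patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct obj_cgroup and struct bpf_mem_alloc. */
struct objcg_sketch { int id; };

struct mem_alloc_sketch {
	struct objcg_sketch *objcg;
	bool percpu;
};

/*
 * The caller chooses the objcg instead of the init function looking it
 * up from the current task, so the accounting target no longer depends
 * on which task happens to trigger the lazy initialization.
 */
static int percpu_ma_init_with_objcg(struct mem_alloc_sketch *ma,
				     struct objcg_sketch *objcg)
{
	if (!ma)
		return -EINVAL;

	ma->percpu = true;
	ma->objcg = objcg;	/* NULL when the root memcg is intended */
	return 0;
}
```

A caller in the verifier path would pass NULL for the root memcg; if the root memcg's objcg ever stops being NULL, only that argument changes.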
>
>>
>>>> +                            if (!err)
>>>> +                                bpf_global_percpu_ma_set = true;
>>>> +                        }
>>>> + mutex_unlock(&bpf_percpu_ma_lock);
>>>> +                        if (err)
>>>> +                            return err;
>>>> +                    }
>>>> +
>>>> +                    mutex_lock(&bpf_percpu_ma_lock);
>>>> +                    err =
>>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>>>> +                    mutex_unlock(&bpf_percpu_ma_lock);
>>>> +                    if (err)
>>>> +                        return err;
>>>> +                }
>>>> +
>>>>                   struct_meta = btf_find_struct_meta(ret_btf,
>>>> ret_btf_id);
>>>>                   if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>                       if (!__btf_type_is_scalar_struct(env,
>>>> ret_btf, ret_t, 0)) {
>>>
>>



* Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  2023-12-21  8:42         ` Hou Tao
@ 2023-12-21 16:53           ` Yonghong Song
  0 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2023-12-21 16:53 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/21/23 12:42 AM, Hou Tao wrote:
> Hi,
>
> On 12/21/2023 3:52 PM, Yonghong Song wrote:
>> On 12/20/23 11:16 PM, Yonghong Song wrote:
>>> On 12/20/23 10:26 PM, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>>> allocation")
>>>>> added support for non-fix-size percpu memory allocation.
>>>>> Such allocation will allocate percpu memory for all buckets on all
>>>>> cpus and the memory consumption is in the order to quadratic.
>>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256
>>>>> bytes.
>>>>> Then let us say, 8 cpus with the same unit size, each cpu
>>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024
>>>>> bytes.
>>>>> So if the number of cpus doubles, the number of memory consumption
>>>>> will be 4 times. So for a system with large number of cpus, the
>>>>> memory consumption goes up quickly with quadratic order.
>>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>>>>
>>>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>>>> done in boot time, so for system with large number of cpus, the
>>>>> initial
>>>>> percpu memory consumption is very visible. For example, for 128 cpu
>>>>> system, the total percpu memory allocation will be at least
>>>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>>>>     * 128 * 128 = ~138MB.
>>>>> which is pretty big. It will be even bigger for larger number of cpus.
>>>> SNIP
>>>>> +
>>>>>    static void drain_mem_cache(struct bpf_mem_cache *c)
>>>>>    {
>>>>>        bool percpu = !!c->percpu_size;
>>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>>> index f13008d27f35..08f9a49cc11c 100644
>>>>> --- a/kernel/bpf/verifier.c
>>>>> +++ b/kernel/bpf/verifier.c
>>>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct
>>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>>                    if (meta.func_id ==
>>>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>>>>                        return -ENOMEM;
>>>>>    -                if (meta.func_id ==
>>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>> -                    if (!bpf_global_percpu_ma_set) {
>>>>> -                        mutex_lock(&bpf_percpu_ma_lock);
>>>>> -                        if (!bpf_global_percpu_ma_set) {
>>>>> -                            err =
>>>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>>>>> -                            if (!err)
>>>>> -                                bpf_global_percpu_ma_set = true;
>>>>> -                        }
>>>>> - mutex_unlock(&bpf_percpu_ma_lock);
>>>>> -                        if (err)
>>>>> -                            return err;
>>>>> -                    }
>>>>> -                }
>>>>> -
>>>>>                    if (((u64)(u32)meta.arg_constant.value) !=
>>>>> meta.arg_constant.value) {
>>>>>                        verbose(env, "local type ID argument must be
>>>>> in range [0, U32_MAX]\n");
>>>>>                        return -EINVAL;
>>>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct
>>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>>                        return -EINVAL;
>>>>>                    }
>>>>>    +                if (meta.func_id ==
>>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>> +                    if (!bpf_global_percpu_ma_set) {
>>>>> +                        mutex_lock(&bpf_percpu_ma_lock);
>>>>> +                        if (!bpf_global_percpu_ma_set) {
>>>>> +                            err =
>>>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>>>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I
>>>> think the memory account will be incorrect, right ? Maybe we should
>>>> pass
>>>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think
>>>> the objcg is NULL.
>>> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage
>>> is exactly the reason to have proper root memcg for objcg. Sorry I
>>> missed it.
>>>
>>> I remember that I traced it a few days ago and it is indeed NULL.
>>> There are three ways to resolve this:
>>>     1 Just do 'ma->objcg = NULL' unconditionally in
>>> bpf_mem_alloc_percpu_init().
>>>     2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at
>>> init stage,
>>>       e.g., in bpf_global_ma_init() init function (core.c), and later
>>> it can
>>>       be used in bpf_mem_alloc_percpu_init().
>>>     3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize
>>> ma->objcg
>>>       properly. But delay __alloc_percpu_gfp() later when verifier
>>> found a call
>>>       to bpf_percpu_obj_new(). We could add a call
>>> bpf_mem_alloc_percpu_init_caches()
>>>       to do __alloc_percpu_gfp().
>>>
>>> I prefer option 3, what do you think?
>> The option 4 below:
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 984c83ecace9..f90989cc9cbc 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -122,6 +122,7 @@ struct bpf_mem_caches {
>>   };
>>   
>>   static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256,
>> 512, 1024, 2048, 4096};
>> +static struct obj_cgroup *objcg_at_init __ro_after_init;
>>   
>>   static struct llist_node notrace *__llist_del_first(struct llist_head
>> *head)
>>   {
>> @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc
>> *ma)
>>          ma->percpu = true;
>>   
>>   #ifdef CONFIG_MEMCG_KMEM
>> -       ma->objcg = get_obj_cgroup_from_current();
>> +       ma->objcg = objcg_at_init;
>>   #else
>>          ma->objcg = NULL;
>>   #endif
>> @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct
>> bpf_mem_alloc *ma, gfp_t flags)
>>   
>>          return !ret ? NULL : ret + LLIST_NODE_SZ;
>>   }
>> +
>> +static int __init find_objcg_at_init(void)
>> +{
>> +       objcg_at_init = get_obj_cgroup_from_current();
>> +       return 0;
>> +}
>> +late_initcall(find_objcg_at_init);
>>
>> It seems this is better?
> It seems that you are worried that the objcg of the root memcg may change
> from NULL to something else one day, right ? If it is the case, I think
> both option 3 and 4 are fine. But I still think passing the desired

I am indeed a little worried that the objcg for the root memcg may change
to a non-NULL value.

Another option is to add a static inline function to memcontrol.h,
something like:

static inline struct obj_cgroup *get_obj_cgroup_from_root_memcg(void)
{
	return NULL;
}

But having such a helper function seems silly, since the result is
obvious from the current implementation.

> objcg to bpf_mem_alloc_percpu_init() directly is better. If the objcg of
> root memcg is not NULL afterwards, we can update the passed parameter
> accordingly.

So I guess what you suggested also makes sense. I can add a comment at
the call site of bpf_mem_alloc_percpu_init() to explain why the objcg is
NULL. In the really unlikely case that the objcg for the root memcg
changes, we can fix it accordingly.

Will send v6 to fix this later today.
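
To illustrate the direction, here is a minimal userspace sketch of
passing the objcg explicitly and documenting the NULL at the call site.
It is not the actual v6 change: the struct below is a simplified
stand-in, the two-argument init signature is an assumption, all names
are hypothetical, and the bpf_percpu_ma_lock serialization shown in the
patch is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types, for illustration only. */
struct obj_cgroup;
struct sketch_mem_alloc {
	struct obj_cgroup *objcg;
	bool percpu;
};

static struct sketch_mem_alloc sketch_global_percpu_ma;
static bool sketch_global_percpu_ma_set;
static int sketch_init_calls;

/*
 * Assumed signature: the caller supplies the objcg, instead of the
 * allocator reading it from the current task at init time.
 */
static int sketch_mem_alloc_percpu_init(struct sketch_mem_alloc *ma,
					struct obj_cgroup *objcg)
{
	assert(ma != NULL);
	ma->percpu = true;
	ma->objcg = objcg;
	sketch_init_calls++;
	return 0;
}

/* Lazy, one-time init, as done when the verifier sees the kfunc call. */
static int sketch_ensure_global_percpu_ma(void)
{
	int err;

	if (sketch_global_percpu_ma_set)
		return 0;

	/*
	 * Charge the global percpu allocator to the root memcg. The objcg
	 * of the root memcg is currently NULL, so pass NULL explicitly.
	 * If that ever changes, this argument is the one place to update.
	 */
	err = sketch_mem_alloc_percpu_init(&sketch_global_percpu_ma, NULL);
	if (!err)
		sketch_global_percpu_ma_set = true;
	return err;
}
```

Calling sketch_ensure_global_percpu_ma() repeatedly runs the init only
once and leaves objcg as NULL regardless of which task triggers it,
which is the accounting behavior discussed above.
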

>>>>> +                            if (!err)
>>>>> +                                bpf_global_percpu_ma_set = true;
>>>>> +                        }
>>>>> + mutex_unlock(&bpf_percpu_ma_lock);
>>>>> +                        if (err)
>>>>> +                            return err;
>>>>> +                    }
>>>>> +
>>>>> +                    mutex_lock(&bpf_percpu_ma_lock);
>>>>> +                    err =
>>>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>>>>> +                    mutex_unlock(&bpf_percpu_ma_lock);
>>>>> +                    if (err)
>>>>> +                        return err;
>>>>> +                }
>>>>> +
>>>>>                    struct_meta = btf_find_struct_meta(ret_btf,
>>>>> ret_btf_id);
>>>>>                    if (meta.func_id ==
>>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>>                        if (!__btf_type_is_scalar_struct(env,
>>>>> ret_btf, ret_t, 0)) {

end of thread, other threads:[~2023-12-21 16:53 UTC | newest]

Thread overview: 14+ messages
2023-12-21  4:59 [PATCH bpf-next v5 0/8] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 1/8] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 2/8] bpf: Add objcg to bpf_mem_alloc Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
2023-12-21  6:26   ` Hou Tao
2023-12-21  7:16     ` Yonghong Song
2023-12-21  7:52       ` Yonghong Song
2023-12-21  8:42         ` Hou Tao
2023-12-21 16:53           ` Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 4/8] bpf: Refill only one percpu element in memalloc Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 5/8] bpf: Use smaller low/high marks for percpu allocation Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
2023-12-21  5:00 ` [PATCH bpf-next v5 8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size Yonghong Song
