BPF List
* [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
@ 2022-12-12  0:37 Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 1/9] mm: Introduce active vm item Yafang Shao
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Currently there's no direct way to get the BPF memory usage; we can only
estimate it via bpftool or memcg, and neither of them is reliable.

- bpftool
  `bpftool {map,prog} show` can show us the memlock of each map and
  prog, but the memlock differs from the real memory size. The memlock
  of a bpf object is approximately
  `round_up(key_size + value_size, 8) * max_entries`,
  so 1) it doesn't apply to non-preallocated bpf maps, which may
  increase or decrease their real memory size dynamically; 2) the element
  size of some bpf maps is not `key_size + value_size`, for example the
  element size of htab is
  `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`.
  That means the difference between these two values can be very large
  when key_size and value_size are small. For example, in my verification
  the memlock and the real memory size of a preallocated hash map are,

  $ grep BPF /proc/meminfo
  BPF:             1026048 B <<< the size of preallocated memalloc pool

  (create hash map)

  $ bpftool map show
  3: hash  name count_map  flags 0x0
          key 4B  value 4B  max_entries 1048576  memlock 8388608B

  $ grep BPF /proc/meminfo
  BPF:            84919344 B
 
  So the real memory size is $((84919344 - 1026048)) = 83893296 bytes,
  while the memlock is only 8388608 bytes. A rough cross-check of these
  numbers follows the memcg item below.

- memcg  
  With memcg we only know that the BPF memory usage is less than
  memory.usage_in_bytes (or memory.current in v2). Furthermore, we only 
  know that the BPF memory usage is less than $MemTotal if the BPF
  object is charged into root memcg :)
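
As a rough cross-check of the bpftool numbers above (sizeof(struct
htab_elem) ~= 48 bytes and sizeof(struct bucket) ~= 16 bytes are
assumptions for x86_64, not values taken from this series):

  memlock = round_up(4 + 4, 8) * 1048576 = 8388608 B
  real   ~= (48 + 8 + 8) * 1048576       (element array, 67108864 B)
           + 16 * 1048576                (bucket array,  16777216 B)
          = 83886080 B

which is close to the 83893296 bytes observed above.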

So we need a way to get the BPF memory usage, especially as more and
more bpf programs are running in production environments. The BPF
memory usage is not trivial, and it deserves a new item in
/proc/meminfo.

This patchset introduces a solution to calculate the BPF memory usage.
This solution is similar to how memory is charged into memcg, so it is
easy to understand. It counts three types of memory usage -
 - page
   via kmalloc, vmalloc, kmem_cache_alloc or direct page allocation, and
   their families.
   When a page is allocated, we count its size and mark the head
   page, and then check the head page at page freeing.
 - slab
   via kmalloc, kmem_cache_alloc and their families.
   When a slab object is allocated, we mark this object in its
   slab and check it at slab object freeing. That means we need extra
   memory to store the information of each object in a slab.
 - percpu
   via alloc_percpu and its family.
   When a percpu area is allocated, we mark this area in its
   percpu chunk and check it at percpu area freeing. That means we need
   extra memory to store the information of each area in a percpu chunk.

So we only need to annotate the allocation to add to the BPF memory
size; the subtraction of the BPF memory size is handled automatically
at freeing. We can annotate it in irq, softirq or process context. To
avoid counting nested allocations, for example the percpu backing
allocator, we reuse __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT
also makes the count consistent with memcg accounting.
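
A minimal sketch of the annotation pattern (annotated_alloc_example() is
a made-up caller; active_vm_item_set() comes from patch 2 and
ACTIVE_VM_BPF from patch 9, while the real annotations live in the
bpf_map_*alloc() helpers and the bpf memory allocator touched by patch 9):

  /* Annotate only the allocation side; freeing is handled automatically. */
  static void *annotated_alloc_example(size_t size, gfp_t flags)
  {
          int old_item;
          void *ptr;

          /* Mark this context as allocating BPF memory. */
          old_item = active_vm_item_set(ACTIVE_VM_BPF);
          /* __GFP_ACCOUNT is what the accounting hooks key off. */
          ptr = kmalloc(size, flags | __GFP_ACCOUNT);
          /* Restore the previous item so nested allocations aren't counted. */
          active_vm_item_set(old_item);

          return ptr;
  }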

To store the information of a slab or a page, we need to create a new
member in struct page, but we can do it in page extension which can
avoid changing the size of struct page. So a new page extension
active_vm is introduced. Each page and each slab which is allocated as
BPF memory will have a struct active_vm. The reason it is named as
active_vm is that we can extend it to other areas easily, for example in
the future we may use it to count other memory usage. 

The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
compile time or kernel parameter `active_vm=` at runtime.
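For example, booting with `active_vm=0` disables it (the parameter is
parsed with strtobool() in patch 1, so other boolean spellings should
work as well).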

Below is the result of this patchset,

$ grep BPF /proc/meminfo
BPF:                1002 kB

Currently only bpf maps are supported, and only SLUB is supported.

Future works:
- support bpf prog
- not sure if it needs to support slab
  (it seems slab will be deprecated)
- support per-map memory usage
- support per-memcg memory usage

Yafang Shao (9):
  mm: Introduce active vm item
  mm: Allow using active vm in all contexts
  mm: percpu: Account active vm for percpu
  mm: slab: Account active vm for slab
  mm: Account active vm for page
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  bpf: Use active vm to account bpf map memory usage

 fs/proc/meminfo.c              |   3 +
 include/linux/active_vm.h      |  73 ++++++++++++
 include/linux/bpf.h            |   8 ++
 include/linux/page_ext.h       |   1 +
 include/linux/sched.h          |   5 +
 kernel/bpf/arraymap.c          |  16 +--
 kernel/bpf/bpf_local_storage.c |   4 +-
 kernel/bpf/memalloc.c          |   5 +
 kernel/bpf/ringbuf.c           |  75 ++++++++----
 kernel/bpf/syscall.c           |  40 ++++++-
 kernel/fork.c                  |   4 +
 mm/Kconfig                     |   8 ++
 mm/Makefile                    |   1 +
 mm/active_vm.c                 | 203 +++++++++++++++++++++++++++++++++
 mm/active_vm.h                 |  74 ++++++++++++
 mm/page_alloc.c                |  14 +++
 mm/page_ext.c                  |   4 +
 mm/percpu-internal.h           |   3 +
 mm/percpu.c                    |  43 +++++++
 mm/slab.h                      |   7 ++
 mm/slub.c                      |   2 +
 21 files changed, 557 insertions(+), 36 deletions(-)
 create mode 100644 include/linux/active_vm.h
 create mode 100644 mm/active_vm.c
 create mode 100644 mm/active_vm.h

-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 1/9] mm: Introduce active vm item
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 2/9] mm: Allow using active vm in all contexts Yafang Shao
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

A new page extension, active_vm, is introduced in this patch. It can be
enabled or disabled via CONFIG_ACTIVE_VM at compile time or the boot
parameter 'active_vm=' at run time.

In the follow-up patches, we will use it to account specific memory
allocations for page, slab and percpu memory.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/active_vm.h | 23 +++++++++++++++++++++++
 mm/Kconfig                |  8 ++++++++
 mm/Makefile               |  1 +
 mm/active_vm.c            | 33 +++++++++++++++++++++++++++++++++
 mm/active_vm.h            |  8 ++++++++
 mm/page_ext.c             |  4 ++++
 6 files changed, 77 insertions(+)
 create mode 100644 include/linux/active_vm.h
 create mode 100644 mm/active_vm.c
 create mode 100644 mm/active_vm.h

diff --git a/include/linux/active_vm.h b/include/linux/active_vm.h
new file mode 100644
index 000000000000..899e578e94fa
--- /dev/null
+++ b/include/linux/active_vm.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __INCLUDE_ACTIVE_VM_H
+#define __INCLUDE_ACTIVE_VM_H
+
+#ifdef CONFIG_ACTIVE_VM
+#include <linux/jump_label.h>
+
+extern struct static_key_true active_vm_disabled;
+
+static inline bool active_vm_enabled(void)
+{
+	if (static_branch_likely(&active_vm_disabled))
+		return false;
+
+	return true;
+}
+#else
+static inline bool active_vm_enabled(void)
+{
+	return false;
+}
+#endif /* CONFIG_ACTIVE_VM */
+#endif /* __INCLUDE_ACTIVE_VM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57e1d8c5b505..ba1087e4afff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1150,6 +1150,14 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config ACTIVE_VM
+	bool "Track the memory size of active VM items"
+	default y
+	depends on PAGE_EXTENSION
+	help
+	  Allow scope-based memory accounting for specific memory, e.g. the
+	  system-wide BPF memory usage.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..347dcff061d5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_ACTIVE_VM) += active_vm.o
diff --git a/mm/active_vm.c b/mm/active_vm.c
new file mode 100644
index 000000000000..60849930a7d3
--- /dev/null
+++ b/mm/active_vm.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/page_ext.h>
+
+static bool __active_vm_enabled __initdata =
+				IS_ENABLED(CONFIG_ACTIVE_VM);
+
+DEFINE_STATIC_KEY_TRUE(active_vm_disabled);
+EXPORT_SYMBOL(active_vm_disabled);
+
+static int __init early_active_vm_param(char *buf)
+{
+	return strtobool(buf, &__active_vm_enabled);
+}
+
+early_param("active_vm", early_active_vm_param);
+
+static bool __init need_active_vm(void)
+{
+	return __active_vm_enabled;
+}
+
+static void __init init_active_vm(void)
+{
+	if (!__active_vm_enabled)
+		return;
+
+	static_branch_disable(&active_vm_disabled);
+}
+
+struct page_ext_operations active_vm_ops = {
+	.need = need_active_vm,
+	.init = init_active_vm,
+};
diff --git a/mm/active_vm.h b/mm/active_vm.h
new file mode 100644
index 000000000000..72978955833e
--- /dev/null
+++ b/mm/active_vm.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __MM_ACTIVE_VM_H
+#define __MM_ACTIVE_VM_H
+
+#ifdef CONFIG_ACTIVE_VM
+extern struct page_ext_operations active_vm_ops;
+#endif /* CONFIG_ACTIVE_VM */
+#endif /* __MM_ACTIVE_VM_H */
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ddf1968560f0..3a3a91bc9e06 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
 #include <linux/page_idle.h>
 #include <linux/page_table_check.h>
 #include <linux/rcupdate.h>
+#include "active_vm.h"
 
 /*
  * struct page extension
@@ -84,6 +85,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
 #ifdef CONFIG_PAGE_TABLE_CHECK
 	&page_table_check_ops,
 #endif
+#ifdef CONFIG_ACTIVE_VM
+	&active_vm_ops,
+#endif
 };
 
 unsigned long page_ext_size = sizeof(struct page_ext);
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 2/9] mm: Allow using active vm in all contexts
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 1/9] mm: Introduce active vm item Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 3/9] mm: percpu: Account active vm for percpu Yafang Shao
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

We can use active vm in task, softirq and irq context to account
specific memory usage. New helpers active_vm_item_{set,sum}() and
active_vm_item_{add,sub}() are introduced.

A dummy item is introduced here, which will be removed when the real
item is introduced.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/active_vm.h | 50 +++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h     |  5 ++++
 kernel/fork.c             |  4 ++++
 mm/active_vm.c            | 23 ++++++++++++++++++
 mm/active_vm.h            | 38 +++++++++++++++++++++++++++++
 5 files changed, 120 insertions(+)

diff --git a/include/linux/active_vm.h b/include/linux/active_vm.h
index 899e578e94fa..21f9aaca12c4 100644
--- a/include/linux/active_vm.h
+++ b/include/linux/active_vm.h
@@ -4,6 +4,9 @@
 
 #ifdef CONFIG_ACTIVE_VM
 #include <linux/jump_label.h>
+#include <linux/preempt.h>
+#include <linux/percpu-defs.h>
+#include <linux/sched.h>
 
 extern struct static_key_true active_vm_disabled;
 
@@ -14,10 +17,57 @@ static inline bool active_vm_enabled(void)
 
 	return true;
 }
+
+enum active_vm_item {
+	DUMMY_ITEM = 1,
+	NR_ACTIVE_VM_ITEM = DUMMY_ITEM,
+};
+
+struct active_vm_stat {
+	long stat[NR_ACTIVE_VM_ITEM];
+};
+
+DECLARE_PER_CPU(struct active_vm_stat, active_vm_stats);
+DECLARE_PER_CPU(int, irq_active_vm_item);
+DECLARE_PER_CPU(int, soft_active_vm_item);
+
+static inline int
+active_vm_item_set(int item)
+{
+	int old_item;
+
+	if (in_irq()) {
+		old_item = this_cpu_read(irq_active_vm_item);
+		this_cpu_write(irq_active_vm_item, item);
+	} else if (in_softirq()) {
+		old_item = this_cpu_read(soft_active_vm_item);
+		this_cpu_write(soft_active_vm_item, item);
+	} else {
+		old_item = current->active_vm_item;
+		current->active_vm_item = item;
+	}
+
+	return old_item;
+}
+
+long active_vm_item_sum(int item);
+
 #else
 static inline bool active_vm_enabled(void)
 {
 	return false;
 }
+
+static inline int
+active_vm_item_set(int item)
+{
+	return 0;
+}
+
+static inline long active_vm_item_sum(int item)
+{
+	return 0;
+}
+
 #endif /* CONFIG_ACTIVE_VM */
 #endif /* __INCLUDE_ACTIVE_VM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..05acefd383d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1441,6 +1441,11 @@ struct task_struct {
 	struct mem_cgroup		*active_memcg;
 #endif
 
+#ifdef CONFIG_ACTIVE_VM
+	/* Used for scope-based memory accounting */
+	int				active_vm_item;
+#endif
+
 #ifdef CONFIG_BLK_CGROUP
 	struct request_queue		*throttle_queue;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..590d949ff131 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1043,6 +1043,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->active_memcg = NULL;
 #endif
 
+#ifdef CONFIG_ACTIVE_VM
+	tsk->active_vm_item = 0;
+#endif
+
 #ifdef CONFIG_CPU_SUP_INTEL
 	tsk->reported_split_lock = 0;
 #endif
diff --git a/mm/active_vm.c b/mm/active_vm.c
index 60849930a7d3..541b2ba22da9 100644
--- a/mm/active_vm.c
+++ b/mm/active_vm.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/page_ext.h>
+#include <linux/active_vm.h>
 
 static bool __active_vm_enabled __initdata =
 				IS_ENABLED(CONFIG_ACTIVE_VM);
@@ -31,3 +32,25 @@ struct page_ext_operations active_vm_ops = {
 	.need = need_active_vm,
 	.init = init_active_vm,
 };
+
+DEFINE_PER_CPU(int, irq_active_vm_item);
+DEFINE_PER_CPU(int, soft_active_vm_item);
+EXPORT_PER_CPU_SYMBOL(irq_active_vm_item);
+EXPORT_PER_CPU_SYMBOL(soft_active_vm_item);
+DEFINE_PER_CPU(struct active_vm_stat, active_vm_stats);
+EXPORT_PER_CPU_SYMBOL(active_vm_stats);
+
+long active_vm_item_sum(int item)
+{
+	struct active_vm_stat *this;
+	long sum = 0;
+	int cpu;
+
+	WARN_ON_ONCE(item <= 0);
+	for_each_online_cpu(cpu) {
+		this = &per_cpu(active_vm_stats, cpu);
+		sum += this->stat[item - 1];
+	}
+
+	return sum;
+}
diff --git a/mm/active_vm.h b/mm/active_vm.h
index 72978955833e..1df088d768ef 100644
--- a/mm/active_vm.h
+++ b/mm/active_vm.h
@@ -3,6 +3,44 @@
 #define __MM_ACTIVE_VM_H
 
 #ifdef CONFIG_ACTIVE_VM
+#include <linux/active_vm.h>
+
 extern struct page_ext_operations active_vm_ops;
+
+static inline int active_vm_item(void)
+{
+	if (in_irq())
+		return this_cpu_read(irq_active_vm_item);
+
+	if (in_softirq())
+		return this_cpu_read(soft_active_vm_item);
+
+	return current->active_vm_item;
+}
+
+static inline void active_vm_item_add(int item, long delta)
+{
+	WARN_ON_ONCE(item <= 0);
+	this_cpu_add(active_vm_stats.stat[item - 1], delta);
+}
+
+static inline void active_vm_item_sub(int item, long delta)
+{
+	WARN_ON_ONCE(item <= 0);
+	this_cpu_sub(active_vm_stats.stat[item - 1], delta);
+}
+#else /* CONFIG_ACTIVE_VM */
+static inline int active_vm_item(void)
+{
+	return 0;
+}
+
+static inline void active_vm_item_add(int item, long delta)
+{
+}
+
+static inline void active_vm_item_sub(int item, long delta)
+{
+}
 #endif /* CONFIG_ACTIVE_VM */
 #endif /* __MM_ACTIVE_VM_H */
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 3/9] mm: percpu: Account active vm for percpu
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 1/9] mm: Introduce active vm item Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 2/9] mm: Allow using active vm in all contexts Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 4/9] mm: slab: Account active vm for slab Yafang Shao
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Account percpu allocations when the active vm item is set. The percpu
memory is accounted at percpu alloc and unaccounted at percpu free. To
record which part of a percpu chunk is covered by active vm, we have to
allocate extra memory for the chunk to store the active vm information.
This extra memory is freed when the chunk is freed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/percpu-internal.h |  3 +++
 mm/percpu.c          | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 70b1ea23f4d2..f56e236a2cf3 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -63,6 +63,9 @@ struct pcpu_chunk {
 	int			nr_pages;	/* # of pages served by this chunk */
 	int			nr_populated;	/* # of populated pages */
 	int                     nr_empty_pop_pages; /* # of empty populated pages */
+#ifdef CONFIG_ACTIVE_VM
+	int			*active_vm;	/* vector of active vm items */
+#endif
 	unsigned long		populated[];	/* populated bitmap */
 };
 
diff --git a/mm/percpu.c b/mm/percpu.c
index 27697b2429c2..05858981ed4a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -98,6 +98,7 @@
 #include <trace/events/percpu.h>
 
 #include "percpu-internal.h"
+#include "active_vm.h"
 
 /*
  * The slots are sorted by the size of the biggest continuous free area.
@@ -1398,6 +1399,9 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 #ifdef CONFIG_MEMCG_KMEM
 	/* first chunk is free to use */
 	chunk->obj_cgroups = NULL;
+#endif
+#ifdef CONFIG_ACTIVE_VM
+	chunk->active_vm = NULL;
 #endif
 	pcpu_init_md_blocks(chunk);
 
@@ -1476,6 +1480,14 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 	}
 #endif
 
+#ifdef CONFIG_ACTIVE_VM
+	if (active_vm_enabled()) {
+		chunk->active_vm = pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
+								sizeof(int), gfp);
+		if (!chunk->active_vm)
+			goto active_vm_fail;
+	}
+#endif
 	pcpu_init_md_blocks(chunk);
 
 	/* init metadata */
@@ -1483,6 +1495,12 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
 	return chunk;
 
+#ifdef CONFIG_ACTIVE_VM
+active_vm_fail:
+#ifdef CONFIG_MEMCG_KMEM
+	pcpu_mem_free(chunk->obj_cgroups);
+#endif
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 objcg_fail:
 	pcpu_mem_free(chunk->md_blocks);
@@ -1501,6 +1519,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+#ifdef CONFIG_ACTIVE_VM
+	pcpu_mem_free(chunk->active_vm);
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 	pcpu_mem_free(chunk->obj_cgroups);
 #endif
@@ -1890,6 +1911,17 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
 
+#ifdef CONFIG_ACTIVE_VM
+	if (active_vm_enabled() && chunk->active_vm && (gfp & __GFP_ACCOUNT)) {
+		int item = active_vm_item();
+
+		if (item > 0) {
+			chunk->active_vm[off >> PCPU_MIN_ALLOC_SHIFT] = item;
+			active_vm_item_add(item, size);
+		}
+	}
+#endif
+
 	return ptr;
 
 fail_unlock:
@@ -2283,6 +2315,17 @@ void free_percpu(void __percpu *ptr)
 
 	pcpu_memcg_free_hook(chunk, off, size);
 
+#ifdef CONFIG_ACTIVE_VM
+	if (active_vm_enabled() && chunk->active_vm) {
+		int item = chunk->active_vm[off >> PCPU_MIN_ALLOC_SHIFT];
+
+		if (item > 0) {
+			active_vm_item_sub(item, size);
+			chunk->active_vm[off >> PCPU_MIN_ALLOC_SHIFT] = 0;
+		}
+	}
+#endif
+
 	/*
 	 * If there are more than one fully free chunks, wake up grim reaper.
 	 * If the chunk is isolated, it may be in the process of being
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 4/9] mm: slab: Account active vm for slab
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (2 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 3/9] mm: percpu: Account active vm for percpu Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 5/9] mm: Account active vm for page Yafang Shao
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

When a slab object is allocated, we mark this object in its slab and
check it at slab object freeing. That means we need extra memory to
store the information of each object in a slab. This per-object
information can be stored in the newly introduced page extension
active_vm, so a new member is added to struct active_vm.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/active_vm.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/active_vm.h |  16 +++++++
 mm/slab.h      |   7 ++++
 mm/slub.c      |   2 +
 4 files changed, 136 insertions(+)

diff --git a/mm/active_vm.c b/mm/active_vm.c
index 541b2ba22da9..ee38047a4adc 100644
--- a/mm/active_vm.c
+++ b/mm/active_vm.c
@@ -1,6 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
 #include <linux/page_ext.h>
 #include <linux/active_vm.h>
+#include <linux/slab.h>
+
+#include "active_vm.h"
+#include "slab.h"
 
 static bool __active_vm_enabled __initdata =
 				IS_ENABLED(CONFIG_ACTIVE_VM);
@@ -28,7 +33,12 @@ static void __init init_active_vm(void)
 	static_branch_disable(&active_vm_disabled);
 }
 
+struct active_vm {
+	int *slab_data;     /* for slab */
+};
+
 struct page_ext_operations active_vm_ops = {
+	.size = sizeof(struct active_vm),
 	.need = need_active_vm,
 	.init = init_active_vm,
 };
@@ -54,3 +64,104 @@ long active_vm_item_sum(int item)
 
 	return sum;
 }
+
+static int *active_vm_from_slab(struct page_ext *page_ext)
+{
+	struct active_vm *av;
+
+	if (static_branch_likely(&active_vm_disabled))
+		return NULL;
+
+	av = (void *)(page_ext) + active_vm_ops.offset;
+	return READ_ONCE(av->slab_data);
+}
+
+void active_vm_slab_free(struct slab *slab)
+{
+	struct page_ext *page_ext;
+	struct active_vm *av;
+	struct page *page;
+
+	page = slab_page(slab);
+	page_ext = page_ext_get(page);
+	if (!page_ext)
+		return;
+
+	av = (void *)(page_ext) + active_vm_ops.offset;
+	kfree(av->slab_data);
+	av->slab_data = NULL;
+	page_ext_put(page_ext);
+}
+
+static bool active_vm_slab_cmpxchg(struct page_ext *page_ext, int *new)
+{
+	struct active_vm *av;
+
+	av = (void *)(page_ext) + active_vm_ops.offset;
+	return cmpxchg(&av->slab_data, NULL, new) == NULL;
+}
+
+void active_vm_slab_add(struct kmem_cache *s, gfp_t flags, size_t size, void **p)
+{
+	struct page_ext *page_ext;
+	struct slab *slab;
+	struct page *page;
+	int *vec;
+	int item;
+	int off;
+	int i;
+
+	item = active_vm_item();
+	for (i = 0; i < size; i++) {
+		slab = virt_to_slab(p[i]);
+		page = slab_page(slab);
+		page_ext = page_ext_get(page);
+
+		if (!page_ext)
+			continue;
+
+		off = obj_to_index(s, slab, p[i]);
+		vec = active_vm_from_slab(page_ext);
+		if (!vec) {
+			vec = kcalloc_node(objs_per_slab(s, slab), sizeof(int),
+						flags & ~__GFP_ACCOUNT, slab_nid(slab));
+			if (!vec) {
+				page_ext_put(page_ext);
+				continue;
+			}
+
+			if (!active_vm_slab_cmpxchg(page_ext, vec)) {
+				kfree(vec);
+				vec = active_vm_from_slab(page_ext);
+			}
+		}
+
+		vec[off] = item;
+		active_vm_item_add(item, obj_full_size(s));
+		page_ext_put(page_ext);
+	}
+}
+
+void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, void **p, int cnt)
+{
+	struct page *page = slab_page(slab);
+	struct page_ext *page_ext = page_ext_get(page);
+	int *vec;
+	int off;
+	int i;
+
+	if (!page_ext)
+		return;
+
+	for (i = 0; i < cnt; i++) {
+		vec = active_vm_from_slab(page_ext);
+		if (vec) {
+			off = obj_to_index(s, slab, p[i]);
+			if (vec[off] > 0) {
+				active_vm_item_sub(vec[off], obj_full_size(s));
+				vec[off] = 0;
+			}
+		}
+	}
+	page_ext_put(page_ext);
+}
diff --git a/mm/active_vm.h b/mm/active_vm.h
index 1df088d768ef..cf80b35412c5 100644
--- a/mm/active_vm.h
+++ b/mm/active_vm.h
@@ -4,8 +4,12 @@
 
 #ifdef CONFIG_ACTIVE_VM
 #include <linux/active_vm.h>
+#include <linux/page_ext.h>
 
 extern struct page_ext_operations active_vm_ops;
+void active_vm_slab_add(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, void **p, int cnt);
+void active_vm_slab_free(struct slab *slab);
 
 static inline int active_vm_item(void)
 {
@@ -42,5 +46,17 @@ static inline void active_vm_item_add(int item, long delta)
 static inline void active_vm_item_sub(int item, long delta)
 {
 }
+
+static inline void active_vm_slab_add(struct kmem_cache *s, gfp_t flags, size_t size, void **p)
+{
+}
+
+static inline void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, void **p, int cnt)
+{
+}
+
+static inline void active_vm_slab_free(struct slab *slab)
+{
+}
 #endif /* CONFIG_ACTIVE_VM */
 #endif /* __MM_ACTIVE_VM_H */
diff --git a/mm/slab.h b/mm/slab.h
index 0202a8c2f0d2..e8a4c16c29cb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -232,6 +232,8 @@ struct kmem_cache {
 #include <linux/random.h>
 #include <linux/sched/mm.h>
 #include <linux/list_lru.h>
+#include <linux/active_vm.h>
+#include "active_vm.h"
 
 /*
  * State of the slab allocator.
@@ -644,6 +646,9 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 	if (memcg_kmem_enabled())
 		memcg_free_slab_cgroups(slab);
 
+	if (active_vm_enabled())
+		active_vm_slab_free(slab);
+
 	mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -742,6 +747,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 		kmsan_slab_alloc(s, p[i], flags);
 	}
 
+	if (active_vm_enabled() && (flags & __GFP_ACCOUNT) && active_vm_item() > 0)
+		active_vm_slab_add(s, flags, size, p);
 	memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..21bd714c1182 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -45,6 +45,7 @@
 #include <trace/events/kmem.h>
 
 #include "internal.h"
+#include "active_vm.h"
 
 /*
  * Lock order:
@@ -3654,6 +3655,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
 				      unsigned long addr)
 {
 	memcg_slab_free_hook(s, slab, p, cnt);
+	active_vm_slab_sub(s, slab, p, cnt);
 	/*
 	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
 	 * to remove objects, whose reuse must be delayed.
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 5/9] mm: Account active vm for page
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (3 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 4/9] mm: slab: Account active vm for slab Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 6/9] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Account active vm at page allocation and unaccount it when the page is
freed. We reuse the storage of slab_data in struct active_vm (by putting
it in a union with a new page_data member) to store the information of
the page allocation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/page_ext.h |  1 +
 mm/active_vm.c           | 38 +++++++++++++++++++++++++++++++++++++-
 mm/active_vm.h           | 12 ++++++++++++
 mm/page_alloc.c          | 14 ++++++++++++++
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 22be4582faae..5d02f939d5df 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/stacktrace.h>
 #include <linux/stackdepot.h>
+#include <linux/active_vm.h>
 
 struct pglist_data;
 struct page_ext_operations {
diff --git a/mm/active_vm.c b/mm/active_vm.c
index ee38047a4adc..a06987639e00 100644
--- a/mm/active_vm.c
+++ b/mm/active_vm.c
@@ -34,7 +34,10 @@ static void __init init_active_vm(void)
 }
 
 struct active_vm {
-	int *slab_data;     /* for slab */
+	union {
+		int *slab_data;     /* for slab */
+		unsigned long page_data;	/* for page */
+	};
 };
 
 struct page_ext_operations active_vm_ops = {
@@ -165,3 +168,36 @@ void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, void **p, int c
 	}
 	page_ext_put(page_ext);
 }
+
+void page_set_active_vm(struct page *page, unsigned int item, unsigned int order)
+{
+	struct page_ext *page_ext = page_ext_get(page);
+	struct active_vm *av;
+
+	if (unlikely(!page_ext))
+		return;
+
+	av = (void *)(page_ext) + active_vm_ops.offset;
+	WARN_ON_ONCE(av->page_data != 0);
+	av->page_data = item;
+	page_ext_put(page_ext);
+	active_vm_item_add(item, PAGE_SIZE << order);
+}
+
+void page_test_clear_active_vm(struct page *page, unsigned int order)
+{
+	struct page_ext *page_ext = page_ext_get(page);
+	struct active_vm *av;
+
+	if (unlikely(!page_ext))
+		return;
+
+	av = (void *)(page_ext) + active_vm_ops.offset;
+	if (av->page_data <= 0)
+		goto out;
+
+	active_vm_item_sub(av->page_data, PAGE_SIZE << order);
+	av->page_data = 0;
+out:
+	page_ext_put(page_ext);
+}
diff --git a/mm/active_vm.h b/mm/active_vm.h
index cf80b35412c5..1ff27b0b5dbe 100644
--- a/mm/active_vm.h
+++ b/mm/active_vm.h
@@ -10,6 +10,8 @@ extern struct page_ext_operations active_vm_ops;
 void active_vm_slab_add(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
 void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, void **p, int cnt);
 void active_vm_slab_free(struct slab *slab);
+void page_set_active_vm(struct page *page, unsigned int item, unsigned int order);
+void page_test_clear_active_vm(struct page *page, unsigned int order);
 
 static inline int active_vm_item(void)
 {
@@ -33,6 +35,7 @@ static inline void active_vm_item_sub(int item, long delta)
 	WARN_ON_ONCE(item <= 0);
 	this_cpu_sub(active_vm_stats.stat[item - 1], delta);
 }
+
 #else /* CONFIG_ACTIVE_VM */
 static inline int active_vm_item(void)
 {
@@ -58,5 +61,14 @@ static inline void active_vm_slab_sub(struct kmem_cache *s, struct slab *slab, v
 static inline void active_vm_slab_free(struct slab *slab)
 {
 }
+
+static inline void page_set_active_vm(struct page *page, unsigned int item,
+				       unsigned int order)
+{
+}
+
+static inline void page_test_clear_active_vm(struct page *page, unsigned int order)
+{
+}
 #endif /* CONFIG_ACTIVE_VM */
 #endif /* __MM_ACTIVE_VM_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e60657875d3..deac544e9050 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -76,6 +76,8 @@
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
 #include <linux/delayacct.h>
+#include <linux/page_ext.h>
+#include <linux/active_vm.h>
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -83,6 +85,7 @@
 #include "shuffle.h"
 #include "page_reporting.h"
 #include "swap.h"
+#include "active_vm.h"
 
 /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
 typedef int __bitwise fpi_t;
@@ -1449,6 +1452,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
 		page->mapping = NULL;
 	if (memcg_kmem_enabled() && PageMemcgKmem(page))
 		__memcg_kmem_uncharge_page(page, order);
+
+	if (active_vm_enabled())
+		page_test_clear_active_vm(page, order);
+
 	if (check_free && free_page_is_bad(page))
 		bad++;
 	if (bad)
@@ -5577,6 +5584,13 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 		page = NULL;
 	}
 
+	if (active_vm_enabled() && (gfp & __GFP_ACCOUNT) && page) {
+		int active_vm = active_vm_item();
+
+		if (active_vm > 0)
+			page_set_active_vm(page, active_vm, order);
+	}
+
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 	kmsan_alloc_page(page, order, alloc_gfp);
 
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 6/9] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (4 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 5/9] mm: Account active vm for page Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 7/9] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Move the allocation of the ringbuf pages into a new helper,
bpf_ringbuf_pages_alloc(), so that it can be handled as a single unit.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/ringbuf.c | 71 +++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 80f4b4d88aaf..3264bf509c68 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -92,6 +92,48 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
+static void bpf_ringbuf_pages_free(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		__free_page(pages[i]);
+	bpf_map_area_free(pages);
+}
+
+static struct page **bpf_ringbuf_pages_alloc(int nr_meta_pages,
+						int nr_data_pages, int numa_node,
+						const gfp_t flags)
+{
+	int nr_pages = nr_meta_pages + nr_data_pages;
+	struct page **pages, *page;
+	int array_size;
+	int i;
+
+	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
+	pages = bpf_map_area_alloc(array_size, numa_node);
+	if (!pages)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_pages_node(numa_node, flags, 0);
+		if (!page) {
+			nr_pages = i;
+			goto err_free_pages;
+		}
+		pages[i] = page;
+		if (i >= nr_meta_pages)
+			pages[nr_data_pages + i] = page;
+	}
+
+	return pages;
+
+err_free_pages:
+	bpf_ringbuf_pages_free(pages, nr_pages);
+err:
+	return NULL;
+}
+
 static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 {
 	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
@@ -99,10 +141,8 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
-	struct page **pages, *page;
 	struct bpf_ringbuf *rb;
-	size_t array_size;
-	int i;
+	struct page **pages;
 
 	/* Each data page is mapped twice to allow "virtual"
 	 * continuous read of samples wrapping around the end of ring
@@ -121,22 +161,11 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	 * when mmap()'ed in user-space, simplifying both kernel and
 	 * user-space implementations significantly.
 	 */
-	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node);
+	pages = bpf_ringbuf_pages_alloc(nr_meta_pages, nr_data_pages,
+									numa_node, flags);
 	if (!pages)
 		return NULL;
 
-	for (i = 0; i < nr_pages; i++) {
-		page = alloc_pages_node(numa_node, flags, 0);
-		if (!page) {
-			nr_pages = i;
-			goto err_free_pages;
-		}
-		pages[i] = page;
-		if (i >= nr_meta_pages)
-			pages[nr_data_pages + i] = page;
-	}
-
 	rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
 		  VM_MAP | VM_USERMAP, PAGE_KERNEL);
 	if (rb) {
@@ -146,10 +175,6 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 		return rb;
 	}
 
-err_free_pages:
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages);
 	return NULL;
 }
 
@@ -219,12 +244,10 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	 * to unmap rb itself with vunmap() below
 	 */
 	struct page **pages = rb->pages;
-	int i, nr_pages = rb->nr_pages;
+	int nr_pages = rb->nr_pages;
 
 	vunmap(rb);
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_ringbuf_pages_free(pages, nr_pages);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 7/9] bpf: Use bpf_map_kzalloc in arraymap
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (5 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 6/9] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 8/9] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Allocate the bpf_array_aux memory after map creation, so that we can
use the generic helper bpf_map_kzalloc() instead of the open-coded
kzalloc().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 484706959556..e64a4178d92d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1102,20 +1102,20 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
 	struct bpf_array_aux *aux;
 	struct bpf_map *map;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
-	if (!aux)
+	map = array_map_alloc(attr);
+	if (IS_ERR(map))
 		return ERR_PTR(-ENOMEM);
 
+	aux = bpf_map_kzalloc(map, sizeof(*aux), GFP_KERNEL);
+	if (!aux) {
+		array_map_free(map);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	INIT_WORK(&aux->work, prog_array_map_clear_deferred);
 	INIT_LIST_HEAD(&aux->poke_progs);
 	mutex_init(&aux->poke_mutex);
 
-	map = array_map_alloc(attr);
-	if (IS_ERR(map)) {
-		kfree(aux);
-		return map;
-	}
-
 	container_of(map, struct bpf_array, map)->aux = aux;
 	aux->map = map;
 
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 8/9] bpf: Use bpf_map_kvcalloc in bpf_local_storage
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (6 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 7/9] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
  2022-12-12  0:37 ` [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage Yafang Shao
  2022-12-12 17:54 ` [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Vlastimil Babka
  9 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Introduce a new helper, bpf_map_kvcalloc(), and use it for the bucket
allocation in bpf_local_storage, so that the memory is charged to the
map's memcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  8 ++++++++
 kernel/bpf/bpf_local_storage.c |  4 ++--
 kernel/bpf/syscall.c           | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3de24cfb7a3d..80aca1ef8cc5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1873,6 +1873,8 @@ struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node);
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags);
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags);
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags);
 #else
@@ -1889,6 +1891,12 @@ bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	return kzalloc(size, flags);
 }
 
+static inline void *
+bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, gfp_t flags)
+{
+	return kvcalloc(n, size, flags);
+}
+
 static inline void __percpu *
 bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 		     gfp_t flags)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index b39a46e8fb08..0e43cecfdc07 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -568,8 +568,8 @@ static struct bpf_local_storage_map *__bpf_local_storage_map_alloc(union bpf_att
 	nbuckets = max_t(u32, 2, nbuckets);
 	smap->bucket_log = ilog2(nbuckets);
 
-	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
-				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	smap->buckets = bpf_map_kvcalloc(&smap->map, sizeof(*smap->buckets),
+					 nbuckets, GFP_USER | __GFP_NOWARN);
 	if (!smap->buckets) {
 		bpf_map_area_free(smap);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35972afb6850..c38875d6aea4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -470,6 +470,21 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	return ptr;
 }
 
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags)
+{
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
+	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
+
+	return ptr;
+}
+
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags)
 {
-- 
2.30.1 (Apple Git-130)



* [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (7 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 8/9] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
@ 2022-12-12  0:37 ` Yafang Shao
       [not found]   ` <202212141512.469bca4-yujie.liu@intel.com>
  2022-12-12 17:54 ` [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Vlastimil Babka
  9 siblings, 1 reply; 19+ messages in thread
From: Yafang Shao @ 2022-12-12  0:37 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf, Yafang Shao

Memory allocated via bpf_map_*alloc() or the bpf memory allocator is
now accounted as active vm. We only need to annotate the allocations;
the memory is automatically unaccounted when it is freed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/proc/meminfo.c         |  3 +++
 include/linux/active_vm.h | 10 +++++-----
 kernel/bpf/memalloc.c     |  5 +++++
 kernel/bpf/ringbuf.c      |  8 ++++++--
 kernel/bpf/syscall.c      | 25 +++++++++++++++++++++++--
 5 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 440960110a42..efe1fbd6a80e 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include <linux/cma.h>
 #endif
+#include <linux/active_vm.h>
 #include <asm/page.h>
 #include "internal.h"
 
@@ -159,6 +160,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	arch_report_meminfo(m);
 
+	seq_printf(m,  "BPF:            %8lu kB\n",
+			active_vm_item_sum(ACTIVE_VM_BPF) >> 10);
 	return 0;
 }
 
diff --git a/include/linux/active_vm.h b/include/linux/active_vm.h
index 21f9aaca12c4..e26edfb3654e 100644
--- a/include/linux/active_vm.h
+++ b/include/linux/active_vm.h
@@ -2,6 +2,11 @@
 #ifndef __INCLUDE_ACTIVE_VM_H
 #define __INCLUDE_ACTIVE_VM_H
 
+enum active_vm_item {
+	ACTIVE_VM_BPF = 1,
+	NR_ACTIVE_VM_ITEM = ACTIVE_VM_BPF,
+};
+
 #ifdef CONFIG_ACTIVE_VM
 #include <linux/jump_label.h>
 #include <linux/preempt.h>
@@ -18,11 +23,6 @@ static inline bool active_vm_enabled(void)
 	return true;
 }
 
-enum active_vm_item {
-	DUMMY_ITEM = 1,
-	NR_ACTIVE_VM_ITEM = DUMMY_ITEM,
-};
-
 struct active_vm_stat {
 	long stat[NR_ACTIVE_VM_ITEM];
 };
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ebcc3dd0fa19..403ae0d83241 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -7,6 +7,8 @@
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
 #include <asm/local.h>
+#include <linux/page_ext.h>
+#include <linux/active_vm.h>
 
 /* Any context (including NMI) BPF specific memory allocator.
  *
@@ -165,11 +167,13 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 {
 	struct mem_cgroup *memcg = NULL, *old_memcg;
 	unsigned long flags;
+	int old_active_vm;
 	void *obj;
 	int i;
 
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	for (i = 0; i < cnt; i++) {
 		/*
 		 * free_by_rcu is only manipulated by irq work refill_work().
@@ -209,6 +213,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
 	}
+	active_vm_item_set(old_active_vm);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
 }
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 3264bf509c68..7575f078eb34 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -11,6 +11,7 @@
 #include <linux/kmemleak.h>
 #include <uapi/linux/btf.h>
 #include <linux/btf_ids.h>
+#include <linux/active_vm.h>
 
 #define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
 
@@ -107,16 +108,18 @@ static struct page **bpf_ringbuf_pages_alloc(int nr_meta_pages,
 {
 	int nr_pages = nr_meta_pages + nr_data_pages;
 	struct page **pages, *page;
+	int old_active_vm;
 	int array_size;
 	int i;
 
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
 	pages = bpf_map_area_alloc(array_size, numa_node);
 	if (!pages)
 		goto err;
 
 	for (i = 0; i < nr_pages; i++) {
-		page = alloc_pages_node(numa_node, flags, 0);
+		page = alloc_pages_node(numa_node, flags | __GFP_ACCOUNT, 0);
 		if (!page) {
 			nr_pages = i;
 			goto err_free_pages;
@@ -125,12 +128,13 @@ static struct page **bpf_ringbuf_pages_alloc(int nr_meta_pages,
 		if (i >= nr_meta_pages)
 			pages[nr_data_pages + i] = page;
 	}
-
+	active_vm_item_set(old_active_vm);
 	return pages;
 
 err_free_pages:
 	bpf_ringbuf_pages_free(pages, nr_pages);
 err:
+	active_vm_item_set(old_active_vm);
 	return NULL;
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c38875d6aea4..92572d4a09fb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -35,6 +35,8 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
+#include <linux/page_ext.h>
+#include <linux/active_vm.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -312,11 +314,14 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 	const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT;
 	unsigned int flags = 0;
 	unsigned long align = 1;
+	int old_active_vm;
 	void *area;
+	void *ptr;
 
 	if (size >= SIZE_MAX)
 		return NULL;
 
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	/* kmalloc()'ed memory can't be mmap()'ed */
 	if (mmapable) {
 		BUG_ON(!PAGE_ALIGNED(size));
@@ -325,13 +330,17 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 	} else if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
 		area = kmalloc_node(size, gfp | GFP_USER | __GFP_NORETRY,
 				    numa_node);
-		if (area != NULL)
+		if (area != NULL) {
+			active_vm_item_set(old_active_vm);
 			return area;
+		}
 	}
 
-	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
+	ptr = __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
 			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL,
 			flags, numa_node, __builtin_return_address(0));
+	active_vm_item_set(old_active_vm);
+	return ptr;
 }
 
 void *bpf_map_area_alloc(u64 size, int numa_node)
@@ -445,11 +454,14 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
 	struct mem_cgroup *memcg, *old_memcg;
+	int old_active_vm;
 	void *ptr;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
+	active_vm_item_set(old_active_vm);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
 
@@ -459,11 +471,14 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 {
 	struct mem_cgroup *memcg, *old_memcg;
+	int old_active_vm;
 	void *ptr;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
+	active_vm_item_set(old_active_vm);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
 
@@ -474,11 +489,14 @@ void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
 		       gfp_t flags)
 {
 	struct mem_cgroup *memcg, *old_memcg;
+	int old_active_vm;
 	void *ptr;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
+	active_vm_item_set(old_active_vm);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
 
@@ -490,10 +508,13 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 {
 	struct mem_cgroup *memcg, *old_memcg;
 	void __percpu *ptr;
+	int old_active_vm;
 
 	memcg = bpf_map_get_memcg(map);
 	old_memcg = set_active_memcg(memcg);
+	old_active_vm = active_vm_item_set(ACTIVE_VM_BPF);
 	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
+	active_vm_item_set(old_active_vm);
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
 
-- 
2.30.1 (Apple Git-130)



* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
                   ` (8 preceding siblings ...)
  2022-12-12  0:37 ` [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage Yafang Shao
@ 2022-12-12 17:54 ` Vlastimil Babka
  2022-12-13 11:52   ` Yafang Shao
  9 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2022-12-12 17:54 UTC (permalink / raw)
  To: Yafang Shao, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm,
	penberg, rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo
  Cc: linux-mm, bpf

On 12/12/22 01:37, Yafang Shao wrote:
> Currently there's no way to get BPF memory usage, while we can only
> estimate the usage by bpftool or memcg, both of which are not reliable.
> 
> - bpftool
>   `bpftool {map,prog} show` can show us the memlock of each map and
>   prog, but the memlock is vary from the real memory size. The memlock
>   of a bpf object is approximately
>   `round_up(key_size + value_size, 8) * max_entries`,
>   so 1) it can't apply to the non-preallocated bpf map which may
>   increase or decrease the real memory size dynamically. 2) the element
>   size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>   That said the differece between these two values may be very great if
>   the key_size and value_size is small. For example in my verifaction,
>   the size of memlock and real memory of a preallocated hash map are,
> 
>   $ grep BPF /proc/meminfo
>   BPF:             1026048 B <<< the size of preallocated memalloc pool
> 
>   (create hash map)
> 
>   $ bpftool map show
>   3: hash  name count_map  flags 0x0
>           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> 
>   $ grep BPF /proc/meminfo
>   BPF:            84919344 B
>  
>   So the real memory size is $((84919344 - 1026048)) which is 83893296
>   bytes while the memlock is only 8388608 bytes.
> 
> - memcg  
>   With memcg we only know that the BPF memory usage is less than
>   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only 
>   know that the BPF memory usage is less than $MemTotal if the BPF
>   object is charged into root memcg :)
> 
> So we need a way to get the BPF memory usage especially there will be
> more and more bpf programs running on the production environment. The
> memory usage of BPF memory is not trivial, which deserves a new item in
> /proc/meminfo.
> 
> This patchset introduce a solution to calculate the BPF memory usage.
> This solution is similar to how memory is charged into memcg, so it is
> easy to understand. It counts three types of memory usage -
>  - page
>    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
>    their families.
>    When a page is allocated, we will count its size and mark the head
>    page, and then check the head page at page freeing.
>  - slab
>    via kmalloc, kmem_cache_alloc and their families.
>    When a slab object is allocated, we will mark this object in this
>    slab and check it at slab object freeing. That said we need extra memory
>    to store the information of each object in a slab.
>  - percpu 
>    via alloc_percpu and its family.
>    When a percpu area is allocated, we will mark this area in this
>    percpu chunk and check it at percpu area freeing. That said we need
>    extra memory to store the information of each area in a percpu chunk.
> 
> So we only need to annotate the allcation to add the BPF memory size,
> and the sub of the BPF memory size will be handled automatically at
> freeing. We can annotate it in irq, softirq or process context. To avoid
> counting the nested allcations, for example the percpu backing allocator,
> we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> the count consistent with memcg accounting. 

So you can't easily annotate the freeing places as well, to avoid the whole
tracking infrastructure? I thought there was a patchset for a whole
bpf-specific memory allocator, where accounting would be implemented
naturally, I would imagine.

> To store the information of a slab or a page, we need to create a new
> member in struct page, but we can do it in page extension which can
> avoid changing the size of struct page. So a new page extension
> active_vm is introduced. Each page and each slab which is allocated as
> BPF memory will have a struct active_vm. The reason it is named as
> active_vm is that we can extend it to other areas easily, for example in
> the future we may use it to count other memory usage. 
> 
> The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> compile time or kernel parameter `active_vm=` at runtime.

The issue with page_ext is the extra memory usage, so it was rather intended
for debugging features that can always be compiled in, but only enabled at
runtime when debugging is needed. The overhead is only paid when enabled.
That's at least the case for page_owner and page_table_check. The 32bit
page_idle is rather an oddity that could have instead stayed 64-bit only.

But this is proposing page_ext functionality that is supposed to be enabled
at all times in production, with the goal of improved accounting. Not
on-demand debugging. I'm afraid the costs will outweigh the benefits.

Just a quick thought, in case the bpf accounting really can't be handled
without marking pages and slab objects - since memcg already has hooks there
without needing page_ext, couldn't it be done by extending the memcg infra
instead?

> Below is the result of this patchset,
> 
> $ grep BPF /proc/meminfo
> BPF:                1002 kB
> 
> Currently only bpf map is supported, and only slub in supported.
> 
> Future works:
> - support bpf prog
> - not sure if it needs to support slab
>   (it seems slab will be deprecated)
> - support per-map memory usage
> - support per-memcg memory usage
> 
> Yafang Shao (9):
>   mm: Introduce active vm item
>   mm: Allow using active vm in all contexts
>   mm: percpu: Account active vm for percpu
>   mm: slab: Account active vm for slab
>   mm: Account active vm for page
>   bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
>   bpf: Use bpf_map_kzalloc in arraymap
>   bpf: Use bpf_map_kvcalloc in bpf_local_storage
>   bpf: Use active vm to account bpf map memory usage
> 
>  fs/proc/meminfo.c              |   3 +
>  include/linux/active_vm.h      |  73 ++++++++++++
>  include/linux/bpf.h            |   8 ++
>  include/linux/page_ext.h       |   1 +
>  include/linux/sched.h          |   5 +
>  kernel/bpf/arraymap.c          |  16 +--
>  kernel/bpf/bpf_local_storage.c |   4 +-
>  kernel/bpf/memalloc.c          |   5 +
>  kernel/bpf/ringbuf.c           |  75 ++++++++----
>  kernel/bpf/syscall.c           |  40 ++++++-
>  kernel/fork.c                  |   4 +
>  mm/Kconfig                     |   8 ++
>  mm/Makefile                    |   1 +
>  mm/active_vm.c                 | 203 +++++++++++++++++++++++++++++++++
>  mm/active_vm.h                 |  74 ++++++++++++
>  mm/page_alloc.c                |  14 +++
>  mm/page_ext.c                  |   4 +
>  mm/percpu-internal.h           |   3 +
>  mm/percpu.c                    |  43 +++++++
>  mm/slab.h                      |   7 ++
>  mm/slub.c                      |   2 +
>  21 files changed, 557 insertions(+), 36 deletions(-)
>  create mode 100644 include/linux/active_vm.h
>  create mode 100644 mm/active_vm.c
>  create mode 100644 mm/active_vm.h
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-12 17:54 ` [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Vlastimil Babka
@ 2022-12-13 11:52   ` Yafang Shao
  2022-12-13 14:56     ` Hyeonggon Yoo
  0 siblings, 1 reply; 19+ messages in thread
From: Yafang Shao @ 2022-12-13 11:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, 42.hyeyoo, linux-mm,
	bpf

On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/12/22 01:37, Yafang Shao wrote:
> > Currently there's no way to get BPF memory usage, while we can only
> > estimate the usage by bpftool or memcg, both of which are not reliable.
> >
> > - bpftool
> >   `bpftool {map,prog} show` can show us the memlock of each map and
> >   prog, but the memlock is vary from the real memory size. The memlock
> >   of a bpf object is approximately
> >   `round_up(key_size + value_size, 8) * max_entries`,
> >   so 1) it can't apply to the non-preallocated bpf map which may
> >   increase or decrease the real memory size dynamically. 2) the element
> >   size of some bpf map is not `key_size + value_size`, for example the
> >   element size of htab is
> >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >   That said the differece between these two values may be very great if
> >   the key_size and value_size is small. For example in my verifaction,
> >   the size of memlock and real memory of a preallocated hash map are,
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> >
> >   (create hash map)
> >
> >   $ bpftool map show
> >   3: hash  name count_map  flags 0x0
> >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:            84919344 B
> >
> >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> >   bytes while the memlock is only 8388608 bytes.
> >
> > - memcg
> >   With memcg we only know that the BPF memory usage is less than
> >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> >   know that the BPF memory usage is less than $MemTotal if the BPF
> >   object is charged into root memcg :)
> >
> > So we need a way to get the BPF memory usage especially there will be
> > more and more bpf programs running on the production environment. The
> > memory usage of BPF memory is not trivial, which deserves a new item in
> > /proc/meminfo.
> >
> > This patchset introduce a solution to calculate the BPF memory usage.
> > This solution is similar to how memory is charged into memcg, so it is
> > easy to understand. It counts three types of memory usage -
> >  - page
> >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> >    their families.
> >    When a page is allocated, we will count its size and mark the head
> >    page, and then check the head page at page freeing.
> >  - slab
> >    via kmalloc, kmem_cache_alloc and their families.
> >    When a slab object is allocated, we will mark this object in this
> >    slab and check it at slab object freeing. That said we need extra memory
> >    to store the information of each object in a slab.
> >  - percpu
> >    via alloc_percpu and its family.
> >    When a percpu area is allocated, we will mark this area in this
> >    percpu chunk and check it at percpu area freeing. That said we need
> >    extra memory to store the information of each area in a percpu chunk.
> >
> > So we only need to annotate the allcation to add the BPF memory size,
> > and the sub of the BPF memory size will be handled automatically at
> > freeing. We can annotate it in irq, softirq or process context. To avoid
> > counting the nested allcations, for example the percpu backing allocator,
> > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> > the count consistent with memcg accounting.
>
> So you can't easily annotate the freeing places as well, to avoid the whole
> tracking infrastructure?

The trouble is kfree_rcu().  For example,
    old_item = active_vm_item_set(ACTIVE_VM_BPF);
    kfree_rcu();
    active_vm_item_set(old_item);
The actual kfree() does not happen between the two active_vm_item_set()
calls; it is deferred to an RCU callback that runs later, in a different
context. If we want to pass ACTIVE_VM_BPF into that deferred RCU
context, we will have to change lots of code in the RCU subsystem. I'm
not sure if it is worth it.
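
Conceptually, what kfree_rcu() does is roughly the following
(illustrative sketch only, not the exact implementation; 'struct foo'
is just a stand-in):

    /* The callback RCU eventually invokes to do the real free.  It
     * runs from the RCU softirq or an rcuo kthread, i.e. not in the
     * task that called kfree_rcu(), so a per-task marker set around
     * the kfree_rcu() call is not visible here.
     */
    static void foo_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, struct foo, rcu));
    }

    /* kfree_rcu(fp, rcu) itself only queues the object and returns
     * immediately; conceptually it is equivalent to:
     */
    call_rcu(&fp->rcu, foo_free_rcu);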

>  I thought there was a patchset for a whole
> bfp-specific memory allocator, where accounting would be implemented
> naturally, I would imagine.
>

I posted a patchset[1] several months ago which annotates both allocation
and freeing.
But unfortunately, after more investigation and verification, I found that
the deferred freeing context is a problem which can't be resolved easily.
That's why I finally decided to annotate allocation only.

[1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/

> > To store the information of a slab or a page, we need to create a new
> > member in struct page, but we can do it in page extension which can
> > avoid changing the size of struct page. So a new page extension
> > active_vm is introduced. Each page and each slab which is allocated as
> > BPF memory will have a struct active_vm. The reason it is named as
> > active_vm is that we can extend it to other areas easily, for example in
> > the future we may use it to count other memory usage.
> >
> > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> > compile time or kernel parameter `active_vm=` at runtime.
>
> The issue with page_ext is the extra memory usage, so it was rather intended
> for debugging features that can be always compiled in, but only enabled at
> runtime when debugging is needed. The overhead is only paid when enabled.
> That's at least the case of page_owner and page_table_check. The 32bit
> page_idle is rather an oddity that could have instead stayed 64-bit only.
>

Right, it seems currently page_ext is for debugging purposes only.

> But this is proposing a page_ext functionality supposed to be enabled at all
> times in production, with the goal of improved accounting. Not an on-demand
> debugging. I'm afraid the costs will outweight the benefits.
>

The memory overhead of this new page extension is 8 bytes per 4KB page,
which is about 0.2% of total memory. That doesn't seem too big to be
acceptable. If the user really thinks this overhead is unacceptable, they
can set "active_vm=off" to disable it.

To reduce the memory overhead further, I have a bold idea.
We don't actually need to allocate such a page extension for every page;
we only need to allocate it when it is going to be used. In other words,
it seems we could allocate page extensions dynamically rather than
preallocating them at boot, but I haven't investigated deeply enough to
check whether that can work. What do you think?

> Just a quick thought, in case the bpf accounting really can't be handled
> without marking pages and slab objects - since memcg already has hooks there
> without need of page_ext, couldn't it be done by extending the memcg infra
> instead?
>

We need to make sure the accounting of BPF memory usage still works
even without memcg; see also the previous discussion[2].

[2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 11:52   ` Yafang Shao
@ 2022-12-13 14:56     ` Hyeonggon Yoo
  2022-12-13 15:52       ` Vlastimil Babka
  2022-12-14 10:34       ` Yafang Shao
  0 siblings, 2 replies; 19+ messages in thread
From: Hyeonggon Yoo @ 2022-12-13 14:56 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Vlastimil Babka, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm,
	penberg, rientjes, iamjoonsoo.kim, roman.gushchin, linux-mm, bpf,
	rcu

On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/12/22 01:37, Yafang Shao wrote:
> > > Currently there's no way to get BPF memory usage, while we can only
> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > >
> > > - bpftool
> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > >   prog, but the memlock is vary from the real memory size. The memlock
> > >   of a bpf object is approximately
> > >   `round_up(key_size + value_size, 8) * max_entries`,
> > >   so 1) it can't apply to the non-preallocated bpf map which may
> > >   increase or decrease the real memory size dynamically. 2) the element
> > >   size of some bpf map is not `key_size + value_size`, for example the
> > >   element size of htab is
> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >   That said the differece between these two values may be very great if
> > >   the key_size and value_size is small. For example in my verifaction,
> > >   the size of memlock and real memory of a preallocated hash map are,
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> > >
> > >   (create hash map)
> > >
> > >   $ bpftool map show
> > >   3: hash  name count_map  flags 0x0
> > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:            84919344 B
> > >
> > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> > >   bytes while the memlock is only 8388608 bytes.
> > >
> > > - memcg
> > >   With memcg we only know that the BPF memory usage is less than
> > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> > >   know that the BPF memory usage is less than $MemTotal if the BPF
> > >   object is charged into root memcg :)
> > >
> > > So we need a way to get the BPF memory usage especially there will be
> > > more and more bpf programs running on the production environment. The
> > > memory usage of BPF memory is not trivial, which deserves a new item in
> > > /proc/meminfo.
> > >
> > > This patchset introduce a solution to calculate the BPF memory usage.
> > > This solution is similar to how memory is charged into memcg, so it is
> > > easy to understand. It counts three types of memory usage -
> > >  - page
> > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> > >    their families.
> > >    When a page is allocated, we will count its size and mark the head
> > >    page, and then check the head page at page freeing.
> > >  - slab
> > >    via kmalloc, kmem_cache_alloc and their families.
> > >    When a slab object is allocated, we will mark this object in this
> > >    slab and check it at slab object freeing. That said we need extra memory
> > >    to store the information of each object in a slab.
> > >  - percpu
> > >    via alloc_percpu and its family.
> > >    When a percpu area is allocated, we will mark this area in this
> > >    percpu chunk and check it at percpu area freeing. That said we need
> > >    extra memory to store the information of each area in a percpu chunk.
> > >
> > > So we only need to annotate the allcation to add the BPF memory size,
> > > and the sub of the BPF memory size will be handled automatically at
> > > freeing. We can annotate it in irq, softirq or process context. To avoid
> > > counting the nested allcations, for example the percpu backing allocator,
> > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> > > the count consistent with memcg accounting.
> >
> > So you can't easily annotate the freeing places as well, to avoid the whole
> > tracking infrastructure?
> 
> The trouble is kfree_rcu().  for example,
>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
>     kfree_rcu();
>     active_vm_item_set(old_item);
> If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
> will change lots of code in the RCU subsystem. I'm not sure if it is
> worth it.

(+Cc rcu folks)

IMO adding a new kfree_rcu() variant for BPF that accounts for BPF memory
usage would be much less churn :)
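
Roughly something like this (just a sketch; bpf_kfree_rcu() and
active_vm_sub() are made-up names here, not existing APIs):

    /* Account the memory as freed up front, then let RCU defer only
     * the actual kfree().  ksize() returns the allocated size of the
     * kmalloc'ed object.
     */
    #define bpf_kfree_rcu(ptr, rhf)                           \
    do {                                                      \
            active_vm_sub(ACTIVE_VM_BPF, ksize(ptr));         \
            kfree_rcu(ptr, rhf);                              \
    } while (0)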
 
> 
> >  I thought there was a patchset for a whole
> > bfp-specific memory allocator, where accounting would be implemented
> > naturally, I would imagine.
> >
> 
> I posted a patchset[1] which annotates both allocating and freeing
> several months ago.
> But unfortunately after more investigation and verification I found
> the deferred freeing context is a problem, which can't be resolved
> easily.
> That's why I finally decided to annotate allocating only.
> 
> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> 
> > > To store the information of a slab or a page, we need to create a new
> > > member in struct page, but we can do it in page extension which can
> > > avoid changing the size of struct page. So a new page extension
> > > active_vm is introduced. Each page and each slab which is allocated as
> > > BPF memory will have a struct active_vm. The reason it is named as
> > > active_vm is that we can extend it to other areas easily, for example in
> > > the future we may use it to count other memory usage.
> > >
> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> > > compile time or kernel parameter `active_vm=` at runtime.
> >
> > The issue with page_ext is the extra memory usage, so it was rather intended
> > for debugging features that can be always compiled in, but only enabled at
> > runtime when debugging is needed. The overhead is only paid when enabled.
> > That's at least the case of page_owner and page_table_check. The 32bit
> > page_idle is rather an oddity that could have instead stayed 64-bit only.
> >
> 
> Right, it seems currently page_ext is for debugging purposes only.
> 
> > But this is proposing a page_ext functionality supposed to be enabled at all
> > times in production, with the goal of improved accounting. Not an on-demand
> > debugging. I'm afraid the costs will outweight the benefits.
> >
> 
> The memory overhead of this new page extension is (8/4096), which is
> 0.2% of total memory. Not too big to be acceptable.

It's generally unacceptable to increase sizeof(struct page)
(or to enable page_ext by default; that's why page_ext is for
debugging purposes only)

> If the user really
> thinks this overhead is not accepted, he can set "active_vm=off" to
> disable it.

I'd say many people won't welcome spending 0.2% of total memory by
default just to get BPF memory usage.

> To reduce the memory overhead further, I have a bold idea.
> Actually we don't need to allocate such a page extension for every
> page,  while we only need to allocate it if the user needs to access
> it. That said it seems that we can allocate some kind of page
> extensions dynamically rather than preallocate at booting, but I
> haven't investigated it deeply to check if it can work. What do you
> think?
> 
> > Just a quick thought, in case the bpf accounting really can't be handled
> > without marking pages and slab objects - since memcg already has hooks there
> > without need of page_ext, couldn't it be done by extending the memcg infra
> > instead?
> >
> 
> We need to make sure the accounting of BPF memory usage is still
> workable even without memcg, see also the previous discussion[2].
> 
> [2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/

-- 
Thanks,
Hyeonggon

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 14:56     ` Hyeonggon Yoo
@ 2022-12-13 15:52       ` Vlastimil Babka
  2022-12-13 19:21         ` Paul E. McKenney
  2022-12-14 10:43         ` Yafang Shao
  2022-12-14 10:34       ` Yafang Shao
  1 sibling, 2 replies; 19+ messages in thread
From: Vlastimil Babka @ 2022-12-13 15:52 UTC (permalink / raw)
  To: Hyeonggon Yoo, Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm, penberg,
	rientjes, iamjoonsoo.kim, roman.gushchin, linux-mm, bpf, rcu,
	Matthew Wilcox

On 12/13/22 15:56, Hyeonggon Yoo wrote:
> On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
>> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >
>> > On 12/12/22 01:37, Yafang Shao wrote:
>> > > Currently there's no way to get BPF memory usage, while we can only
>> > > estimate the usage by bpftool or memcg, both of which are not reliable.
>> > >
>> > > - bpftool
>> > >   `bpftool {map,prog} show` can show us the memlock of each map and
>> > >   prog, but the memlock is vary from the real memory size. The memlock
>> > >   of a bpf object is approximately
>> > >   `round_up(key_size + value_size, 8) * max_entries`,
>> > >   so 1) it can't apply to the non-preallocated bpf map which may
>> > >   increase or decrease the real memory size dynamically. 2) the element
>> > >   size of some bpf map is not `key_size + value_size`, for example the
>> > >   element size of htab is
>> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>> > >   That said the differece between these two values may be very great if
>> > >   the key_size and value_size is small. For example in my verifaction,
>> > >   the size of memlock and real memory of a preallocated hash map are,
>> > >
>> > >   $ grep BPF /proc/meminfo
>> > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
>> > >
>> > >   (create hash map)
>> > >
>> > >   $ bpftool map show
>> > >   3: hash  name count_map  flags 0x0
>> > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
>> > >
>> > >   $ grep BPF /proc/meminfo
>> > >   BPF:            84919344 B
>> > >
>> > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
>> > >   bytes while the memlock is only 8388608 bytes.
>> > >
>> > > - memcg
>> > >   With memcg we only know that the BPF memory usage is less than
>> > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
>> > >   know that the BPF memory usage is less than $MemTotal if the BPF
>> > >   object is charged into root memcg :)
>> > >
>> > > So we need a way to get the BPF memory usage especially there will be
>> > > more and more bpf programs running on the production environment. The
>> > > memory usage of BPF memory is not trivial, which deserves a new item in
>> > > /proc/meminfo.
>> > >
>> > > This patchset introduce a solution to calculate the BPF memory usage.
>> > > This solution is similar to how memory is charged into memcg, so it is
>> > > easy to understand. It counts three types of memory usage -
>> > >  - page
>> > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
>> > >    their families.
>> > >    When a page is allocated, we will count its size and mark the head
>> > >    page, and then check the head page at page freeing.
>> > >  - slab
>> > >    via kmalloc, kmem_cache_alloc and their families.
>> > >    When a slab object is allocated, we will mark this object in this
>> > >    slab and check it at slab object freeing. That said we need extra memory
>> > >    to store the information of each object in a slab.
>> > >  - percpu
>> > >    via alloc_percpu and its family.
>> > >    When a percpu area is allocated, we will mark this area in this
>> > >    percpu chunk and check it at percpu area freeing. That said we need
>> > >    extra memory to store the information of each area in a percpu chunk.
>> > >
>> > > So we only need to annotate the allcation to add the BPF memory size,
>> > > and the sub of the BPF memory size will be handled automatically at
>> > > freeing. We can annotate it in irq, softirq or process context. To avoid
>> > > counting the nested allcations, for example the percpu backing allocator,
>> > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
>> > > the count consistent with memcg accounting.
>> >
>> > So you can't easily annotate the freeing places as well, to avoid the whole
>> > tracking infrastructure?
>> 
>> The trouble is kfree_rcu().  for example,
>>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
>>     kfree_rcu();
>>     active_vm_item_set(old_item);
>> If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
>> will change lots of code in the RCU subsystem. I'm not sure if it is
>> worth it.
> 
> (+Cc rcu folks)
> 
> IMO adding new kfree_rcu() varient for BPF that accounts BPF memory
> usage would be much less churn :)

Alternatively, just account the bpf memory as freed already when calling
kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
a separate issue (if it's actually an issue) and not something each
kfree_rcu() user should think about separately?

>> 
>> >  I thought there was a patchset for a whole
>> > bfp-specific memory allocator, where accounting would be implemented
>> > naturally, I would imagine.
>> >
>> 
>> I posted a patchset[1] which annotates both allocating and freeing
>> several months ago.
>> But unfortunately after more investigation and verification I found
>> the deferred freeing context is a problem, which can't be resolved
>> easily.
>> That's why I finally decided to annotate allocating only.
>> 
>> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
>> 
>> > > To store the information of a slab or a page, we need to create a new
>> > > member in struct page, but we can do it in page extension which can
>> > > avoid changing the size of struct page. So a new page extension
>> > > active_vm is introduced. Each page and each slab which is allocated as
>> > > BPF memory will have a struct active_vm. The reason it is named as
>> > > active_vm is that we can extend it to other areas easily, for example in
>> > > the future we may use it to count other memory usage.
>> > >
>> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
>> > > compile time or kernel parameter `active_vm=` at runtime.
>> >
>> > The issue with page_ext is the extra memory usage, so it was rather intended
>> > for debugging features that can be always compiled in, but only enabled at
>> > runtime when debugging is needed. The overhead is only paid when enabled.
>> > That's at least the case of page_owner and page_table_check. The 32bit
>> > page_idle is rather an oddity that could have instead stayed 64-bit only.
>> >
>> 
>> Right, it seems currently page_ext is for debugging purposes only.
>> 
>> > But this is proposing a page_ext functionality supposed to be enabled at all
>> > times in production, with the goal of improved accounting. Not an on-demand
>> > debugging. I'm afraid the costs will outweight the benefits.
>> >
>> 
>> The memory overhead of this new page extension is (8/4096), which is
>> 0.2% of total memory. Not too big to be acceptable.
> 
> It's generally unacceptable to increase sizeof(struct page)
> (nor enabling page_ext by default, and that's the why page_ext is for
> debugging purposes only)
> 
>> If the user really
>> thinks this overhead is not accepted, he can set "active_vm=off" to
>> disable it.
> 
> I'd say many people won't welcome adding 0.2% of total memory by default
> to get BPF memory usage. 

Agreed.

>> To reduce the memory overhead further, I have a bold idea.
>> Actually we don't need to allocate such a page extension for every
>> page,  while we only need to allocate it if the user needs to access
>> it. That said it seems that we can allocate some kind of page
>> extensions dynamically rather than preallocate at booting, but I
>> haven't investigated it deeply to check if it can work. What do you
>> think?

There are lots of benefits (simplicity) to page_ext being allocated as it
is today. What you're suggesting will be better solved (in a few years :)
by Matthew's bold ideas about shrinking the current struct page and
allocating usecase-specific descriptors.

>> > Just a quick thought, in case the bpf accounting really can't be handled
>> > without marking pages and slab objects - since memcg already has hooks there
>> > without need of page_ext, couldn't it be done by extending the memcg infra
>> > instead?
>> >
>> 
>> We need to make sure the accounting of BPF memory usage is still
>> workable even without memcg, see also the previous discussion[2].
>> 
>> [2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 15:52       ` Vlastimil Babka
@ 2022-12-13 19:21         ` Paul E. McKenney
  2022-12-14 10:46           ` Yafang Shao
  2022-12-14 10:43         ` Yafang Shao
  1 sibling, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2022-12-13 19:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hyeonggon Yoo, Yafang Shao, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	tj, dennis, cl, akpm, penberg, rientjes, iamjoonsoo.kim,
	roman.gushchin, linux-mm, bpf, rcu, Matthew Wilcox

On Tue, Dec 13, 2022 at 04:52:09PM +0100, Vlastimil Babka wrote:
> On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >
> >> > On 12/12/22 01:37, Yafang Shao wrote:
> >> > > Currently there's no way to get BPF memory usage, while we can only
> >> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> >> > >
> >> > > - bpftool
> >> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> >> > >   prog, but the memlock is vary from the real memory size. The memlock
> >> > >   of a bpf object is approximately
> >> > >   `round_up(key_size + value_size, 8) * max_entries`,
> >> > >   so 1) it can't apply to the non-preallocated bpf map which may
> >> > >   increase or decrease the real memory size dynamically. 2) the element
> >> > >   size of some bpf map is not `key_size + value_size`, for example the
> >> > >   element size of htab is
> >> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >> > >   That said the differece between these two values may be very great if
> >> > >   the key_size and value_size is small. For example in my verifaction,
> >> > >   the size of memlock and real memory of a preallocated hash map are,
> >> > >
> >> > >   $ grep BPF /proc/meminfo
> >> > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> >> > >
> >> > >   (create hash map)
> >> > >
> >> > >   $ bpftool map show
> >> > >   3: hash  name count_map  flags 0x0
> >> > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >> > >
> >> > >   $ grep BPF /proc/meminfo
> >> > >   BPF:            84919344 B
> >> > >
> >> > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> >> > >   bytes while the memlock is only 8388608 bytes.
> >> > >
> >> > > - memcg
> >> > >   With memcg we only know that the BPF memory usage is less than
> >> > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> >> > >   know that the BPF memory usage is less than $MemTotal if the BPF
> >> > >   object is charged into root memcg :)
> >> > >
> >> > > So we need a way to get the BPF memory usage especially there will be
> >> > > more and more bpf programs running on the production environment. The
> >> > > memory usage of BPF memory is not trivial, which deserves a new item in
> >> > > /proc/meminfo.
> >> > >
> >> > > This patchset introduce a solution to calculate the BPF memory usage.
> >> > > This solution is similar to how memory is charged into memcg, so it is
> >> > > easy to understand. It counts three types of memory usage -
> >> > >  - page
> >> > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> >> > >    their families.
> >> > >    When a page is allocated, we will count its size and mark the head
> >> > >    page, and then check the head page at page freeing.
> >> > >  - slab
> >> > >    via kmalloc, kmem_cache_alloc and their families.
> >> > >    When a slab object is allocated, we will mark this object in this
> >> > >    slab and check it at slab object freeing. That said we need extra memory
> >> > >    to store the information of each object in a slab.
> >> > >  - percpu
> >> > >    via alloc_percpu and its family.
> >> > >    When a percpu area is allocated, we will mark this area in this
> >> > >    percpu chunk and check it at percpu area freeing. That said we need
> >> > >    extra memory to store the information of each area in a percpu chunk.
> >> > >
> >> > > So we only need to annotate the allcation to add the BPF memory size,
> >> > > and the sub of the BPF memory size will be handled automatically at
> >> > > freeing. We can annotate it in irq, softirq or process context. To avoid
> >> > > counting the nested allcations, for example the percpu backing allocator,
> >> > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> >> > > the count consistent with memcg accounting.
> >> >
> >> > So you can't easily annotate the freeing places as well, to avoid the whole
> >> > tracking infrastructure?
> >> 
> >> The trouble is kfree_rcu().  for example,
> >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >>     kfree_rcu();
> >>     active_vm_item_set(old_item);
> >> If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
> >> will change lots of code in the RCU subsystem. I'm not sure if it is
> >> worth it.
> > 
> > (+Cc rcu folks)
> > 
> > IMO adding new kfree_rcu() varient for BPF that accounts BPF memory
> > usage would be much less churn :)
> 
> Alternatively, just account the bpf memory as freed already when calling
> kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
> a separate issue (if it's actually an issue) and not something each
> kfree_rcu() user should think about separately?

If the in-flight memory really does need to be accounted for, then one
straightforward approach is to use call_rcu() and do the first part of
the needed accounting at the call_rcu() callsite and the rest of the
accounting when the callback is invoked.  Or, if memory must be freed
quickly even on ChromeOS and Android, use call_rcu_hurry() instead
of call_rcu().

Or is there some accounting requirement that I am missing?
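
For example, something along these lines (illustrative only; 'struct
foo' is assumed to embed a struct rcu_head, and the active_vm_*()
helpers are placeholders, not existing APIs):

    /* Second half of the accounting, run from the RCU callback once
     * the grace period has elapsed and the memory is really freed.
     */
    static void foo_free_rcu(struct rcu_head *rhp)
    {
            struct foo *fp = container_of(rhp, struct foo, rcu);

            active_vm_sub_in_flight(ACTIVE_VM_BPF, ksize(fp));
            kfree(fp);
    }

    static void foo_free(struct foo *fp)
    {
            /* First half at the call site: no longer in use, but the
             * memory stays allocated ("in flight") until the grace
             * period ends.
             */
            active_vm_move_to_in_flight(ACTIVE_VM_BPF, ksize(fp));
            call_rcu(&fp->rcu, foo_free_rcu);
    }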

							Thanx, Paul

> >> >  I thought there was a patchset for a whole
> >> > bfp-specific memory allocator, where accounting would be implemented
> >> > naturally, I would imagine.
> >> >
> >> 
> >> I posted a patchset[1] which annotates both allocating and freeing
> >> several months ago.
> >> But unfortunately after more investigation and verification I found
> >> the deferred freeing context is a problem, which can't be resolved
> >> easily.
> >> That's why I finally decided to annotate allocating only.
> >> 
> >> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >> 
> >> > > To store the information of a slab or a page, we need to create a new
> >> > > member in struct page, but we can do it in page extension which can
> >> > > avoid changing the size of struct page. So a new page extension
> >> > > active_vm is introduced. Each page and each slab which is allocated as
> >> > > BPF memory will have a struct active_vm. The reason it is named as
> >> > > active_vm is that we can extend it to other areas easily, for example in
> >> > > the future we may use it to count other memory usage.
> >> > >
> >> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> >> > > compile time or kernel parameter `active_vm=` at runtime.
> >> >
> >> > The issue with page_ext is the extra memory usage, so it was rather intended
> >> > for debugging features that can be always compiled in, but only enabled at
> >> > runtime when debugging is needed. The overhead is only paid when enabled.
> >> > That's at least the case of page_owner and page_table_check. The 32bit
> >> > page_idle is rather an oddity that could have instead stayed 64-bit only.
> >> >
> >> 
> >> Right, it seems currently page_ext is for debugging purposes only.
> >> 
> >> > But this is proposing a page_ext functionality supposed to be enabled at all
> >> > times in production, with the goal of improved accounting. Not an on-demand
> >> > debugging. I'm afraid the costs will outweight the benefits.
> >> >
> >> 
> >> The memory overhead of this new page extension is (8/4096), which is
> >> 0.2% of total memory. Not too big to be acceptable.
> > 
> > It's generally unacceptable to increase sizeof(struct page)
> > (nor enabling page_ext by default, and that's the why page_ext is for
> > debugging purposes only)
> > 
> >> If the user really
> >> thinks this overhead is not accepted, he can set "active_vm=off" to
> >> disable it.
> > 
> > I'd say many people won't welcome adding 0.2% of total memory by default
> > to get BPF memory usage. 
> 
> Agreed.
> 
> >> To reduce the memory overhead further, I have a bold idea.
> >> Actually we don't need to allocate such a page extension for every
> >> page,  while we only need to allocate it if the user needs to access
> >> it. That said it seems that we can allocate some kind of page
> >> extensions dynamically rather than preallocate at booting, but I
> >> haven't investigated it deeply to check if it can work. What do you
> >> think?
> 
> There's lots of benefits (simplicity) of page_ext being allocated as it is
> today. What you're suggesting will be better solved (in few years :) by
> Matthew's bold ideas about shrinking the current struct page and allocating
> usecase-specific descriptors.
> 
> >> > Just a quick thought, in case the bpf accounting really can't be handled
> >> > without marking pages and slab objects - since memcg already has hooks there
> >> > without need of page_ext, couldn't it be done by extending the memcg infra
> >> > instead?
> >> >
> >> 
> >> We need to make sure the accounting of BPF memory usage is still
> >> workable even without memcg, see also the previous discussion[2].
> >> 
> >> [2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 14:56     ` Hyeonggon Yoo
  2022-12-13 15:52       ` Vlastimil Babka
@ 2022-12-14 10:34       ` Yafang Shao
  1 sibling, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-14 10:34 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Vlastimil Babka, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm,
	penberg, rientjes, iamjoonsoo.kim, roman.gushchin, linux-mm, bpf,
	rcu

On Tue, Dec 13, 2022 at 10:56 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> > On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 12/12/22 01:37, Yafang Shao wrote:
> > > > Currently there's no way to get BPF memory usage, while we can only
> > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > >
> > > > - bpftool
> > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > >   of a bpf object is approximately
> > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > >   increase or decrease the real memory size dynamically. 2) the element
> > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > >   element size of htab is
> > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > >   That said the differece between these two values may be very great if
> > > >   the key_size and value_size is small. For example in my verifaction,
> > > >   the size of memlock and real memory of a preallocated hash map are,
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> > > >
> > > >   (create hash map)
> > > >
> > > >   $ bpftool map show
> > > >   3: hash  name count_map  flags 0x0
> > > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:            84919344 B
> > > >
> > > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> > > >   bytes while the memlock is only 8388608 bytes.
> > > >
> > > > - memcg
> > > >   With memcg we only know that the BPF memory usage is less than
> > > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> > > >   know that the BPF memory usage is less than $MemTotal if the BPF
> > > >   object is charged into root memcg :)
> > > >
> > > > So we need a way to get the BPF memory usage especially there will be
> > > > more and more bpf programs running on the production environment. The
> > > > memory usage of BPF memory is not trivial, which deserves a new item in
> > > > /proc/meminfo.
> > > >
> > > > This patchset introduce a solution to calculate the BPF memory usage.
> > > > This solution is similar to how memory is charged into memcg, so it is
> > > > easy to understand. It counts three types of memory usage -
> > > >  - page
> > > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> > > >    their families.
> > > >    When a page is allocated, we will count its size and mark the head
> > > >    page, and then check the head page at page freeing.
> > > >  - slab
> > > >    via kmalloc, kmem_cache_alloc and their families.
> > > >    When a slab object is allocated, we will mark this object in this
> > > >    slab and check it at slab object freeing. That said we need extra memory
> > > >    to store the information of each object in a slab.
> > > >  - percpu
> > > >    via alloc_percpu and its family.
> > > >    When a percpu area is allocated, we will mark this area in this
> > > >    percpu chunk and check it at percpu area freeing. That said we need
> > > >    extra memory to store the information of each area in a percpu chunk.
> > > >
> > > > So we only need to annotate the allcation to add the BPF memory size,
> > > > and the sub of the BPF memory size will be handled automatically at
> > > > freeing. We can annotate it in irq, softirq or process context. To avoid
> > > > counting the nested allcations, for example the percpu backing allocator,
> > > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> > > > the count consistent with memcg accounting.
> > >
> > > So you can't easily annotate the freeing places as well, to avoid the whole
> > > tracking infrastructure?
> >
> > The trouble is kfree_rcu().  for example,
> >     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >     kfree_rcu();
> >     active_vm_item_set(old_item);
> > If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
> > will change lots of code in the RCU subsystem. I'm not sure if it is
> > worth it.
>
> (+Cc rcu folks)
>
> IMO adding new kfree_rcu() varient for BPF that accounts BPF memory
> usage would be much less churn :)
>

Besides kfree_rcu(), there are other deferred freeing paths, for
example vfree_deferred(), although vfree_deferred() can be handled
easily.
Once we go this way, the BPF memory usage accounting becomes a burden
on every newly introduced deferred freeing path.
For example, if the MM folks optimize the memory freeing code in the
future without noticing this BPF accounting, it will be broken easily.
So in the long run, such restrictions should be kept to a minimum.

> >
> > >  I thought there was a patchset for a whole
> > > bfp-specific memory allocator, where accounting would be implemented
> > > naturally, I would imagine.
> > >
> >
> > I posted a patchset[1] which annotates both allocating and freeing
> > several months ago.
> > But unfortunately after more investigation and verification I found
> > the deferred freeing context is a problem, which can't be resolved
> > easily.
> > That's why I finally decided to annotate allocating only.
> >
> > [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >
> > > > To store the information of a slab or a page, we need to create a new
> > > > member in struct page, but we can do it in page extension which can
> > > > avoid changing the size of struct page. So a new page extension
> > > > active_vm is introduced. Each page and each slab which is allocated as
> > > > BPF memory will have a struct active_vm. The reason it is named as
> > > > active_vm is that we can extend it to other areas easily, for example in
> > > > the future we may use it to count other memory usage.
> > > >
> > > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> > > > compile time or kernel parameter `active_vm=` at runtime.
> > >
> > > The issue with page_ext is the extra memory usage, so it was rather intended
> > > for debugging features that can be always compiled in, but only enabled at
> > > runtime when debugging is needed. The overhead is only paid when enabled.
> > > That's at least the case of page_owner and page_table_check. The 32bit
> > > page_idle is rather an oddity that could have instead stayed 64-bit only.
> > >
> >
> > Right, it seems currently page_ext is for debugging purposes only.
> >
> > > But this is proposing a page_ext functionality supposed to be enabled at all
> > > times in production, with the goal of improved accounting. Not an on-demand
> > > debugging. I'm afraid the costs will outweight the benefits.
> > >
> >
> > The memory overhead of this new page extension is (8/4096), which is
> > 0.2% of total memory. Not too big to be acceptable.
>
> It's generally unacceptable to increase sizeof(struct page)
> (nor enabling page_ext by default, and that's the why page_ext is for
> debugging purposes only)
>

Then we can disable page_ext by default; users who want it can pass
"active_vm=on".

> > If the user really
> > thinks this overhead is not accepted, he can set "active_vm=off" to
> > disable it.
>
> I'd say many people won't welcome adding 0.2% of total memory by default
> to get BPF memory usage.
>

This 0.2% of total memory can be reduced significantly by introducing
dynamically allocated page_ext, so it is not a problem that will last
forever.
We can optimize the memory overhead step by step to make it acceptable
to more and more people.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 15:52       ` Vlastimil Babka
  2022-12-13 19:21         ` Paul E. McKenney
@ 2022-12-14 10:43         ` Yafang Shao
  1 sibling, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-14 10:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hyeonggon Yoo, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, tj, dennis, cl, akpm,
	penberg, rientjes, iamjoonsoo.kim, roman.gushchin, linux-mm, bpf,
	rcu, Matthew Wilcox

On Tue, Dec 13, 2022 at 11:52 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >
> >> > On 12/12/22 01:37, Yafang Shao wrote:
> >> > > Currently there's no way to get BPF memory usage, while we can only
> >> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> >> > >
> >> > > - bpftool
> >> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> >> > >   prog, but the memlock is vary from the real memory size. The memlock
> >> > >   of a bpf object is approximately
> >> > >   `round_up(key_size + value_size, 8) * max_entries`,
> >> > >   so 1) it can't apply to the non-preallocated bpf map which may
> >> > >   increase or decrease the real memory size dynamically. 2) the element
> >> > >   size of some bpf map is not `key_size + value_size`, for example the
> >> > >   element size of htab is
> >> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >> > >   That said the differece between these two values may be very great if
> >> > >   the key_size and value_size is small. For example in my verifaction,
> >> > >   the size of memlock and real memory of a preallocated hash map are,
> >> > >
> >> > >   $ grep BPF /proc/meminfo
> >> > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> >> > >
> >> > >   (create hash map)
> >> > >
> >> > >   $ bpftool map show
> >> > >   3: hash  name count_map  flags 0x0
> >> > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >> > >
> >> > >   $ grep BPF /proc/meminfo
> >> > >   BPF:            84919344 B
> >> > >
> >> > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> >> > >   bytes while the memlock is only 8388608 bytes.
> >> > >
> >> > > - memcg
> >> > >   With memcg we only know that the BPF memory usage is less than
> >> > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> >> > >   know that the BPF memory usage is less than $MemTotal if the BPF
> >> > >   object is charged into root memcg :)
> >> > >
> >> > > So we need a way to get the BPF memory usage especially there will be
> >> > > more and more bpf programs running on the production environment. The
> >> > > memory usage of BPF memory is not trivial, which deserves a new item in
> >> > > /proc/meminfo.
> >> > >
> >> > > This patchset introduce a solution to calculate the BPF memory usage.
> >> > > This solution is similar to how memory is charged into memcg, so it is
> >> > > easy to understand. It counts three types of memory usage -
> >> > >  - page
> >> > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> >> > >    their families.
> >> > >    When a page is allocated, we will count its size and mark the head
> >> > >    page, and then check the head page at page freeing.
> >> > >  - slab
> >> > >    via kmalloc, kmem_cache_alloc and their families.
> >> > >    When a slab object is allocated, we will mark this object in this
> >> > >    slab and check it at slab object freeing. That said we need extra memory
> >> > >    to store the information of each object in a slab.
> >> > >  - percpu
> >> > >    via alloc_percpu and its family.
> >> > >    When a percpu area is allocated, we will mark this area in this
> >> > >    percpu chunk and check it at percpu area freeing. That said we need
> >> > >    extra memory to store the information of each area in a percpu chunk.
> >> > >
> >> > > So we only need to annotate the allcation to add the BPF memory size,
> >> > > and the sub of the BPF memory size will be handled automatically at
> >> > > freeing. We can annotate it in irq, softirq or process context. To avoid
> >> > > counting the nested allcations, for example the percpu backing allocator,
> >> > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> >> > > the count consistent with memcg accounting.
> >> >
> >> > So you can't easily annotate the freeing places as well, to avoid the whole
> >> > tracking infrastructure?
> >>
> >> The trouble is kfree_rcu().  for example,
> >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >>     kfree_rcu();
> >>     active_vm_item_set(old_item);
> >> If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
> >> will change lots of code in the RCU subsystem. I'm not sure if it is
> >> worth it.
> >
> > (+Cc rcu folks)
> >
> > IMO adding new kfree_rcu() varient for BPF that accounts BPF memory
> > usage would be much less churn :)
>
> Alternatively, just account the bpf memory as freed already when calling
> kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
> a separate issue (if it's actually an issue) and not something each
> kfree_rcu() user should think about separately?
>

I'm not sure whether it is a problem for other users as well, but I can
explain why it is an issue for BPF accounting.
For BPF accounting we need to store something in the task_struct and
use it later, so if the 'storer' and the 'user' are different tasks,
the information is lost.
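
To illustrate (simplified; the exact field and helper names in the
patchset may differ):

    /* The series adds, roughly, a per-task field naming the currently
     * active accounting item.
     */
    static inline int active_vm_item_set(int item)
    {
            int old = current->active_vm_item;

            current->active_vm_item = item;
            return old;
    }

    /* The mm hooks read current->active_vm_item to decide which
     * counter to update.  A deferred kfree() from an RCU callback runs
     * in a different task (RCU softirq or rcuo kthread), so the value
     * the 'storer' task set around kfree_rcu() is not visible in that
     * 'user' context.
     */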

> >>
> >> >  I thought there was a patchset for a whole
> >> > bfp-specific memory allocator, where accounting would be implemented
> >> > naturally, I would imagine.
> >> >
> >>
> >> I posted a patchset[1] which annotates both allocating and freeing
> >> several months ago.
> >> But unfortunately after more investigation and verification I found
> >> the deferred freeing context is a problem, which can't be resolved
> >> easily.
> >> That's why I finally decided to annotate allocating only.
> >>
> >> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >>
> >> > > To store the information of a slab or a page, we need to create a new
> >> > > member in struct page, but we can do it in page extension which can
> >> > > avoid changing the size of struct page. So a new page extension
> >> > > active_vm is introduced. Each page and each slab which is allocated as
> >> > > BPF memory will have a struct active_vm. The reason it is named as
> >> > > active_vm is that we can extend it to other areas easily, for example in
> >> > > the future we may use it to count other memory usage.
> >> > >
> >> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM at
> >> > > compile time or kernel parameter `active_vm=` at runtime.
> >> >
> >> > The issue with page_ext is the extra memory usage, so it was rather intended
> >> > for debugging features that can be always compiled in, but only enabled at
> >> > runtime when debugging is needed. The overhead is only paid when enabled.
> >> > That's at least the case of page_owner and page_table_check. The 32bit
> >> > page_idle is rather an oddity that could have instead stayed 64-bit only.
> >> >
> >>
> >> Right, it seems currently page_ext is for debugging purposes only.
> >>
> >> > But this is proposing a page_ext functionality supposed to be enabled at all
> >> > times in production, with the goal of improved accounting. Not an on-demand
> >> > debugging. I'm afraid the costs will outweight the benefits.
> >> >
> >>
> >> The memory overhead of this new page extension is (8/4096), which is
> >> 0.2% of total memory. Not too big to be acceptable.
> >
> > It's generally unacceptable to increase sizeof(struct page)
> > (nor enabling page_ext by default, and that's the why page_ext is for
> > debugging purposes only)
> >
> >> If the user really
> >> thinks this overhead is not accepted, he can set "active_vm=off" to
> >> disable it.
> >
> > I'd say many people won't welcome adding 0.2% of total memory by default
> > to get BPF memory usage.
>
> Agreed.
>
> >> To reduce the memory overhead further, I have a bold idea.
> >> Actually we don't need to allocate such a page extension for every
> >> page,  while we only need to allocate it if the user needs to access
> >> it. That said it seems that we can allocate some kind of page
> >> extensions dynamically rather than preallocate at booting, but I
> >> haven't investigated it deeply to check if it can work. What do you
> >> think?
>
> There's lots of benefits (simplicity) of page_ext being allocated as it is
> today.

These benefits are also what limit it to debugging purposes only :)
If we can make it run in production environments, it will be more useful.

>  What you're suggesting will be better solved (in few years :) by
> Matthew's bold ideas about shrinking the current struct page and allocating
> usecase-specific descriptors.
>

So the memory overhead won't be a problem in the future, right?

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo
  2022-12-13 19:21         ` Paul E. McKenney
@ 2022-12-14 10:46           ` Yafang Shao
  0 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-14 10:46 UTC (permalink / raw)
  To: paulmck
  Cc: Vlastimil Babka, Hyeonggon Yoo, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	tj, dennis, cl, akpm, penberg, rientjes, iamjoonsoo.kim,
	roman.gushchin, linux-mm, bpf, rcu, Matthew Wilcox

On Wed, Dec 14, 2022 at 3:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Dec 13, 2022 at 04:52:09PM +0100, Vlastimil Babka wrote:
> > On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> > >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> >
> > >> > On 12/12/22 01:37, Yafang Shao wrote:
> > >> > > Currently there's no way to get BPF memory usage, while we can only
> > >> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > >> > >
> > >> > > - bpftool
> > >> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > >> > >   prog, but the memlock is vary from the real memory size. The memlock
> > >> > >   of a bpf object is approximately
> > >> > >   `round_up(key_size + value_size, 8) * max_entries`,
> > >> > >   so 1) it can't apply to the non-preallocated bpf map which may
> > >> > >   increase or decrease the real memory size dynamically. 2) the element
> > >> > >   size of some bpf map is not `key_size + value_size`, for example the
> > >> > >   element size of htab is
> > >> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >> > >   That said the differece between these two values may be very great if
> > >> > >   the key_size and value_size is small. For example in my verifaction,
> > >> > >   the size of memlock and real memory of a preallocated hash map are,
> > >> > >
> > >> > >   $ grep BPF /proc/meminfo
> > >> > >   BPF:             1026048 B <<< the size of preallocated memalloc pool
> > >> > >
> > >> > >   (create hash map)
> > >> > >
> > >> > >   $ bpftool map show
> > >> > >   3: hash  name count_map  flags 0x0
> > >> > >           key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > >> > >
> > >> > >   $ grep BPF /proc/meminfo
> > >> > >   BPF:            84919344 B
> > >> > >
> > >> > >   So the real memory size is $((84919344 - 1026048)) which is 83893296
> > >> > >   bytes while the memlock is only 8388608 bytes.
> > >> > >
> > >> > > - memcg
> > >> > >   With memcg we only know that the BPF memory usage is less than
> > >> > >   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
> > >> > >   know that the BPF memory usage is less than $MemTotal if the BPF
> > >> > >   object is charged into root memcg :)
> > >> > >
> > >> > > So we need a way to get the BPF memory usage especially there will be
> > >> > > more and more bpf programs running on the production environment. The
> > >> > > memory usage of BPF memory is not trivial, which deserves a new item in
> > >> > > /proc/meminfo.
> > >> > >
> > >> > > This patchset introduce a solution to calculate the BPF memory usage.
> > >> > > This solution is similar to how memory is charged into memcg, so it is
> > >> > > easy to understand. It counts three types of memory usage -
> > >> > >  - page
> > >> > >    via kmalloc, vmalloc, kmem_cache_alloc or alloc pages directly and
> > >> > >    their families.
> > >> > >    When a page is allocated, we will count its size and mark the head
> > >> > >    page, and then check the head page at page freeing.
> > >> > >  - slab
> > >> > >    via kmalloc, kmem_cache_alloc and their families.
> > >> > >    When a slab object is allocated, we will mark this object in this
> > >> > >    slab and check it at slab object freeing. That said we need extra memory
> > >> > >    to store the information of each object in a slab.
> > >> > >  - percpu
> > >> > >    via alloc_percpu and its family.
> > >> > >    When a percpu area is allocated, we will mark this area in this
> > >> > >    percpu chunk and check it at percpu area freeing. That said we need
> > >> > >    extra memory to store the information of each area in a percpu chunk.
> > >> > >
> > >> > > So we only need to annotate the allcation to add the BPF memory size,
> > >> > > and the sub of the BPF memory size will be handled automatically at
> > >> > > freeing. We can annotate it in irq, softirq or process context. To avoid
> > >> > > counting the nested allcations, for example the percpu backing allocator,
> > >> > > we reuse the __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT also make
> > >> > > the count consistent with memcg accounting.
> > >> >
> > >> > So you can't easily annotate the freeing places as well, to avoid the whole
> > >> > tracking infrastructure?
> > >>
> > >> The trouble is kfree_rcu().  for example,
> > >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> > >>     kfree_rcu();
> > >>     active_vm_item_set(old_item);
> > >> If we want to pass the ACTIVE_VM_BPF into the deferred rcu context, we
> > >> will change lots of code in the RCU subsystem. I'm not sure if it is
> > >> worth it.
> > >
> > > (+Cc rcu folks)
> > >
> > > IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> > > usage would be much less churn :)
> >
> > Alternatively, just account the bpf memory as freed already when calling
> > kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
> > a separate issue (if it's actually an issue) and not something each
> > kfree_rcu() user should think about separately?
>
> If the in-flight memory really does need to be accounted for, then one
> straightforward approach is to use call_rcu() and do the first part of
> the needed accounting at the call_rcu() callsite and the rest of the
> accounting when the callback is invoked.  Or, if memory must be freed
> quickly even on ChromeOS and Android, use call_rcu_hurry() instead
> of call_rcu().
>

Right, call_rcu() could make it work; a rough sketch of that approach is
included after the grep listing below.
But I'm not sure whether every kfree_rcu() in kernel/bpf can be replaced
by call_rcu().
Alexei, any comment on this?

$ grep -r "kfree_rcu" kernel/bpf/
kernel/bpf/local_storage.c:     kfree_rcu(new, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(node, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(parent, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(node, rcu);
kernel/bpf/lpm_trie.c:  kfree_rcu(node, rcu);
kernel/bpf/bpf_inode_storage.c:         kfree_rcu(local_storage, rcu);
kernel/bpf/bpf_task_storage.c:          kfree_rcu(local_storage, rcu);
kernel/bpf/trampoline.c:        kfree_rcu(im, rcu);
kernel/bpf/core.c:      kfree_rcu(progs, rcu);
kernel/bpf/core.c:       * no need to call kfree_rcu(), just call kfree() directly.
kernel/bpf/core.c:              kfree_rcu(progs, rcu);
kernel/bpf/bpf_local_storage.c:  * kfree(), else do kfree_rcu().
kernel/bpf/bpf_local_storage.c:         kfree_rcu(local_storage, rcu);
kernel/bpf/bpf_local_storage.c:         kfree_rcu(selem, rcu);
kernel/bpf/bpf_local_storage.c:         kfree_rcu(selem, rcu);
kernel/bpf/bpf_local_storage.c:                 kfree_rcu(local_storage, rcu);
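
For illustration, here is a minimal sketch of what Paul's suggestion could
look like for one of these call sites (the lpm_trie one). Only
active_vm_item_set() and ACTIVE_VM_BPF come from this patchset; the callback
name and the exact split of the accounting between the call site and the
callback are hypothetical:

/*
 * Hypothetical sketch: convert a kfree_rcu(node, rcu) user to call_rcu()
 * so the BPF accounting can be done when the object is actually freed.
 */
static void lpm_trie_node_free_rcu(struct rcu_head *rcu)
{
	struct lpm_trie_node *node = container_of(rcu, struct lpm_trie_node, rcu);
	int old_item;

	/* attribute the deferred kfree() to the BPF active_vm item */
	old_item = active_vm_item_set(ACTIVE_VM_BPF);
	kfree(node);
	active_vm_item_set(old_item);
}

	/* at the call site, instead of kfree_rcu(node, rcu): */
	call_rcu(&node->rcu, lpm_trie_node_free_rcu);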

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage
       [not found]   ` <202212141512.469bca4-yujie.liu@intel.com>
@ 2022-12-14 12:01     ` Yafang Shao
  0 siblings, 0 replies; 19+ messages in thread
From: Yafang Shao @ 2022-12-14 12:01 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, linux-fsdevel, bpf, ast, daniel,
	andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, tj, dennis, cl, akpm, penberg, rientjes,
	iamjoonsoo.kim, vbabka, roman.gushchin, 42.hyeyoo, linux-mm

On Wed, Dec 14, 2022 at 4:48 PM kernel test robot <yujie.liu@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed WARNING:suspicious_RCU_usage due to commit (built with gcc-11):
>
> commit: 8f13ff79ed924e23a36eb5c610ce48998ed69fd5 ("[RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage")
> url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-bpf-Add-BPF-into-proc-meminfo/20221212-083842
> base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git master
> patch link: https://lore.kernel.org/all/20221212003711.24977-10-laoar.shao@gmail.com/
> patch subject: [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> [   31.975760][    T1] WARNING: suspicious RCU usage
> [   31.976682][    T1] 6.1.0-rc7-01609-g8f13ff79ed92 #5 Not tainted
> [   31.977802][    T1] -----------------------------
> [   31.978710][    T1] include/linux/rcupdate.h:376 Illegal context switch in RCU read-side critical section!
> [   31.980465][    T1]
> [   31.980465][    T1] other info that might help us debug this:
> [   31.980465][    T1]
> [   31.982355][    T1]
> [   31.982355][    T1] rcu_scheduler_active = 2, debug_locks = 1
> [   31.983818][    T1] 1 lock held by swapper/0/1:
> [ 31.984695][ T1] #0: ffffffff853269a0 (rcu_read_lock){....}-{1:2}, at: page_ext_get (??:?)
> [   31.986346][    T1]
> [   31.986346][    T1] stack backtrace:
> [   31.987467][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc7-01609-g8f13ff79ed92 #5
> [   31.989054][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
> [   31.990880][    T1] Call Trace:
> [   31.991554][    T1]  <TASK>
> [ 31.992173][ T1] dump_stack_lvl (??:?)
> [ 31.993034][ T1] __might_resched (??:?)
> [ 31.993970][ T1] __kmem_cache_alloc_node (??:?)
> [ 31.994993][ T1] ? active_vm_slab_add (??:?)
> [ 31.995976][ T1] ? active_vm_slab_add (??:?)
> [ 31.996918][ T1] __kmalloc_node (??:?)
> [ 31.997789][ T1] active_vm_slab_add (??:?)
> [ 31.998727][ T1] ? kasan_unpoison (??:?)
> [ 31.999615][ T1] __kmem_cache_alloc_node (??:?)
> [ 32.000615][ T1] ? __bpf_map_area_alloc (syscall.c:?)
> [ 32.001599][ T1] ? __bpf_map_area_alloc (syscall.c:?)
> [ 32.002575][ T1] __kmalloc_node (??:?)
> [ 32.003439][ T1] __bpf_map_area_alloc (syscall.c:?)
> [ 32.004417][ T1] array_map_alloc (arraymap.c:?)
> [ 32.005326][ T1] map_create (syscall.c:?)
> [ 32.006173][ T1] __sys_bpf (syscall.c:?)
> [ 32.006988][ T1] ? link_create (syscall.c:?)
> [ 32.007873][ T1] ? lock_downgrade (lockdep.c:?)
> [ 32.008790][ T1] kern_sys_bpf (??:?)
> [ 32.009636][ T1] ? bpf_sys_bpf (??:?)
> [ 32.010469][ T1] ? trace_hardirqs_on (??:?)
> [ 32.011395][ T1] ? _raw_spin_unlock_irqrestore (??:?)
> [ 32.012432][ T1] ? __stack_depot_save (??:?)
> [ 32.013391][ T1] skel_map_create+0xba/0xeb
> [ 32.014423][ T1] ? skel_map_update_elem+0xe3/0xe3
> [ 32.015527][ T1] ? kasan_save_stack (??:?)
> [ 32.016422][ T1] ? kasan_set_track (??:?)
> [ 32.017308][ T1] ? __kasan_kmalloc (??:?)
> [ 32.018233][ T1] ? kernel_init (main.c:?)
> [ 32.019090][ T1] ? lock_acquire (??:?)
> [ 32.019968][ T1] ? find_held_lock (lockdep.c:?)
> [ 32.020858][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.021875][ T1] bpf_load_and_run+0x93/0x3f5
> [ 32.022920][ T1] ? skel_map_create+0xeb/0xeb
> [ 32.023959][ T1] ? lock_downgrade (lockdep.c:?)
> [ 32.024885][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.025919][ T1] ? load_skel (bpf_preload_kern.c:?)
> [ 32.026767][ T1] ? rcu_read_lock_sched_held (??:?)
> [ 32.027781][ T1] ? __kmalloc_node (??:?)
> [ 32.030065][ T1] load_skel (bpf_preload_kern.c:?)
> [ 32.030869][ T1] ? bpf_load_and_run+0x3f5/0x3f5
> [ 32.031963][ T1] ? kvm_clock_get_cycles (kvmclock.c:?)
> [ 32.032914][ T1] ? btf_vmlinux_init (bpf_preload_kern.c:?)
> [ 32.033801][ T1] load (bpf_preload_kern.c:?)
> [ 32.034501][ T1] ? btf_vmlinux_init (bpf_preload_kern.c:?)
> [ 32.035407][ T1] do_one_initcall (??:?)
> [ 32.036266][ T1] ? trace_event_raw_event_initcall_level (??:?)
> [ 32.037446][ T1] ? parse_one (??:?)
> [ 32.038320][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.039369][ T1] do_initcalls (main.c:?)
> [ 32.040314][ T1] kernel_init_freeable (main.c:?)
> [ 32.041304][ T1] ? console_on_rootfs (main.c:?)
> [ 32.042213][ T1] ? usleep_range_state (??:?)
> [ 32.043197][ T1] ? rest_init (main.c:?)
> [ 32.044036][ T1] ? rest_init (main.c:?)
> [ 32.044879][ T1] kernel_init (main.c:?)
> [ 32.045715][ T1] ret_from_fork (??:?)
> [   32.046587][    T1]  </TASK>
> [   32.047273][    T1] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
> [   32.048966][    T1] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
> [   32.050596][    T1] preempt_count: 1, expected: 0
> [   32.051521][    T1] 1 lock held by swapper/0/1:
> [ 32.052424][ T1] #0: ffffffff853269a0 (rcu_read_lock){....}-{1:2}, at: page_ext_get (??:?)
> [   32.054113][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc7-01609-g8f13ff79ed92 #5
> [   32.055686][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
> [   32.057527][    T1] Call Trace:
> [   32.058191][    T1]  <TASK>
> [ 32.058803][ T1] dump_stack_lvl (??:?)
> [ 32.059668][ T1] __might_resched.cold (core.c:?)
> [ 32.060638][ T1] __kmem_cache_alloc_node (??:?)
> [ 32.061654][ T1] ? active_vm_slab_add (??:?)
> [ 32.062615][ T1] ? active_vm_slab_add (??:?)
> [ 32.063557][ T1] __kmalloc_node (??:?)
> [ 32.064421][ T1] active_vm_slab_add (??:?)
> [ 32.065373][ T1] ? kasan_unpoison (??:?)
> [ 32.066294][ T1] __kmem_cache_alloc_node (??:?)
> [ 32.067294][ T1] ? __bpf_map_area_alloc (syscall.c:?)
> [ 32.068314][ T1] ? __bpf_map_area_alloc (syscall.c:?)
> [ 32.069306][ T1] __kmalloc_node (??:?)
> [ 32.070215][ T1] __bpf_map_area_alloc (syscall.c:?)
> [ 32.071210][ T1] array_map_alloc (arraymap.c:?)
> [ 32.072134][ T1] map_create (syscall.c:?)
> [ 32.072972][ T1] __sys_bpf (syscall.c:?)
> [ 32.073810][ T1] ? link_create (syscall.c:?)
> [ 32.074693][ T1] ? lock_downgrade (lockdep.c:?)
> [ 32.075609][ T1] kern_sys_bpf (??:?)
> [ 32.076455][ T1] ? bpf_sys_bpf (??:?)
> [ 32.077295][ T1] ? trace_hardirqs_on (??:?)
> [ 32.078232][ T1] ? _raw_spin_unlock_irqrestore (??:?)
> [ 32.079288][ T1] ? __stack_depot_save (??:?)
> [ 32.080258][ T1] skel_map_create+0xba/0xeb
> [ 32.081264][ T1] ? skel_map_update_elem+0xe3/0xe3
> [ 32.082356][ T1] ? kasan_save_stack (??:?)
> [ 32.083234][ T1] ? kasan_set_track (??:?)
> [ 32.084107][ T1] ? __kasan_kmalloc (??:?)
> [ 32.085024][ T1] ? kernel_init (main.c:?)
> [ 32.085901][ T1] ? lock_acquire (??:?)
> [ 32.086784][ T1] ? find_held_lock (lockdep.c:?)
> [ 32.087674][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.088715][ T1] bpf_load_and_run+0x93/0x3f5
> [ 32.090649][ T1] ? skel_map_create+0xeb/0xeb
> [ 32.091749][ T1] ? lock_downgrade (lockdep.c:?)
> [ 32.092728][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.093794][ T1] ? load_skel (bpf_preload_kern.c:?)
> [ 32.094612][ T1] ? rcu_read_lock_sched_held (??:?)
> [ 32.095606][ T1] ? __kmalloc_node (??:?)
> [ 32.096490][ T1] load_skel (bpf_preload_kern.c:?)
> [ 32.097314][ T1] ? bpf_load_and_run+0x3f5/0x3f5
> [ 32.098412][ T1] ? kvm_clock_get_cycles (kvmclock.c:?)
> [ 32.099362][ T1] ? btf_vmlinux_init (bpf_preload_kern.c:?)
> [ 32.100271][ T1] load (bpf_preload_kern.c:?)
> [ 32.100966][ T1] ? btf_vmlinux_init (bpf_preload_kern.c:?)
> [ 32.101872][ T1] do_one_initcall (??:?)
> [ 32.102719][ T1] ? trace_event_raw_event_initcall_level (??:?)
> [ 32.103859][ T1] ? parse_one (??:?)
> [ 32.104645][ T1] ? __kmem_cache_alloc_node (??:?)
> [ 32.105625][ T1] do_initcalls (main.c:?)
> [ 32.106438][ T1] kernel_init_freeable (main.c:?)
> [ 32.107333][ T1] ? console_on_rootfs (main.c:?)
> [ 32.108213][ T1] ? usleep_range_state (??:?)
> [ 32.109175][ T1] ? rest_init (main.c:?)
> [ 32.110000][ T1] ? rest_init (main.c:?)
> [ 32.110836][ T1] kernel_init (main.c:?)
> [ 32.111633][ T1] ret_from_fork (??:?)
> [   32.112419][    T1]  </TASK>
> [ 32.144051][ T1] initcall load+0x0/0x4a returned 0 after 169883 usecs
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <yujie.liu@intel.com>
> | Link: https://lore.kernel.org/oe-lkp/202212141512.469bca4-yujie.liu@intel.com
>
>
> To reproduce:
>
>         # build kernel
>         cd linux
>         cp config-6.1.0-rc7-01609-g8f13ff79ed92 .config
>         make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
>         make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
>         cd <mod-install-dir>
>         find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
>
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
>
>         # if come across any failure that blocks the test,
>         # please remove ~/.lkp and /lkp dir to run from a clean state.
>
>

Many thanks for the report. I should use GFP_ATOMIC to fix it; I missed
the rcu_read_lock() taken in page_ext_get().
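
Concretely, something like the change below in active_vm_slab_add() should
do it. This is only a sketch against my RFC; the variable names are
illustrative and the surrounding context may differ:

	/*
	 * The vector allocation in active_vm_slab_add() runs while
	 * page_ext_get() holds rcu_read_lock(), so it must not sleep:
	 * use GFP_ATOMIC instead of GFP_KERNEL here.
	 */
	vec = kmalloc_node(size, GFP_ATOMIC, nid);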

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-12-14 12:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12  0:37 [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 1/9] mm: Introduce active vm item Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 2/9] mm: Allow using active vm in all contexts Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 3/9] mm: percpu: Account active vm for percpu Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 4/9] mm: slab: Account active vm for slab Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 5/9] mm: Account active vm for page Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 6/9] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 7/9] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 8/9] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
2022-12-12  0:37 ` [RFC PATCH bpf-next 9/9] bpf: Use active vm to account bpf map memory usage Yafang Shao
     [not found]   ` <202212141512.469bca4-yujie.liu@intel.com>
2022-12-14 12:01     ` Yafang Shao
2022-12-12 17:54 ` [RFC PATCH bpf-next 0/9] mm, bpf: Add BPF into /proc/meminfo Vlastimil Babka
2022-12-13 11:52   ` Yafang Shao
2022-12-13 14:56     ` Hyeonggon Yoo
2022-12-13 15:52       ` Vlastimil Babka
2022-12-13 19:21         ` Paul E. McKenney
2022-12-14 10:46           ` Yafang Shao
2022-12-14 10:43         ` Yafang Shao
2022-12-14 10:34       ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox