* [PATCH 0/1] alloc_tag: add per-numa node stats
@ 2025-05-30 0:39 Casey Chen
2025-05-30 0:39 ` [PATCH] " Casey Chen
2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet
0 siblings, 2 replies; 20+ messages in thread
From: Casey Chen @ 2025-05-30 0:39 UTC (permalink / raw)
To: linux-mm, surenb, kent.overstreet; +Cc: yzhong, cachen
The patch is based on 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
from the mm-new branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.

The patch adds per-NUMA-node alloc_tag stats. The totals and the per-NUMA-node
bytes/calls are displayed in a single row for each alloc_tag in /proc/allocinfo.
Percpu allocations are also marked, and their stats are stored on NUMA node 0.
For example, the resulting file looks like the following.
percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
...
percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
To save memory, the per-NUMA-node stats counters are allocated dynamically once
the system has booted and knows how many NUMA nodes are available. The percpu
allocator is used for these allocations, hence PERCPU_DYNAMIC_RESERVE is increased.

For in-kernel alloc_tags, pcpu_alloc_noprof() is called so that the memory for
these counters is not accounted in the profiling stats.

For loadable modules, __alloc_percpu_gfp() is called and the memory is accounted:

percpu y total 17024 532 numa0 17024 532 numa1 0 0 lib/alloc_tag.c:564 func:load_module
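As a rough illustration of how userspace might consume the new layout, here is a
minimal sketch (illustrative only, not part of this patch). The field order is
assumed from the example output above, and sysconf(_SC_NPROCESSORS_CONF) is used
as a stand-in for the number of possible CPUs:

```
/*
 * Minimal userspace sketch (illustrative only, not part of this patch).
 * It assumes the row layout shown above:
 *   percpu <y|n> total <bytes> <calls> numa0 <bytes> <calls> ... <file>:<line> func:<name>
 * and uses the number of configured CPUs as a stand-in for possible CPUs.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/proc/allocinfo", "r");
	long ncpus = sysconf(_SC_NPROCESSORS_CONF);
	char line[1024];

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char pc;
		unsigned long long bytes, calls;

		/* Skip the header and anything that doesn't look like a stats row. */
		if (sscanf(line, "percpu %c total %llu %llu", &pc, &bytes, &calls) != 3)
			continue;

		/* For percpu rows, 'bytes' is per-CPU; scale it to a system-wide total. */
		if (pc == 'y')
			bytes *= (unsigned long long)ncpus;

		printf("%llu bytes in %llu calls\n", bytes, calls);
	}
	fclose(f);
	return 0;
}
```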
Casey Chen (1):
alloc_tag: add per-numa node stats
include/linux/alloc_tag.h | 48 +++++++++++++++++++++++++++------------
include/linux/codetag.h | 4 ++++
include/linux/percpu.h | 2 +-
lib/alloc_tag.c | 41 +++++++++++++++++++++++++++------
mm/page_alloc.c | 35 ++++++++++++++--------------
mm/percpu.c | 8 +++++--
mm/show_mem.c | 21 ++++++++++++-----
mm/slub.c | 10 +++++---
8 files changed, 119 insertions(+), 50 deletions(-)
--
2.34.1
* [PATCH] alloc_tag: add per-numa node stats
2025-05-30 0:39 [PATCH 0/1] alloc_tag: add per-numa node stats Casey Chen
@ 2025-05-30 0:39 ` Casey Chen
2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet
1 sibling, 0 replies; 20+ messages in thread
From: Casey Chen @ 2025-05-30 0:39 UTC (permalink / raw)
To: linux-mm, surenb, kent.overstreet; +Cc: yzhong, cachen
Add per-NUMA-node stats for each alloc_tag. Previously there was only one
alloc_tag_counters per CPU; now each CPU has one per NUMA node. The totals
and the per-node bytes/calls are displayed together in a single row for
each alloc_tag in /proc/allocinfo.

Note that per-NUMA-node stats don't make sense for percpu allocations: the
NUMA nodes backing per-CPU memory vary, and each CPU usually gets its copy
from its local node. Since there is no way to attribute these stats to a
node per CPU, they are all stored on node 0. Also, the 'bytes' field is
just the amount needed by a single CPU; to get the total bytes, multiply it
by the number of possible CPUs. A boolean 'percpu' field is added to mark
all percpu allocations in /proc/allocinfo.

To minimize memory usage, the alloc_tag stats counters are dynamically
allocated with the percpu allocator. PERCPU_DYNAMIC_RESERVE is increased
to accommodate the counters for in-kernel alloc_tags.

For in-kernel alloc_tags, pcpu_alloc_noprof() is called to allocate the
stats counters, so this memory is not accounted in the profiling stats.
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
---
include/linux/alloc_tag.h | 49 ++++++++++++++++++++++++++++-----------
include/linux/codetag.h | 4 ++++
include/linux/percpu.h | 2 +-
lib/alloc_tag.c | 43 ++++++++++++++++++++++++++++------
mm/page_alloc.c | 35 ++++++++++++++--------------
mm/percpu.c | 8 +++++--
mm/show_mem.c | 20 ++++++++++++----
mm/slub.c | 11 ++++++---
8 files changed, 123 insertions(+), 49 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 8f7931eb7d16..99d4a1823e51 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -15,6 +15,8 @@
#include <linux/static_key.h>
#include <linux/irqflags.h>
+extern int num_numa_nodes;
+
struct alloc_tag_counters {
u64 bytes;
u64 calls;
@@ -134,16 +136,34 @@ static inline bool mem_alloc_profiling_enabled(void)
&mem_alloc_profiling_key);
}
+static inline struct alloc_tag_counters alloc_tag_read_nid(struct alloc_tag *tag, int nid)
+{
+ struct alloc_tag_counters v = { 0, 0 };
+ struct alloc_tag_counters *counters;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ counters = per_cpu_ptr(tag->counters, cpu);
+ v.bytes += counters[nid].bytes;
+ v.calls += counters[nid].calls;
+ }
+
+ return v;
+}
+
static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
{
struct alloc_tag_counters v = { 0, 0 };
- struct alloc_tag_counters *counter;
+ struct alloc_tag_counters *counters;
int cpu;
+ int nid;
for_each_possible_cpu(cpu) {
- counter = per_cpu_ptr(tag->counters, cpu);
- v.bytes += counter->bytes;
- v.calls += counter->calls;
+ counters = per_cpu_ptr(tag->counters, cpu);
+ for (nid = 0; nid < num_numa_nodes; nid++) {
+ v.bytes += counters[nid].bytes;
+ v.calls += counters[nid].calls;
+ }
}
return v;
@@ -179,7 +199,7 @@ static inline bool __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag
return true;
}
-static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
+static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag, int nid)
{
if (unlikely(!__alloc_tag_ref_set(ref, tag)))
return false;
@@ -190,17 +210,18 @@ static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *t
* Each new reference for every sub-allocation needs to increment call
* counter because when we free each part the counter will be decremented.
*/
- this_cpu_inc(tag->counters->calls);
+ this_cpu_inc(tag->counters[nid].calls);
return true;
}
-static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ int nid, size_t bytes)
{
- if (likely(alloc_tag_ref_set(ref, tag)))
- this_cpu_add(tag->counters->bytes, bytes);
+ if (likely(alloc_tag_ref_set(ref, tag, nid)))
+ this_cpu_add(tag->counters[nid].bytes, bytes);
}
-static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+static inline void alloc_tag_sub(union codetag_ref *ref, int nid, size_t bytes)
{
struct alloc_tag *tag;
@@ -215,8 +236,8 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
tag = ct_to_alloc_tag(ref->ct);
- this_cpu_sub(tag->counters->bytes, bytes);
- this_cpu_dec(tag->counters->calls);
+ this_cpu_sub(tag->counters[nid].bytes, bytes);
+ this_cpu_dec(tag->counters[nid].calls);
ref->ct = NULL;
}
@@ -228,8 +249,8 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
#define DEFINE_ALLOC_TAG(_alloc_tag)
static inline bool mem_alloc_profiling_enabled(void) { return false; }
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
- size_t bytes) {}
-static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+ int nid, size_t bytes) {}
+static inline void alloc_tag_sub(union codetag_ref *ref, int nid, size_t bytes) {}
#define alloc_tag_record(p) do {} while (0)
#endif /* CONFIG_MEM_ALLOC_PROFILING */
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 5f2b9a1f722c..79d6b96c61f6 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -16,6 +16,10 @@ struct module;
#define CODETAG_SECTION_START_PREFIX "__start_"
#define CODETAG_SECTION_STOP_PREFIX "__stop_"
+enum codetag_flags {
+ CODETAG_PERCPU_ALLOC = (1 << 0), /* codetag tracking percpu allocation */
+};
+
/*
* An instance of this structure is created in a special ELF section at every
* code location being tagged. At runtime, the special section is treated as
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 85bf8dd9f087..d92c27fbcd0d 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -43,7 +43,7 @@
# define PERCPU_DYNAMIC_SIZE_SHIFT 12
#endif /* LOCKDEP and PAGE_SIZE > 4KiB */
#else
-#define PERCPU_DYNAMIC_SIZE_SHIFT 10
+#define PERCPU_DYNAMIC_SIZE_SHIFT 13
#endif
/*
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index d48b80f3f007..b4d2d5663c4c 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -42,6 +42,9 @@ struct allocinfo_private {
bool print_header;
};
+int num_numa_nodes;
+static unsigned long pcpu_counters_size;
+
static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
struct allocinfo_private *priv;
@@ -95,9 +98,16 @@ static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
{
struct alloc_tag *tag = ct_to_alloc_tag(ct);
struct alloc_tag_counters counter = alloc_tag_read(tag);
- s64 bytes = counter.bytes;
+ int nid;
+
+ seq_buf_printf(out, "percpu %c total %12lli %8llu ",
+ ct->flags & CODETAG_PERCPU_ALLOC ? 'y' : 'n',
+ counter.bytes, counter.calls);
+ for (nid = 0; nid < num_numa_nodes; nid++) {
+ counter = alloc_tag_read_nid(tag, nid);
+ seq_buf_printf(out, "numa%d %12lli %8llu ", nid, counter.bytes, counter.calls);
+ }
- seq_buf_printf(out, "%12lli %8llu ", bytes, counter.calls);
codetag_to_text(out, ct);
seq_buf_putc(out, ' ');
seq_buf_putc(out, '\n');
@@ -184,7 +194,7 @@ void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
if (get_page_tag_ref(folio_page(folio, i), &ref, &handle)) {
/* Set new reference to point to the original tag */
- alloc_tag_ref_set(&ref, tag);
+ alloc_tag_ref_set(&ref, tag, folio_nid(folio));
update_page_tag_ref(handle, &ref);
put_page_tag_ref(handle);
}
@@ -247,19 +257,36 @@ static void shutdown_mem_profiling(bool remove_file)
void __init alloc_tag_sec_init(void)
{
struct alloc_tag *last_codetag;
+ int i;
if (!mem_profiling_support)
return;
- if (!static_key_enabled(&mem_profiling_compressed))
- return;
-
kernel_tags.first_tag = (struct alloc_tag *)kallsyms_lookup_name(
SECTION_START(ALLOC_TAG_SECTION_NAME));
last_codetag = (struct alloc_tag *)kallsyms_lookup_name(
SECTION_STOP(ALLOC_TAG_SECTION_NAME));
kernel_tags.count = last_codetag - kernel_tags.first_tag;
+ num_numa_nodes = num_possible_nodes();
+ pcpu_counters_size = num_numa_nodes * sizeof(struct alloc_tag_counters);
+ for (i = 0; i < kernel_tags.count; i++) {
+ /* Each CPU has one counter per numa node */
+ kernel_tags.first_tag[i].counters =
+ pcpu_alloc_noprof(pcpu_counters_size,
+ sizeof(struct alloc_tag_counters),
+ false, GFP_KERNEL | __GFP_ZERO);
+ if (!kernel_tags.first_tag[i].counters) {
+ while (--i >= 0)
+ free_percpu(kernel_tags.first_tag[i].counters);
+ pr_info("Failed to allocate per-cpu alloc_tag counters\n");
+ return;
+ }
+ }
+
+ if (!static_key_enabled(&mem_profiling_compressed))
+ return;
+
/* Check if kernel tags fit into page flags */
if (kernel_tags.count > (1UL << NR_UNUSED_PAGEFLAG_BITS)) {
shutdown_mem_profiling(false); /* allocinfo file does not exist yet */
@@ -622,7 +649,9 @@ static int load_module(struct module *mod, struct codetag *start, struct codetag
stop_tag = ct_to_alloc_tag(stop);
for (tag = start_tag; tag < stop_tag; tag++) {
WARN_ON(tag->counters);
- tag->counters = alloc_percpu(struct alloc_tag_counters);
+ tag->counters = __alloc_percpu_gfp(pcpu_counters_size,
+ sizeof(struct alloc_tag_counters),
+ GFP_KERNEL | __GFP_ZERO);
if (!tag->counters) {
while (--tag >= start_tag) {
free_percpu(tag->counters);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 90b06f3d004c..8219d8de6f97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1107,58 +1107,59 @@ void __clear_page_tag_ref(struct page *page)
/* Should be called only if mem_alloc_profiling_enabled() */
static noinline
void __pgalloc_tag_add(struct page *page, struct task_struct *task,
- unsigned int nr)
+ int nid, unsigned int nr)
{
union pgtag_ref_handle handle;
union codetag_ref ref;
if (get_page_tag_ref(page, &ref, &handle)) {
- alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr);
+ alloc_tag_add(&ref, task->alloc_tag, nid, PAGE_SIZE * nr);
update_page_tag_ref(handle, &ref);
put_page_tag_ref(handle);
}
}
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
- unsigned int nr)
+ int nid, unsigned int nr)
{
if (mem_alloc_profiling_enabled())
- __pgalloc_tag_add(page, task, nr);
+ __pgalloc_tag_add(page, task, nid, nr);
}
/* Should be called only if mem_alloc_profiling_enabled() */
static noinline
-void __pgalloc_tag_sub(struct page *page, unsigned int nr)
+void __pgalloc_tag_sub(struct page *page, int nid, unsigned int nr)
{
union pgtag_ref_handle handle;
union codetag_ref ref;
if (get_page_tag_ref(page, &ref, &handle)) {
- alloc_tag_sub(&ref, PAGE_SIZE * nr);
+ alloc_tag_sub(&ref, nid, PAGE_SIZE * nr);
update_page_tag_ref(handle, &ref);
put_page_tag_ref(handle);
}
}
-static inline void pgalloc_tag_sub(struct page *page, unsigned int nr)
+static inline void pgalloc_tag_sub(struct page *page, int nid, unsigned int nr)
{
if (mem_alloc_profiling_enabled())
- __pgalloc_tag_sub(page, nr);
+ __pgalloc_tag_sub(page, nid, nr);
}
/* When tag is not NULL, assuming mem_alloc_profiling_enabled */
-static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
+static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag,
+ int nid, unsigned int nr)
{
if (tag)
- this_cpu_sub(tag->counters->bytes, PAGE_SIZE * nr);
+ this_cpu_sub(tag->counters[nid].bytes, PAGE_SIZE * nr);
}
#else /* CONFIG_MEM_ALLOC_PROFILING */
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
- unsigned int nr) {}
-static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {}
-static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
+ int nid, unsigned int nr) {}
+static inline void pgalloc_tag_sub(struct page *page, int nid, unsigned int nr) {}
+static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, int nid, unsigned int nr) {}
#endif /* CONFIG_MEM_ALLOC_PROFILING */
@@ -1197,7 +1198,7 @@ __always_inline bool free_pages_prepare(struct page *page,
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
- pgalloc_tag_sub(page, 1 << order);
+ pgalloc_tag_sub(page, page_to_nid(page), 1 << order);
/*
* The page is isolated and accounted for.
@@ -1251,7 +1252,7 @@ __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
- pgalloc_tag_sub(page, 1 << order);
+ pgalloc_tag_sub(page, page_to_nid(page), 1 << order);
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1707,7 +1708,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
- pgalloc_tag_add(page, current, 1 << order);
+ pgalloc_tag_add(page, current, page_to_nid(page), 1 << order);
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -5064,7 +5065,7 @@ static void ___free_pages(struct page *page, unsigned int order,
if (put_page_testzero(page))
__free_frozen_pages(page, order, fpi_flags);
else if (!head) {
- pgalloc_tag_sub_pages(tag, (1 << order) - 1);
+ pgalloc_tag_sub_pages(tag, page_to_nid(page), (1 << order) - 1);
while (order-- > 0)
__free_frozen_pages(page + (1 << order), order,
fpi_flags);
diff --git a/mm/percpu.c b/mm/percpu.c
index b35494c8ede2..130450e9718e 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1691,15 +1691,19 @@ static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
size_t size)
{
if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+ /* For percpu allocation, store all alloc_tag stats on numa node 0 */
alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
- current->alloc_tag, size);
+ current->alloc_tag, 0, size);
+ if (current->alloc_tag)
+ current->alloc_tag->ct.flags |= CODETAG_PERCPU_ALLOC;
}
}
static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
{
+ /* percpu alloc_tag stats are stored on numa node 0, so subtract from node 0 */
if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
- alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+ alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, 0, size);
}
#else
static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 03e8d968fd1a..132b3aa82d83 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -5,6 +5,7 @@
* Copyright (C) 2008 Johannes Weiner <hannes@saeurebad.de>
*/
+#include <linux/alloc_tag.h>
#include <linux/blkdev.h>
#include <linux/cma.h>
#include <linux/cpuset.h>
@@ -433,18 +434,27 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
struct alloc_tag *tag = ct_to_alloc_tag(ct);
struct alloc_tag_counters counter = alloc_tag_read(tag);
char bytes[10];
+ int nid;
string_get_size(counter.bytes, 1, STRING_UNITS_2, bytes, sizeof(bytes));
+ pr_notice("percpu %c total %12s %8llu ",
+ ct->flags & CODETAG_PERCPU_ALLOC ? 'y' : 'n',
+ bytes, counter.calls);
+
+ for (nid = 0; nid < num_numa_nodes; nid++) {
+ counter = alloc_tag_read_nid(tag, nid);
+ string_get_size(counter.bytes, 1, STRING_UNITS_2,
+ bytes, sizeof(bytes));
+ pr_notice("numa%d %12s %8llu ", nid, bytes, counter.calls);
+ }
/* Same as alloc_tag_to_text() but w/o intermediate buffer */
if (ct->modname)
- pr_notice("%12s %8llu %s:%u [%s] func:%s\n",
- bytes, counter.calls, ct->filename,
+ pr_notice("%s:%u [%s] func:%s\n", ct->filename,
ct->lineno, ct->modname, ct->function);
else
- pr_notice("%12s %8llu %s:%u func:%s\n",
- bytes, counter.calls, ct->filename,
- ct->lineno, ct->function);
+ pr_notice("%s:%u func:%s\n",
+ ct->filename, ct->lineno, ct->function);
}
}
}
diff --git a/mm/slub.c b/mm/slub.c
index be8b09e09d30..068b88b85d80 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2104,8 +2104,12 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
* If other users appear then mem_alloc_profiling_enabled()
* check should be added before alloc_tag_add().
*/
- if (likely(obj_exts))
- alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
+ if (likely(obj_exts)) {
+ struct page *page = virt_to_page(object);
+
+ alloc_tag_add(&obj_exts->ref, current->alloc_tag,
+ page_to_nid(page), s->size);
+ }
}
static inline void
@@ -2133,8 +2137,9 @@ __alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p
for (i = 0; i < objects; i++) {
unsigned int off = obj_to_index(s, slab, p[i]);
+ struct page *page = virt_to_page(p[i]);
- alloc_tag_sub(&obj_exts[off].ref, s->size);
+ alloc_tag_sub(&obj_exts[off].ref, page_to_nid(page), s->size);
}
}
--
2.34.1
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-05-30 0:39 [PATCH 0/1] alloc_tag: add per-numa node stats Casey Chen
2025-05-30 0:39 ` [PATCH] " Casey Chen
@ 2025-05-30 1:11 ` Kent Overstreet
2025-05-30 21:45 ` Casey Chen
1 sibling, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2025-05-30 1:11 UTC (permalink / raw)
To: Casey Chen; +Cc: linux-mm, surenb, yzhong
On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>
> The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> Also percpu allocation is marked and its stats is stored on NUMA node 0.
> For example, the resulting file looks like below.
>
> percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> ...
> percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
Err, what is 'percpu y/n'?
>
> To save memory, we dynamically allocate per-NUMA node stats counter once the
> system boots up and knows how many NUMA nodes available. percpu allocators
> are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
>
> For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> these counters are not accounted in profiling stats.
>
> For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
Intriguing, but I'd make it a kconfig option; AFAIK this would mainly be
of interest to people looking at optimizing allocations to make sure
they're on the right numa node?
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet
@ 2025-05-30 21:45 ` Casey Chen
2025-05-31 0:05 ` Kent Overstreet
0 siblings, 1 reply; 20+ messages in thread
From: Casey Chen @ 2025-05-30 21:45 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-mm, surenb, yzhong
On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >
> > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > For example, the resulting file looks like below.
> >
> > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > ...
> > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
>
> Err, what is 'percpu y/n'?
>
Percpu allocations are marked with 'percpu y/n' because for percpu allocation
stats the 'bytes' value is per-CPU; we have to multiply it by the number of
CPUs to get the total bytes. Marking them lets us know the exact amount of
memory used, and any /proc/allocinfo parser can understand the flag and make
the correct calculation.
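For instance, on a hypothetical 64-CPU machine (an illustrative figure, not
taken from the output above), the alloc_desc row 'percpu y total 8588 2147'
would correspond to 8588 * 64 = 549,632 bytes in total, while the call count
(2147) needs no scaling.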
> >
> > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > system boots up and knows how many NUMA nodes available. percpu allocators
> > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> >
> > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > these counters are not accounted in profiling stats.
> >
> > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
>
> Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> of interest to people looking at optimizing allocations to make sure
> they're on the right numa node?
Yes, to help us know whether there is a NUMA imbalance issue and make some
optimizations. I can make it a kconfig option. Does anybody else have an
opinion about this feature? Thanks!
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-05-30 21:45 ` Casey Chen
@ 2025-05-31 0:05 ` Kent Overstreet
2025-06-02 20:48 ` Casey Chen
0 siblings, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2025-05-31 0:05 UTC (permalink / raw)
To: Casey Chen; +Cc: linux-mm, surenb, yzhong
On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > >
> > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > For example, the resulting file looks like below.
> > >
> > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > ...
> > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> >
> > Err, what is 'percpu y/n'?
> >
>
> Mark percpu allocation with 'percpu y/n' because for percpu allocation
> stats, 'bytes' is per-cpu, we have to multiply it by the number of
> CPUs to get the total bytes. Mark it so we know the exact amount of
> memory used. Any /proc/allocinfo parser can understand it and make
> correct calculations.
Ok, just wanted to be sure it wasn't something else. Let's shorten that,
though: a single character should suffice (we already have a header that
can explain what it is) - if you're growing the width we don't want to
overflow.
>
> > >
> > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > >
> > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > these counters are not accounted in profiling stats.
> > >
> > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> >
> > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > of interest to people looking at optimizing allocations to make sure
> > they're on the right numa node?
>
> Yes, to help us know if there is an NUMA imbalance issue and make some
> optimizations. I can make it a kconfig. Does anybody else have any
> opinion about this feature ? Thanks!
I would like to see some other opinions from potential users; have you
been circulating it?
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-05-31 0:05 ` Kent Overstreet
@ 2025-06-02 20:48 ` Casey Chen
2025-06-02 21:32 ` Suren Baghdasaryan
2025-06-02 21:52 ` Kent Overstreet
0 siblings, 2 replies; 20+ messages in thread
From: Casey Chen @ 2025-06-02 20:48 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-mm, surenb, yzhong
On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > >
> > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > For example, the resulting file looks like below.
> > > >
> > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > ...
> > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > >
> > > Err, what is 'percpu y/n'?
> > >
> >
> > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > CPUs to get the total bytes. Mark it so we know the exact amount of
> > memory used. Any /proc/allocinfo parser can understand it and make
> > correct calculations.
>
> Ok, just wanted to be sure it wasn't something else. Let's shorten that
> though, a single character should suffice (we already have a header that
> can explain what it is) - if you're growing the width we don't want to
> overflow.
>
Does it have a header?
> >
> > > >
> > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > >
> > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > these counters are not accounted in profiling stats.
> > > >
> > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > >
> > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > of interest to people looking at optimizing allocations to make sure
> > > they're on the right numa node?
> >
> > Yes, to help us know if there is an NUMA imbalance issue and make some
> > optimizations. I can make it a kconfig. Does anybody else have any
> > opinion about this feature ? Thanks!
>
> I would like to see some other opinions from potential users, have you
> been circulating it?
We have been using it internally for a while. I don't know who the
potential users are or how to reach them, so I am sharing it here to
collect opinions from others.
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 20:48 ` Casey Chen
@ 2025-06-02 21:32 ` Suren Baghdasaryan
2025-06-03 15:00 ` Suren Baghdasaryan
2025-06-03 20:00 ` Casey Chen
2025-06-02 21:52 ` Kent Overstreet
1 sibling, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-02 21:32 UTC (permalink / raw)
To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong
On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
>
> On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > >
> > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > For example, the resulting file looks like below.
> > > > >
> > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > ...
> > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > >
> > > > Err, what is 'percpu y/n'?
> > > >
> > >
> > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > memory used. Any /proc/allocinfo parser can understand it and make
> > > correct calculations.
> >
> > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > though, a single character should suffice (we already have a header that
> > can explain what it is) - if you're growing the width we don't want to
> > overflow.
> >
>
> Does it have a header ?
Yes. See print_allocinfo_header().
>
> > >
> > > > >
> > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > >
> > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > these counters are not accounted in profiling stats.
> > > > >
> > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > >
> > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > of interest to people looking at optimizing allocations to make sure
> > > > they're on the right numa node?
> > >
> > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > optimizations. I can make it a kconfig. Does anybody else have any
> > > opinion about this feature ? Thanks!
> >
> > I would like to see some other opinions from potential users, have you
> > been circulating it?
>
> We have been using it internally for a while. I don't know who the
> potential users are and how to reach them so I am sharing it here to
> collect opinions from others.
This should definitely have a separate Kconfig option. Have you measured
the memory and performance overhead of this change?
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 20:48 ` Casey Chen
2025-06-02 21:32 ` Suren Baghdasaryan
@ 2025-06-02 21:52 ` Kent Overstreet
2025-06-02 22:08 ` Steven Rostedt
1 sibling, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2025-06-02 21:52 UTC (permalink / raw)
To: Casey Chen
Cc: linux-mm, surenb, yzhong, Steven Rostedt, Peter Zijlstra,
Ingo Molnar
+cc Steven, Peter, Ingo
On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote:
> On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > >
> > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > For example, the resulting file looks like below.
> > > > >
> > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > ...
> > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > >
> > > > Err, what is 'percpu y/n'?
> > > >
> > >
> > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > memory used. Any /proc/allocinfo parser can understand it and make
> > > correct calculations.
> >
> > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > though, a single character should suffice (we already have a header that
> > can explain what it is) - if you're growing the width we don't want to
> > overflow.
> >
>
> Does it have a header ?
>
> > >
> > > > >
> > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > >
> > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > these counters are not accounted in profiling stats.
> > > > >
> > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > >
> > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > of interest to people looking at optimizing allocations to make sure
> > > > they're on the right numa node?
> > >
> > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > optimizations. I can make it a kconfig. Does anybody else have any
> > > opinion about this feature ? Thanks!
> >
> > I would like to see some other opinions from potential users, have you
> > been circulating it?
>
> We have been using it internally for a while. I don't know who the
> potential users are and how to reach them so I am sharing it here to
> collect opinions from others.
I'd ask the tracing and profiling people for their thoughts, and anyone
working on tooling that might consume this.
I'm wondering if there might be some way of feeding more info into perf,
since profiling cache misses is a big thing that it does.
It might be a long shot, since we're just accounting usage, or it might
spark some useful ideas.
Can you share a bit about how you're using this internally?
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 21:52 ` Kent Overstreet
@ 2025-06-02 22:08 ` Steven Rostedt
2025-06-02 23:35 ` Kent Overstreet
0 siblings, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2025-06-02 22:08 UTC (permalink / raw)
To: Kent Overstreet
Cc: Casey Chen, linux-mm, surenb, yzhong, Peter Zijlstra, Ingo Molnar,
Namhyung Kim, Masami Hiramatsu, Arnaldo Carvalho de Melo,
Ian Rogers
On Mon, 2 Jun 2025 17:52:49 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:
> +cc Steven, Peter, Ingo
>
> On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote:
> > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > >
> > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > For example, the resulting file looks like below.
> > > > > >
> > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > ...
> > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > >
> > > > > Err, what is 'percpu y/n'?
> > > > >
> > > >
> > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > correct calculations.
> > >
> > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > though, a single character should suffice (we already have a header that
> > > can explain what it is) - if you're growing the width we don't want to
> > > overflow.
> > >
> >
> > Does it have a header ?
> >
> > > >
> > > > > >
> > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > >
> > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > these counters are not accounted in profiling stats.
> > > > > >
> > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > >
> > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > of interest to people looking at optimizing allocations to make sure
> > > > > they're on the right numa node?
> > > >
> > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > opinion about this feature ? Thanks!
> > >
> > > I would like to see some other opinions from potential users, have you
> > > been circulating it?
> >
> > We have been using it internally for a while. I don't know who the
> > potential users are and how to reach them so I am sharing it here to
> > collect opinions from others.
>
> I'd ask the tracing and profiling people for their thoughts, and anyone
> working on tooling that might consume this.
>
> I'm wondering if there might be some way of feeding more info into perf,
> since profiling cache misses is a big thing that it does.
>
> It might be a long shot, since we're just accounting usage, or it might
> spark some useful ideas.
>
> Can you share a bit about how you're using this internally?
I'm guessing this is to show which functions in the kernel are using memory?
I added to the Cc people who tend to use perf for analysis rather than just
those who maintain the kernel side of perf.
-- Steve
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 22:08 ` Steven Rostedt
@ 2025-06-02 23:35 ` Kent Overstreet
2025-06-03 6:46 ` Ian Rogers
0 siblings, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2025-06-02 23:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: Casey Chen, linux-mm, surenb, yzhong, Peter Zijlstra, Ingo Molnar,
Namhyung Kim, Masami Hiramatsu, Arnaldo Carvalho de Melo,
Ian Rogers
On Mon, Jun 02, 2025 at 06:08:26PM -0400, Steven Rostedt wrote:
> On Mon, 2 Jun 2025 17:52:49 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> > +cc Steven, Peter, Ingo
> >
> > On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote:
> > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > >
> > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > For example, the resulting file looks like below.
> > > > > > >
> > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > ...
> > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > >
> > > > > > Err, what is 'percpu y/n'?
> > > > > >
> > > > >
> > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > correct calculations.
> > > >
> > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > though, a single character should suffice (we already have a header that
> > > > can explain what it is) - if you're growing the width we don't want to
> > > > overflow.
> > > >
> > >
> > > Does it have a header ?
> > >
> > > > >
> > > > > > >
> > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > >
> > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > these counters are not accounted in profiling stats.
> > > > > > >
> > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > >
> > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > they're on the right numa node?
> > > > >
> > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > opinion about this feature ? Thanks!
> > > >
> > > > I would like to see some other opinions from potential users, have you
> > > > been circulating it?
> > >
> > > We have been using it internally for a while. I don't know who the
> > > potential users are and how to reach them so I am sharing it here to
> > > collect opinions from others.
> >
> > I'd ask the tracing and profiling people for their thoughts, and anyone
> > working on tooling that might consume this.
> >
> > I'm wondering if there might be some way of feeding more info into perf,
> > since profiling cache misses is a big thing that it does.
> >
> > It might be a long shot, since we're just accounting usage, or it might
> > spark some useful ideas.
> >
> > Can you share a bit about how you're using this internally?
>
> I'm guessing this is to show where in the kernel functions are using memory?
Exactly
Now that we've got a mapping from address to source location that owns
it, I'm wondering if there's anything else we can do with it.
> I added to the Cc people who tend to use perf for analysis then just having
> those that maintain the kernel side of perf.
Perfect, thanks
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 23:35 ` Kent Overstreet
@ 2025-06-03 6:46 ` Ian Rogers
0 siblings, 0 replies; 20+ messages in thread
From: Ian Rogers @ 2025-06-03 6:46 UTC (permalink / raw)
To: Kent Overstreet
Cc: Steven Rostedt, Casey Chen, linux-mm, surenb, yzhong,
Peter Zijlstra, Ingo Molnar, Namhyung Kim, Masami Hiramatsu,
Arnaldo Carvalho de Melo
On Mon, Jun 2, 2025 at 4:35 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Mon, Jun 02, 2025 at 06:08:26PM -0400, Steven Rostedt wrote:
> > On Mon, 2 Jun 2025 17:52:49 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > > +cc Steven, Peter, Ingo
> > >
> > > On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote:
> > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > >
> > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > > For example, the resulting file looks like below.
> > > > > > > >
> > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > ...
> > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > >
> > > > > > > Err, what is 'percpu y/n'?
> > > > > > >
> > > > > >
> > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > > correct calculations.
> > > > >
> > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > > though, a single character should suffice (we already have a header that
> > > > > can explain what it is) - if you're growing the width we don't want to
> > > > > overflow.
> > > > >
> > > >
> > > > Does it have a header ?
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > > >
> > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > > these counters are not accounted in profiling stats.
> > > > > > > >
> > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > > >
> > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > > they're on the right numa node?
> > > > > >
> > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > > opinion about this feature ? Thanks!
> > > > >
> > > > > I would like to see some other opinions from potential users, have you
> > > > > been circulating it?
> > > >
> > > > We have been using it internally for a while. I don't know who the
> > > > potential users are and how to reach them so I am sharing it here to
> > > > collect opinions from others.
> > >
> > > I'd ask the tracing and profiling people for their thoughts, and anyone
> > > working on tooling that might consume this.
> > >
> > > I'm wondering if there might be some way of feeding more info into perf,
> > > since profiling cache misses is a big thing that it does.
> > >
> > > It might be a long shot, since we're just accounting usage, or it might
> > > spark some useful ideas.
> > >
> > > Can you share a bit about how you're using this internally?
> >
> > I'm guessing this is to show where in the kernel functions are using memory?
>
> Exactly
>
> Now that we've got a mapping from address to source location that owns
> it, I'm wondering if there's anything else we can do with it.
>
> > I added to the Cc people who tend to use perf for analysis then just having
> > those that maintain the kernel side of perf.
>
> Perfect, thanks
This looks nice! In the perf tool we already do some /proc processing
and map the data into what looks like a perf event:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/tool_pmu.c?h=perf-tools-next#n261
```
$ perf stat -e user_time,system_time true
Performance counter stats for 'true':
350,000 user_time
2,054,000 system_time
0.002222811 seconds time elapsed
0.000350000 seconds user
0.002054000 seconds sys
```
There's no reason we can't do memory information; the patch series I
sent adding DRM information (unmerged) contains it:
https://lore.kernel.org/lkml/20250403202439.57791-4-irogers@google.com/
Perf supports per-NUMA node aggregation and even has patches to make
it more accurate on Intel for sub-NUMA systems:
https://lore.kernel.org/lkml/20250515181417.491401-1-irogers@google.com/
It may be there are advantages to having the perf tool only events be
kernel events longer term (supporting sampling being a key one) but
making changes in the tool is fast and convenient. It can be nice to
do things like dump counts every second:
```
$ perf stat -e temp_cpu,fan1 -I 1000
# time counts unit events
1.001152826 34.00 'C temp_cpu
1.001152826 2,570 rpm fan1
2.008358661 34.00 'C temp_cpu
2.008358661 2,572 rpm fan1
3.015209566 34.00 'C temp_cpu
3.015209566 2,570 rpm fan1
...
```
Thanks,
Ian
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 21:32 ` Suren Baghdasaryan
@ 2025-06-03 15:00 ` Suren Baghdasaryan
2025-06-03 17:34 ` Kent Overstreet
2025-06-04 0:55 ` Casey Chen
2025-06-03 20:00 ` Casey Chen
1 sibling, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-03 15:00 UTC (permalink / raw)
To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong
On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> >
> > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > >
> > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > For example, the resulting file looks like below.
> > > > > >
> > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > ...
> > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > >
> > > > > Err, what is 'percpu y/n'?
> > > > >
> > > >
> > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > correct calculations.
> > >
> > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > though, a single character should suffice (we already have a header that
> > > can explain what it is) - if you're growing the width we don't want to
> > > overflow.
> > >
> >
> > Does it have a header ?
>
> Yes. See print_allocinfo_header().
I was thinking if instead of changing /proc/allocinfo format to
contain both total and per-node information we can keep it as is
(containing only totals) while exposing per-node information inside
new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
cleaner to me.
I'm also not a fan of "percpu y" tags as that requires the reader to
know how many CPUs were in the system to make the calculation (you
might get the allocinfo content from a system you have no access to
and no additional information). Maybe we can have "per-cpu bytes" and
"total bytes" columns instead? For per-cpu allocations these will be
different, for all other allocations these two columns will contain
the same number.
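For illustration, a minimal sketch (a hypothetical parser, not
anything in the patch) of what a consumer of the proposed single-row
format has to do, and where the CPU-count dependency bites:
```
# Hypothetical parser for the proposed row format
# "percpu y|n total <bytes> <calls> numa0 <bytes> <calls> ... <tag>".
# Without knowing nr_cpus, the real footprint of "percpu y" rows is unknown.
def total_bytes(line, nr_cpus):
    fields = line.split()
    is_percpu = fields[1] == "y"   # fields[0] == "percpu"
    nbytes = int(fields[3])        # fields[2] == "total"
    return nbytes * nr_cpus if is_percpu else nbytes

line = ("percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 "
        "kernel/irq/irqdesc.c:425 func:alloc_desc")
print(total_bytes(line, nr_cpus=112))  # nr_cpus must come from elsewhere
```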
>
> >
> > > >
> > > > > >
> > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > >
> > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > these counters are not accounted in profiling stats.
> > > > > >
> > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > >
> > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > of interest to people looking at optimizing allocations to make sure
> > > > > they're on the right numa node?
> > > >
> > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > opinion about this feature ? Thanks!
> > >
> > > I would like to see some other opinions from potential users, have you
> > > been circulating it?
> >
> > We have been using it internally for a while. I don't know who the
> > potential users are and how to reach them so I am sharing it here to
> > collect opinions from others.
>
> Should definitely have a separate Kconfig option. Have you measured
> the memory and performance overhead of this change?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-03 15:00 ` Suren Baghdasaryan
@ 2025-06-03 17:34 ` Kent Overstreet
2025-06-04 0:55 ` Casey Chen
1 sibling, 0 replies; 20+ messages in thread
From: Kent Overstreet @ 2025-06-03 17:34 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: Casey Chen, linux-mm, yzhong
On Tue, Jun 03, 2025 at 08:00:59AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > >
> > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > >
> > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > For example, the resulting file looks like below.
> > > > > > >
> > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > ...
> > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > >
> > > > > > Err, what is 'percpu y/n'?
> > > > > >
> > > > >
> > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > correct calculations.
> > > >
> > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > though, a single character should suffice (we already have a header that
> > > > can explain what it is) - if you're growing the width we don't want to
> > > > overflow.
> > > >
> > >
> > > Does it have a header ?
> >
> > Yes. See print_allocinfo_header().
>
> I was thinking if instead of changing /proc/allocinfo format to
> contain both total and per-node information we can keep it as is
> (containing only totals) while exposing per-node information inside
> new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> cleaner to me.
>
> I'm also not a fan of "percpu y" tags as that requires the reader to
> know how many CPUs were in the system to make the calculation (you
> might get the allocinfo content from a system you have no access to
> and no additional information). Maybe we can have "per-cpu bytes" and
> "total bytes" columns instead? For per-cpu allocations these will be
> different, for all other allocations these two columns will contain
> the same number.
Maybe we can just report a single byte count, and multiply it by the
number of CPUs for percpu allocations?
Do we really need to know if a given allocation is percpu that often?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-02 21:32 ` Suren Baghdasaryan
2025-06-03 15:00 ` Suren Baghdasaryan
@ 2025-06-03 20:00 ` Casey Chen
2025-06-03 20:18 ` Suren Baghdasaryan
1 sibling, 1 reply; 20+ messages in thread
From: Casey Chen @ 2025-06-03 20:00 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong
On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> >
> > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > >
> > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > For example, the resulting file looks like below.
> > > > > >
> > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > ...
> > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > >
> > > > > Err, what is 'percpu y/n'?
> > > > >
> > > >
> > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > correct calculations.
> > >
> > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > though, a single character should suffice (we already have a header that
> > > can explain what it is) - if you're growing the width we don't want to
> > > overflow.
> > >
> >
> > Does it have a header ?
>
> Yes. See print_allocinfo_header().
>
> >
> > > >
> > > > > >
> > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > >
> > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > these counters are not accounted in profiling stats.
> > > > > >
> > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > >
> > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > of interest to people looking at optimizing allocations to make sure
> > > > > they're on the right numa node?
> > > >
> > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > opinion about this feature ? Thanks!
> > >
> > > I would like to see some other opinions from potential users, have you
> > > been circulating it?
> >
> > We have been using it internally for a while. I don't know who the
> > potential users are and how to reach them so I am sharing it here to
> > collect opinions from others.
>
> Should definitely have a separate Kconfig option. Have you measured
> the memory and performance overhead of this change?
I can make it a Kconfig option, say
CONFIG_MEM_ALLOC_PER_NUMA_STATS=y/n, which controls the number of
counters per CPU:
if CONFIG_MEM_ALLOC_PER_NUMA_STATS=y, num_counter_percpu =
num_possible_nodes(), else num_counter_percpu = 1.
There is some memory cost: additional memory used = number of
additional NUMA nodes * number of CPUs * number of tags * size of each
counter.
For example, on one of my testbeds, additional memory used = 1
additional node (two nodes in total) * 112 (number of CPUs) * 4540
(number of tags) * 16 (counter size in bytes), ~8 MiB. This testbed
has a total of 755 GiB of memory.
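As a quick sanity check of that formula (the helper below is purely
illustrative; the 16 bytes assume one bytes/calls counter pair):
```
# Back-of-the-envelope estimate of the extra per-NUMA counter memory.
def extra_numa_stats_bytes(nr_nodes, nr_cpus, nr_tags, counter_size=16):
    # Only the additional nodes cost extra memory; one counter per
    # CPU per tag already exists without this change.
    return (nr_nodes - 1) * nr_cpus * nr_tags * counter_size

print(extra_numa_stats_bytes(2, 112, 4540) / (1 << 20))  # ~7.8 MiB
```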
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-03 20:00 ` Casey Chen
@ 2025-06-03 20:18 ` Suren Baghdasaryan
0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-03 20:18 UTC (permalink / raw)
To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong
On Tue, Jun 3, 2025 at 1:01 PM Casey Chen <cachen@purestorage.com> wrote:
>
> On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > >
> > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > >
> > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > For example, the resulting file looks like below.
> > > > > > >
> > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > ...
> > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > >
> > > > > > Err, what is 'percpu y/n'?
> > > > > >
> > > > >
> > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > correct calculations.
> > > >
> > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > though, a single character should suffice (we already have a header that
> > > > can explain what it is) - if you're growing the width we don't want to
> > > > overflow.
> > > >
> > >
> > > Does it have a header ?
> >
> > Yes. See print_allocinfo_header().
> >
> > >
> > > > >
> > > > > > >
> > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > >
> > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > these counters are not accounted in profiling stats.
> > > > > > >
> > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > >
> > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > they're on the right numa node?
> > > > >
> > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > opinion about this feature ? Thanks!
> > > >
> > > > I would like to see some other opinions from potential users, have you
> > > > been circulating it?
> > >
> > > We have been using it internally for a while. I don't know who the
> > > potential users are and how to reach them so I am sharing it here to
> > > collect opinions from others.
> >
> > Should definitely have a separate Kconfig option. Have you measured
> > the memory and performance overhead of this change?
>
> I can make it a Kconfig option, say
> CONFIG_MEM_ALLOC_PER_NUMA_STATS=y/n, which controls the number of
> counters per CPU:
> if CONFIG_MEM_ALLOC_PER_NUMA_STATS=y, num_counter_percpu =
> num_possible_nodes(), else num_counter_percpu = 1.
>
> There is some memory cost: additional memory used = number of
> additional NUMA nodes * number of CPUs * number of tags * size of each
> counter.
> For example, on one of my testbeds, additional memory used = 1
> additional node (two nodes in total) * 112 (number of CPUs) * 4540
> (number of tags) * 16 (counter size in bytes), ~8 MiB. This testbed
> has a total of 755 GiB of memory.
Please add these numbers in the changelog when you post the next version.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-03 15:00 ` Suren Baghdasaryan
2025-06-03 17:34 ` Kent Overstreet
@ 2025-06-04 0:55 ` Casey Chen
2025-06-04 15:21 ` Suren Baghdasaryan
1 sibling, 1 reply; 20+ messages in thread
From: Casey Chen @ 2025-06-04 0:55 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong
On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > >
> > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > >
> > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > For example, the resulting file looks like below.
> > > > > > >
> > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > ...
> > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > >
> > > > > > Err, what is 'percpu y/n'?
> > > > > >
> > > > >
> > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > correct calculations.
> > > >
> > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > though, a single character should suffice (we already have a header that
> > > > can explain what it is) - if you're growing the width we don't want to
> > > > overflow.
> > > >
> > >
> > > Does it have a header ?
> >
> > Yes. See print_allocinfo_header().
>
> I was thinking if instead of changing /proc/allocinfo format to
> contain both total and per-node information we can keep it as is
> (containing only totals) while exposing per-node information inside
> new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> cleaner to me.
>
The output of /sys/devices/system/node/node<node_no>/allocinfo is
strictly limited to a single PAGE_SIZE and it cannot display stats for
all tags.
> I'm also not a fan of "percpu y" tags as that requires the reader to
> know how many CPUs were in the system to make the calculation (you
> might get the allocinfo content from a system you have no access to
> and no additional information). Maybe we can have "per-cpu bytes" and
> "total bytes" columns instead? For per-cpu allocations these will be
> different, for all other allocations these two columns will contain
> the same number.
I plan to remove 'percpu y/n' from this patch and implement it later.
>
> >
> > >
> > > > >
> > > > > > >
> > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > >
> > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > these counters are not accounted in profiling stats.
> > > > > > >
> > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > >
> > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > they're on the right numa node?
> > > > >
> > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > opinion about this feature ? Thanks!
> > > >
> > > > I would like to see some other opinions from potential users, have you
> > > > been circulating it?
> > >
> > > We have been using it internally for a while. I don't know who the
> > > potential users are and how to reach them so I am sharing it here to
> > > collect opinions from others.
> >
> > Should definitely have a separate Kconfig option. Have you measured
> > the memory and performance overhead of this change?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-04 0:55 ` Casey Chen
@ 2025-06-04 15:21 ` Suren Baghdasaryan
2025-06-04 15:50 ` Kent Overstreet
2025-06-10 0:21 ` Casey Chen
0 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 15:21 UTC (permalink / raw)
To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong
On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote:
>
> On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > > >
> > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > >
> > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > > For example, the resulting file looks like below.
> > > > > > > >
> > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > ...
> > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > >
> > > > > > > Err, what is 'percpu y/n'?
> > > > > > >
> > > > > >
> > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > > correct calculations.
> > > > >
> > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > > though, a single character should suffice (we already have a header that
> > > > > can explain what it is) - if you're growing the width we don't want to
> > > > > overflow.
> > > > >
> > > >
> > > > Does it have a header ?
> > >
> > > Yes. See print_allocinfo_header().
> >
> > I was thinking if instead of changing /proc/allocinfo format to
> > contain both total and per-node information we can keep it as is
> > (containing only totals) while exposing per-node information inside
> > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> > cleaner to me.
> >
>
> The output of /sys/devices/system/node/node<node_no>/allocinfo is
> strictly limited to a single PAGE_SIZE and it cannot display stats for
> all tags.
Ugh, that's a pity. Another option would be to add a "nid" column like
this when this config is specified:
nid bytes calls
0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc
1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
...
It bloats the file size but looks more structured to me.
>
> > I'm also not a fan of "percpu y" tags as that requires the reader to
> > know how many CPUs were in the system to make the calculation (you
> > might get the allocinfo content from a system you have no access to
> > and no additional information). Maybe we can have "per-cpu bytes" and
> > "total bytes" columns instead? For per-cpu allocations these will be
> > different, for all other allocations these two columns will contain
> > the same number.
>
> I plan to remove 'percpu y/n' from this patch and implement it later.
>
> >
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > > >
> > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > > these counters are not accounted in profiling stats.
> > > > > > > >
> > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > > >
> > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > > they're on the right numa node?
> > > > > >
> > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > > opinion about this feature ? Thanks!
> > > > >
> > > > > I would like to see some other opinions from potential users, have you
> > > > > been circulating it?
> > > >
> > > > We have been using it internally for a while. I don't know who the
> > > > potential users are and how to reach them so I am sharing it here to
> > > > collect opinions from others.
> > >
> > > Should definitely have a separate Kconfig option. Have you measured
> > > the memory and performance overhead of this change?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-04 15:21 ` Suren Baghdasaryan
@ 2025-06-04 15:50 ` Kent Overstreet
2025-06-10 0:21 ` Casey Chen
1 sibling, 0 replies; 20+ messages in thread
From: Kent Overstreet @ 2025-06-04 15:50 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: Casey Chen, linux-mm, yzhong
On Wed, Jun 04, 2025 at 08:21:48AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > > > >
> > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > > >
> > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > > > For example, the resulting file looks like below.
> > > > > > > > >
> > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > > ...
> > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > > >
> > > > > > > > Err, what is 'percpu y/n'?
> > > > > > > >
> > > > > > >
> > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > > > correct calculations.
> > > > > >
> > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > > > though, a single character should suffice (we already have a header that
> > > > > > can explain what it is) - if you're growing the width we don't want to
> > > > > > overflow.
> > > > > >
> > > > >
> > > > > Does it have a header ?
> > > >
> > > > Yes. See print_allocinfo_header().
> > >
> > > I was thinking if instead of changing /proc/allocinfo format to
> > > contain both total and per-node information we can keep it as is
> > > (containing only totals) while exposing per-node information inside
> > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> > > cleaner to me.
> > >
> >
> > The output of /sys/devices/system/node/node<node_no>/allocinfo is
> > strictly limited to a single PAGE_SIZE and it cannot display stats for
> > all tags.
>
> Ugh, that's a pity. Another option would be to add a "nid" column like
> this when this config is specified:
>
> nid bytes calls
> 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc
> 1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> ...
>
> It bloats the file size but looks more structured to me.
Debugfs is also an option.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-04 15:21 ` Suren Baghdasaryan
2025-06-04 15:50 ` Kent Overstreet
@ 2025-06-10 0:21 ` Casey Chen
2025-06-10 15:56 ` Suren Baghdasaryan
1 sibling, 1 reply; 20+ messages in thread
From: Casey Chen @ 2025-06-10 0:21 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong
On Wed, Jun 4, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > > > >
> > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > > >
> > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > > > For example, the resulting file looks like below.
> > > > > > > > >
> > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > > ...
> > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > > >
> > > > > > > > Err, what is 'percpu y/n'?
> > > > > > > >
> > > > > > >
> > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > > > correct calculations.
> > > > > >
> > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > > > though, a single character should suffice (we already have a header that
> > > > > > can explain what it is) - if you're growing the width we don't want to
> > > > > > overflow.
> > > > > >
> > > > >
> > > > > Does it have a header ?
> > > >
> > > > Yes. See print_allocinfo_header().
> > >
> > > I was thinking if instead of changing /proc/allocinfo format to
> > > contain both total and per-node information we can keep it as is
> > > (containing only totals) while exposing per-node information inside
> > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> > > cleaner to me.
> > >
> >
> > The output of /sys/devices/system/node/node<node_no>/allocinfo is
> > strictly limited to a single PAGE_SIZE and it cannot display stats for
> > all tags.
>
> Ugh, that's a pity. Another option would be to add a "nid" column like
> this when this config is specified:
>
> nid bytes calls
> 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc
> 1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> ...
>
> It bloats the file size but looks more structured to me.
>
How about this format ?
With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y, /proc/allocinfo looks like:
allocinfo - version: 1.0
<nid> <size> <calls> <tag info>
0 0 init/main.c:1310 func:do_initcalls
0 0 0
1 0 0
...
776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq
0 348672 681
1 428032 836
6144 6 kernel/workqueue.c:4133 func:get_unbound_pool
0 4096 4
1 2048 2
With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, /proc/allocinfo
stays the same as before:
allocinfo - version: 1.0
<nid> <size> <calls> <tag info>
0 0 init/main.c:1310 func:do_initcalls
0 0 init/do_mounts.c:350 func:mount_nodev_root
0 0 init/do_mounts.c:187 func:mount_root_generic
...
> >
> > > I'm also not a fan of "percpu y" tags as that requires the reader to
> > > know how many CPUs were in the system to make the calculation (you
> > > might get the allocinfo content from a system you have no access to
> > > and no additional information). Maybe we can have "per-cpu bytes" and
> > > "total bytes" columns instead? For per-cpu allocations these will be
> > > different, for all other allocations these two columns will contain
> > > the same number.
> >
> > I plan to remove 'percpu y/n' from this patch and implement it later.
> >
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > > > >
> > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > > > these counters are not accounted in profiling stats.
> > > > > > > > >
> > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > > > >
> > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > > > they're on the right numa node?
> > > > > > >
> > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > > > opinion about this feature ? Thanks!
> > > > > >
> > > > > > I would like to see some other opinions from potential users, have you
> > > > > > been circulating it?
> > > > >
> > > > > We have been using it internally for a while. I don't know who the
> > > > > potential users are and how to reach them so I am sharing it here to
> > > > > collect opinions from others.
> > > >
> > > > Should definitely have a separate Kconfig option. Have you measured
> > > > the memory and performance overhead of this change?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats
2025-06-10 0:21 ` Casey Chen
@ 2025-06-10 15:56 ` Suren Baghdasaryan
0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-10 15:56 UTC (permalink / raw)
To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong
On Mon, Jun 9, 2025 at 5:22 PM Casey Chen <cachen@purestorage.com> wrote:
>
> On Wed, Jun 4, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote:
> > >
> > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > > > >
> > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA
> > > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0.
> > > > > > > > > > For example, the resulting file looks like below.
> > > > > > > > > >
> > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > > > ...
> > > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > > > >
> > > > > > > > > Err, what is 'percpu y/n'?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation
> > > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of
> > > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of
> > > > > > > > memory used. Any /proc/allocinfo parser can understand it and make
> > > > > > > > correct calculations.
> > > > > > >
> > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that
> > > > > > > though, a single character should suffice (we already have a header that
> > > > > > > can explain what it is) - if you're growing the width we don't want to
> > > > > > > overflow.
> > > > > > >
> > > > > >
> > > > > > Does it have a header ?
> > > > >
> > > > > Yes. See print_allocinfo_header().
> > > >
> > > > I was thinking if instead of changing /proc/allocinfo format to
> > > > contain both total and per-node information we can keep it as is
> > > > (containing only totals) while exposing per-node information inside
> > > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems
> > > > cleaner to me.
> > > >
> > >
> > > The output of /sys/devices/system/node/node<node_no>/allocinfo is
> > > strictly limited to a single PAGE_SIZE and it cannot display stats for
> > > all tags.
> >
> > Ugh, that's a pity. Another option would be to add a "nid" column like
> > this when this config is specified:
> >
> > nid bytes calls
> > 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc
> > 1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > ...
> >
> > It bloats the file size but looks more structured to me.
> >
>
> How about this format ?
>
> With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y, /proc/allocinfo looks like:
> allocinfo - version: 1.0
> <nid> <size> <calls> <tag info>
> 0 0 init/main.c:1310 func:do_initcalls
> 0 0 0
> 1 0 0
If we go that way then why not:
allocinfo - version: 2.0
<size> <calls> <tag info>
776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq
nid0 348672 681
nid1 428032 836
6144 6 kernel/workqueue.c:4133 func:get_unbound_pool
nid0 4096 4
nid1 2048 2
...
If CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n the file format will not change.
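As a purely illustrative sketch (not something the patch provides), a
consumer of that layout could attach the nid rows to the preceding
total row, while an old parser that only reads lines starting with a
number keeps working unchanged:
```
# Hypothetical parser for the "version: 2.0" layout sketched above.
def parse_allocinfo_v2(lines):
    tags, current = {}, None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("nid") and current is not None:
            # Per-node row: "nid<N> <bytes> <calls>"
            tags[current]["per_node"][fields[0]] = (int(fields[1]),
                                                    int(fields[2]))
        elif fields[0].isdigit():
            # Total row keeps the existing "<size> <calls> <tag>" shape.
            current = " ".join(fields[2:])
            tags[current] = {"bytes": int(fields[0]),
                             "calls": int(fields[1]),
                             "per_node": {}}
    return tags

sample = [
    "776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq",
    "nid0 348672 681",
    "nid1 428032 836",
]
print(parse_allocinfo_v2(sample))
```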
> ...
> 776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq
> 0 348672 681
> 1 428032 836
> 6144 6 kernel/workqueue.c:4133 func:get_unbound_pool
> 0 4096 4
> 1 2048 2
>
> With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, /proc/allocinfo
> stays the same as before:
> allocinfo - version: 1.0
> <nid> <size> <calls> <tag info>
> 0 0 init/main.c:1310 func:do_initcalls
> 0 0 init/do_mounts.c:350 func:mount_nodev_root
> 0 0 init/do_mounts.c:187 func:mount_root_generic
> ...
>
> > >
> > > > I'm also not a fan of "percpu y" tags as that requires the reader to
> > > > know how many CPUs were in the system to make the calculation (you
> > > > might get the allocinfo content from a system you have no access to
> > > > and no additional information). Maybe we can have "per-cpu bytes" and
> > > > "total bytes" columns instead? For per-cpu allocations these will be
> > > > different, for all other allocations these two columns will contain
> > > > the same number.
> > >
> > > I plan to remove 'percpu y/n' from this patch and implement it later.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the
> > > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators
> > > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE.
> > > > > > > > > >
> > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for
> > > > > > > > > > these counters are not accounted in profiling stats.
> > > > > > > > > >
> > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted.
> > > > > > > > >
> > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be
> > > > > > > > > of interest to people looking at optimizing allocations to make sure
> > > > > > > > > they're on the right numa node?
> > > > > > > >
> > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some
> > > > > > > > optimizations. I can make it a kconfig. Does anybody else have any
> > > > > > > > opinion about this feature ? Thanks!
> > > > > > >
> > > > > > > I would like to see some other opinions from potential users, have you
> > > > > > > been circulating it?
> > > > > >
> > > > > > We have been using it internally for a while. I don't know who the
> > > > > > potential users are and how to reach them so I am sharing it here to
> > > > > > collect opinions from others.
> > > > >
> > > > > Should definitely have a separate Kconfig option. Have you measured
> > > > > the memory and performance overhead of this change?
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2025-06-10 15:56 UTC | newest]
Thread overview: 20+ messages
2025-05-30 0:39 [PATCH 0/1] alloc_tag: add per-numa node stats Casey Chen
2025-05-30 0:39 ` [PATCH] " Casey Chen
2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet
2025-05-30 21:45 ` Casey Chen
2025-05-31 0:05 ` Kent Overstreet
2025-06-02 20:48 ` Casey Chen
2025-06-02 21:32 ` Suren Baghdasaryan
2025-06-03 15:00 ` Suren Baghdasaryan
2025-06-03 17:34 ` Kent Overstreet
2025-06-04 0:55 ` Casey Chen
2025-06-04 15:21 ` Suren Baghdasaryan
2025-06-04 15:50 ` Kent Overstreet
2025-06-10 0:21 ` Casey Chen
2025-06-10 15:56 ` Suren Baghdasaryan
2025-06-03 20:00 ` Casey Chen
2025-06-03 20:18 ` Suren Baghdasaryan
2025-06-02 21:52 ` Kent Overstreet
2025-06-02 22:08 ` Steven Rostedt
2025-06-02 23:35 ` Kent Overstreet
2025-06-03 6:46 ` Ian Rogers