* [PATCH 0/1] alloc_tag: add per-numa node stats
From: Casey Chen @ 2025-05-30  0:39 UTC
To: linux-mm, surenb, kent.overstreet; +Cc: yzhong, cachen

The patch is based on 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
from the mm-new branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.

The patch adds per-NUMA-node alloc_tag stats. Bytes/calls in total and per
NUMA node are displayed in a single row for each alloc_tag in /proc/allocinfo.
Percpu allocations are also marked, and their stats are stored on NUMA node 0.
For example, the resulting file looks like this:

percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc
percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk
percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one
...
percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box

To save memory, the per-NUMA-node stats counters are allocated dynamically
once the system has booted and knows how many NUMA nodes are available. The
percpu allocator is used for these allocations, hence PERCPU_DYNAMIC_RESERVE
is increased.

For in-kernel alloc_tags, pcpu_alloc_noprof() is called, so the memory for
these counters is not accounted in the profiling stats.

For loadable modules, __alloc_percpu_gfp() is called and the memory is
accounted:

percpu y total 17024 532 numa0 17024 532 numa1 0 0 lib/alloc_tag.c:564 func:load_module

Casey Chen (1):
  alloc_tag: add per-numa node stats

 include/linux/alloc_tag.h | 48 +++++++++++++++++++++++++++------------
 include/linux/codetag.h   |  4 ++++
 include/linux/percpu.h    |  2 +-
 lib/alloc_tag.c           | 41 +++++++++++++++++++++++++++------
 mm/page_alloc.c           | 35 ++++++++++++++--------------
 mm/percpu.c               |  8 +++++--
 mm/show_mem.c             | 21 ++++++++++++-----
 mm/slub.c                 | 10 +++++---
 8 files changed, 119 insertions(+), 50 deletions(-)

--
2.34.1
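A minimal sketch of the boot-time counter allocation described above, reusing
names from the patch (num_numa_nodes, struct alloc_tag_counters,
pcpu_alloc_noprof()); the helper name is invented for illustration and error
handling is reduced to the essentials:

/*
 * Each tag gets one percpu array with one slot per NUMA node, so every
 * possible CPU carries num_numa_nodes counters for the tag.
 * pcpu_alloc_noprof() keeps the counter memory itself out of the
 * profiling stats; loadable modules use __alloc_percpu_gfp() instead,
 * so their counter memory is accounted.
 */
static int alloc_tag_init_counters(struct alloc_tag *tag)
{
	size_t sz = num_numa_nodes * sizeof(struct alloc_tag_counters);

	tag->counters = pcpu_alloc_noprof(sz, sizeof(struct alloc_tag_counters),
					  false, GFP_KERNEL | __GFP_ZERO);
	return tag->counters ? 0 : -ENOMEM;
}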
* [PATCH] alloc_tag: add per-numa node stats 2025-05-30 0:39 [PATCH 0/1] alloc_tag: add per-numa node stats Casey Chen @ 2025-05-30 0:39 ` Casey Chen 2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet 1 sibling, 0 replies; 20+ messages in thread From: Casey Chen @ 2025-05-30 0:39 UTC (permalink / raw) To: linux-mm, surenb, kent.overstreet; +Cc: yzhong, cachen Add per-numa stats for each alloc_tag. We used to have only one alloc_tag_counters per CPU, now each CPU has one per numa node. bytes/calls in total and for each numa node are now displayed together in a single row for each alloc_tag in /proc/allocinfo. Note for percpu allocation, per-numa stats doesn't make sense. Numa nodes for per-CPU memory vary. Each CPU usually gets copy from its local numa node. We don't have a way to count numa node stats by CPU, so just store all stats in numa 0. Also, the 'bytes' field is just the number needed by a single CPU, to get the total bytes, multiply it by number of possible CPUs. Added boolean field 'percpu' to mark all percpu allocations in /proc/allocinfo. To minimize memory usage, alloc_tag stats counters are dynamically allocated with percpu allocator. Increase PERCPU_DYNAMIC_RESERVE to accommodate counters for in-kernel alloc_tags. For in-kernel alloc_tag, pcpu_alloc_noprof() is called to allocate stats counters, which is not accounted for in profiling stats. Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com> --- include/linux/alloc_tag.h | 49 ++++++++++++++++++++++++++++----------- include/linux/codetag.h | 4 ++++ include/linux/percpu.h | 2 +- lib/alloc_tag.c | 43 ++++++++++++++++++++++++++++------ mm/page_alloc.c | 35 ++++++++++++++-------------- mm/percpu.c | 8 +++++-- mm/show_mem.c | 20 ++++++++++++---- mm/slub.c | 11 ++++++--- 8 files changed, 123 insertions(+), 49 deletions(-) diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 8f7931eb7d16..99d4a1823e51 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -15,6 +15,8 @@ #include <linux/static_key.h> #include <linux/irqflags.h> +extern int num_numa_nodes; + struct alloc_tag_counters { u64 bytes; u64 calls; @@ -134,16 +136,34 @@ static inline bool mem_alloc_profiling_enabled(void) &mem_alloc_profiling_key); } +static inline struct alloc_tag_counters alloc_tag_read_nid(struct alloc_tag *tag, int nid) +{ + struct alloc_tag_counters v = { 0, 0 }; + struct alloc_tag_counters *counters; + int cpu; + + for_each_possible_cpu(cpu) { + counters = per_cpu_ptr(tag->counters, cpu); + v.bytes += counters[nid].bytes; + v.calls += counters[nid].calls; + } + + return v; +} + static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag) { struct alloc_tag_counters v = { 0, 0 }; - struct alloc_tag_counters *counter; + struct alloc_tag_counters *counters; int cpu; + int nid; for_each_possible_cpu(cpu) { - counter = per_cpu_ptr(tag->counters, cpu); - v.bytes += counter->bytes; - v.calls += counter->calls; + counters = per_cpu_ptr(tag->counters, cpu); + for (nid = 0; nid < num_numa_nodes; nid++) { + v.bytes += counters[nid].bytes; + v.calls += counters[nid].calls; + } } return v; @@ -179,7 +199,7 @@ static inline bool __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag return true; } -static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) +static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag, int nid) { if (unlikely(!__alloc_tag_ref_set(ref, tag))) return false; @@ -190,17 +210,18 @@ 
static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *t * Each new reference for every sub-allocation needs to increment call * counter because when we free each part the counter will be decremented. */ - this_cpu_inc(tag->counters->calls); + this_cpu_inc(tag->counters[nid].calls); return true; } -static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes) +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, + int nid, size_t bytes) { - if (likely(alloc_tag_ref_set(ref, tag))) - this_cpu_add(tag->counters->bytes, bytes); + if (likely(alloc_tag_ref_set(ref, tag, nid))) + this_cpu_add(tag->counters[nid].bytes, bytes); } -static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) +static inline void alloc_tag_sub(union codetag_ref *ref, int nid, size_t bytes) { struct alloc_tag *tag; @@ -215,8 +236,8 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) tag = ct_to_alloc_tag(ref->ct); - this_cpu_sub(tag->counters->bytes, bytes); - this_cpu_dec(tag->counters->calls); + this_cpu_sub(tag->counters[nid].bytes, bytes); + this_cpu_dec(tag->counters[nid].calls); ref->ct = NULL; } @@ -228,8 +249,8 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) #define DEFINE_ALLOC_TAG(_alloc_tag) static inline bool mem_alloc_profiling_enabled(void) { return false; } static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, - size_t bytes) {} -static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {} + int nid, size_t bytes) {} +static inline void alloc_tag_sub(union codetag_ref *ref, int nid, size_t bytes) {} #define alloc_tag_record(p) do {} while (0) #endif /* CONFIG_MEM_ALLOC_PROFILING */ diff --git a/include/linux/codetag.h b/include/linux/codetag.h index 5f2b9a1f722c..79d6b96c61f6 100644 --- a/include/linux/codetag.h +++ b/include/linux/codetag.h @@ -16,6 +16,10 @@ struct module; #define CODETAG_SECTION_START_PREFIX "__start_" #define CODETAG_SECTION_STOP_PREFIX "__stop_" +enum codetag_flags { + CODETAG_PERCPU_ALLOC = (1 << 0), /* codetag tracking percpu allocation */ +}; + /* * An instance of this structure is created in a special ELF section at every * code location being tagged. At runtime, the special section is treated as diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 85bf8dd9f087..d92c27fbcd0d 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -43,7 +43,7 @@ # define PERCPU_DYNAMIC_SIZE_SHIFT 12 #endif /* LOCKDEP and PAGE_SIZE > 4KiB */ #else -#define PERCPU_DYNAMIC_SIZE_SHIFT 10 +#define PERCPU_DYNAMIC_SIZE_SHIFT 13 #endif /* diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index d48b80f3f007..b4d2d5663c4c 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -42,6 +42,9 @@ struct allocinfo_private { bool print_header; }; +int num_numa_nodes; +static unsigned long pcpu_counters_size; + static void *allocinfo_start(struct seq_file *m, loff_t *pos) { struct allocinfo_private *priv; @@ -95,9 +98,16 @@ static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct) { struct alloc_tag *tag = ct_to_alloc_tag(ct); struct alloc_tag_counters counter = alloc_tag_read(tag); - s64 bytes = counter.bytes; + int nid; + + seq_buf_printf(out, "percpu %c total %12lli %8llu ", + ct->flags & CODETAG_PERCPU_ALLOC ? 
'y' : 'n', + counter.bytes, counter.calls); + for (nid = 0; nid < num_numa_nodes; nid++) { + counter = alloc_tag_read_nid(tag, nid); + seq_buf_printf(out, "numa%d %12lli %8llu ", nid, counter.bytes, counter.calls); + } - seq_buf_printf(out, "%12lli %8llu ", bytes, counter.calls); codetag_to_text(out, ct); seq_buf_putc(out, ' '); seq_buf_putc(out, '\n'); @@ -184,7 +194,7 @@ void pgalloc_tag_split(struct folio *folio, int old_order, int new_order) if (get_page_tag_ref(folio_page(folio, i), &ref, &handle)) { /* Set new reference to point to the original tag */ - alloc_tag_ref_set(&ref, tag); + alloc_tag_ref_set(&ref, tag, folio_nid(folio)); update_page_tag_ref(handle, &ref); put_page_tag_ref(handle); } @@ -247,19 +257,36 @@ static void shutdown_mem_profiling(bool remove_file) void __init alloc_tag_sec_init(void) { struct alloc_tag *last_codetag; + int i; if (!mem_profiling_support) return; - if (!static_key_enabled(&mem_profiling_compressed)) - return; - kernel_tags.first_tag = (struct alloc_tag *)kallsyms_lookup_name( SECTION_START(ALLOC_TAG_SECTION_NAME)); last_codetag = (struct alloc_tag *)kallsyms_lookup_name( SECTION_STOP(ALLOC_TAG_SECTION_NAME)); kernel_tags.count = last_codetag - kernel_tags.first_tag; + num_numa_nodes = num_possible_nodes(); + pcpu_counters_size = num_numa_nodes * sizeof(struct alloc_tag_counters); + for (i = 0; i < kernel_tags.count; i++) { + /* Each CPU has one counter per numa node */ + kernel_tags.first_tag[i].counters = + pcpu_alloc_noprof(pcpu_counters_size, + sizeof(struct alloc_tag_counters), + false, GFP_KERNEL | __GFP_ZERO); + if (!kernel_tags.first_tag[i].counters) { + while (--i >= 0) + free_percpu(kernel_tags.first_tag[i].counters); + pr_info("Failed to allocate per-cpu alloc_tag counters\n"); + return; + } + } + + if (!static_key_enabled(&mem_profiling_compressed)) + return; + /* Check if kernel tags fit into page flags */ if (kernel_tags.count > (1UL << NR_UNUSED_PAGEFLAG_BITS)) { shutdown_mem_profiling(false); /* allocinfo file does not exist yet */ @@ -622,7 +649,9 @@ static int load_module(struct module *mod, struct codetag *start, struct codetag stop_tag = ct_to_alloc_tag(stop); for (tag = start_tag; tag < stop_tag; tag++) { WARN_ON(tag->counters); - tag->counters = alloc_percpu(struct alloc_tag_counters); + tag->counters = __alloc_percpu_gfp(pcpu_counters_size, + sizeof(struct alloc_tag_counters), + GFP_KERNEL | __GFP_ZERO); if (!tag->counters) { while (--tag >= start_tag) { free_percpu(tag->counters); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 90b06f3d004c..8219d8de6f97 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1107,58 +1107,59 @@ void __clear_page_tag_ref(struct page *page) /* Should be called only if mem_alloc_profiling_enabled() */ static noinline void __pgalloc_tag_add(struct page *page, struct task_struct *task, - unsigned int nr) + int nid, unsigned int nr) { union pgtag_ref_handle handle; union codetag_ref ref; if (get_page_tag_ref(page, &ref, &handle)) { - alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr); + alloc_tag_add(&ref, task->alloc_tag, nid, PAGE_SIZE * nr); update_page_tag_ref(handle, &ref); put_page_tag_ref(handle); } } static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, - unsigned int nr) + int nid, unsigned int nr) { if (mem_alloc_profiling_enabled()) - __pgalloc_tag_add(page, task, nr); + __pgalloc_tag_add(page, task, nid, nr); } /* Should be called only if mem_alloc_profiling_enabled() */ static noinline -void __pgalloc_tag_sub(struct page *page, unsigned int nr) +void 
__pgalloc_tag_sub(struct page *page, int nid, unsigned int nr) { union pgtag_ref_handle handle; union codetag_ref ref; if (get_page_tag_ref(page, &ref, &handle)) { - alloc_tag_sub(&ref, PAGE_SIZE * nr); + alloc_tag_sub(&ref, nid, PAGE_SIZE * nr); update_page_tag_ref(handle, &ref); put_page_tag_ref(handle); } } -static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) +static inline void pgalloc_tag_sub(struct page *page, int nid, unsigned int nr) { if (mem_alloc_profiling_enabled()) - __pgalloc_tag_sub(page, nr); + __pgalloc_tag_sub(page, nid, nr); } /* When tag is not NULL, assuming mem_alloc_profiling_enabled */ -static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) +static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, + int nid, unsigned int nr) { if (tag) - this_cpu_sub(tag->counters->bytes, PAGE_SIZE * nr); + this_cpu_sub(tag->counters[nid].bytes, PAGE_SIZE * nr); } #else /* CONFIG_MEM_ALLOC_PROFILING */ static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, - unsigned int nr) {} -static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {} -static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {} + int nid, unsigned int nr) {} +static inline void pgalloc_tag_sub(struct page *page, int nid, unsigned int nr) {} +static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, int nid, unsigned int nr) {} #endif /* CONFIG_MEM_ALLOC_PROFILING */ @@ -1197,7 +1198,7 @@ __always_inline bool free_pages_prepare(struct page *page, /* Do not let hwpoison pages hit pcplists/buddy */ reset_page_owner(page, order); page_table_check_free(page, order); - pgalloc_tag_sub(page, 1 << order); + pgalloc_tag_sub(page, page_to_nid(page), 1 << order); /* * The page is isolated and accounted for. 
@@ -1251,7 +1252,7 @@ __always_inline bool free_pages_prepare(struct page *page, page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; reset_page_owner(page, order); page_table_check_free(page, order); - pgalloc_tag_sub(page, 1 << order); + pgalloc_tag_sub(page, page_to_nid(page), 1 << order); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page), @@ -1707,7 +1708,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, set_page_owner(page, order, gfp_flags); page_table_check_alloc(page, order); - pgalloc_tag_add(page, current, 1 << order); + pgalloc_tag_add(page, current, page_to_nid(page), 1 << order); } static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, @@ -5064,7 +5065,7 @@ static void ___free_pages(struct page *page, unsigned int order, if (put_page_testzero(page)) __free_frozen_pages(page, order, fpi_flags); else if (!head) { - pgalloc_tag_sub_pages(tag, (1 << order) - 1); + pgalloc_tag_sub_pages(tag, page_to_nid(page), (1 << order) - 1); while (order-- > 0) __free_frozen_pages(page + (1 << order), order, fpi_flags); diff --git a/mm/percpu.c b/mm/percpu.c index b35494c8ede2..130450e9718e 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1691,15 +1691,19 @@ static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off, size_t size) { if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) { + /* For percpu allocation, store all alloc_tag stats on numa node 0 */ alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, - current->alloc_tag, size); + current->alloc_tag, 0, size); + if (current->alloc_tag) + current->alloc_tag->ct.flags |= CODETAG_PERCPU_ALLOC; } } static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size) { + /* percpu alloc_tag stats is stored on numa node 0 so subtract from node 0 */ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) - alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size); + alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, 0, size); } #else static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off, diff --git a/mm/show_mem.c b/mm/show_mem.c index 03e8d968fd1a..132b3aa82d83 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -5,6 +5,7 @@ * Copyright (C) 2008 Johannes Weiner <hannes@saeurebad.de> */ +#include <linux/alloc_tag.h> #include <linux/blkdev.h> #include <linux/cma.h> #include <linux/cpuset.h> @@ -433,18 +434,27 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) struct alloc_tag *tag = ct_to_alloc_tag(ct); struct alloc_tag_counters counter = alloc_tag_read(tag); char bytes[10]; + int nid; string_get_size(counter.bytes, 1, STRING_UNITS_2, bytes, sizeof(bytes)); + pr_notice("percpu %c total %12s %8llu ", + ct->flags & CODETAG_PERCPU_ALLOC ? 
'y' : 'n', + bytes, counter.calls); + + for (nid = 0; nid < num_numa_nodes; nid++) { + counter = alloc_tag_read_nid(tag, nid); + string_get_size(counter.bytes, 1, STRING_UNITS_2, + bytes, sizeof(bytes)); + pr_notice("numa%d %12s %8llu ", nid, bytes, counter.calls); + } /* Same as alloc_tag_to_text() but w/o intermediate buffer */ if (ct->modname) - pr_notice("%12s %8llu %s:%u [%s] func:%s\n", - bytes, counter.calls, ct->filename, + pr_notice("%s:%u [%s] func:%s\n", ct->filename, ct->lineno, ct->modname, ct->function); else - pr_notice("%12s %8llu %s:%u func:%s\n", - bytes, counter.calls, ct->filename, - ct->lineno, ct->function); + pr_notice("%s:%u func:%s\n", + ct->filename, ct->lineno, ct->function); } } } diff --git a/mm/slub.c b/mm/slub.c index be8b09e09d30..068b88b85d80 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2104,8 +2104,12 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags) * If other users appear then mem_alloc_profiling_enabled() * check should be added before alloc_tag_add(). */ - if (likely(obj_exts)) - alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size); + if (likely(obj_exts)) { + struct page *page = virt_to_page(object); + + alloc_tag_add(&obj_exts->ref, current->alloc_tag, + page_to_nid(page), s->size); + } } static inline void @@ -2133,8 +2137,9 @@ __alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p for (i = 0; i < objects; i++) { unsigned int off = obj_to_index(s, slab, p[i]); + struct page *page = virt_to_page(p[i]); - alloc_tag_sub(&obj_exts[off].ref, s->size); + alloc_tag_sub(&obj_exts[off].ref, page_to_nid(page), s->size); } } -- 2.34.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-05-30 0:39 [PATCH 0/1] alloc_tag: add per-numa node stats Casey Chen 2025-05-30 0:39 ` [PATCH] " Casey Chen @ 2025-05-30 1:11 ` Kent Overstreet 2025-05-30 21:45 ` Casey Chen 1 sibling, 1 reply; 20+ messages in thread From: Kent Overstreet @ 2025-05-30 1:11 UTC (permalink / raw) To: Casey Chen; +Cc: linux-mm, surenb, yzhong On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > Also percpu allocation is marked and its stats is stored on NUMA node 0. > For example, the resulting file looks like below. > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > ... > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box Err, what is 'percpu y/n'? > > To save memory, we dynamically allocate per-NUMA node stats counter once the > system boots up and knows how many NUMA nodes available. percpu allocators > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > these counters are not accounted in profiling stats. > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be of interest to people looking at optimizing allocations to make sure they're on the right numa node? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-05-30 1:11 ` [PATCH 0/1] " Kent Overstreet @ 2025-05-30 21:45 ` Casey Chen 2025-05-31 0:05 ` Kent Overstreet 0 siblings, 1 reply; 20+ messages in thread From: Casey Chen @ 2025-05-30 21:45 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-mm, surenb, yzhong On Thu, May 29, 2025 at 6:11 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > For example, the resulting file looks like below. > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > ... > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > Err, what is 'percpu y/n'? > Mark percpu allocation with 'percpu y/n' because for percpu allocation stats, 'bytes' is per-cpu, we have to multiply it by the number of CPUs to get the total bytes. Mark it so we know the exact amount of memory used. Any /proc/allocinfo parser can understand it and make correct calculations. > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > system boots up and knows how many NUMA nodes available. percpu allocators > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > these counters are not accounted in profiling stats. > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > of interest to people looking at optimizing allocations to make sure > they're on the right numa node? Yes, to help us know if there is an NUMA imbalance issue and make some optimizations. I can make it a kconfig. Does anybody else have any opinion about this feature ? Thanks! ^ permalink raw reply [flat|nested] 20+ messages in thread
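To make the calculation Casey describes concrete, a consumer of
/proc/allocinfo could scale the rows marked percpu like this (hypothetical
helper, not part of the patch):

/*
 * Rows marked "percpu y" record the per-CPU size only; everything else
 * already reports a machine-wide total.
 */
static u64 allocinfo_total_bytes(u64 row_bytes, bool is_percpu)
{
	return is_percpu ? row_bytes * num_possible_cpus() : row_bytes;
}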
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-05-30 21:45 ` Casey Chen @ 2025-05-31 0:05 ` Kent Overstreet 2025-06-02 20:48 ` Casey Chen 0 siblings, 1 reply; 20+ messages in thread From: Kent Overstreet @ 2025-05-31 0:05 UTC (permalink / raw) To: Casey Chen; +Cc: linux-mm, surenb, yzhong On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > For example, the resulting file looks like below. > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > ... > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > Err, what is 'percpu y/n'? > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > stats, 'bytes' is per-cpu, we have to multiply it by the number of > CPUs to get the total bytes. Mark it so we know the exact amount of > memory used. Any /proc/allocinfo parser can understand it and make > correct calculations. Ok, just wanted to be sure it wasn't something else. Let's shorten that though, a single character should suffice (we already have a header that can explain what it is) - if you're growing the width we don't want to overflow. > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > these counters are not accounted in profiling stats. > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > of interest to people looking at optimizing allocations to make sure > > they're on the right numa node? > > Yes, to help us know if there is an NUMA imbalance issue and make some > optimizations. I can make it a kconfig. Does anybody else have any > opinion about this feature ? Thanks! I would like to see some other opinions from potential users, have you been circulating it? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-05-31 0:05 ` Kent Overstreet @ 2025-06-02 20:48 ` Casey Chen 2025-06-02 21:32 ` Suren Baghdasaryan 2025-06-02 21:52 ` Kent Overstreet 0 siblings, 2 replies; 20+ messages in thread From: Casey Chen @ 2025-06-02 20:48 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-mm, surenb, yzhong On Fri, May 30, 2025 at 5:05 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > For example, the resulting file looks like below. > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > ... > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > Err, what is 'percpu y/n'? > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > CPUs to get the total bytes. Mark it so we know the exact amount of > > memory used. Any /proc/allocinfo parser can understand it and make > > correct calculations. > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > though, a single character should suffice (we already have a header that > can explain what it is) - if you're growing the width we don't want to > overflow. > Does it have a header ? > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > these counters are not accounted in profiling stats. > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > of interest to people looking at optimizing allocations to make sure > > > they're on the right numa node? > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > optimizations. I can make it a kconfig. Does anybody else have any > > opinion about this feature ? Thanks! > > I would like to see some other opinions from potential users, have you > been circulating it? We have been using it internally for a while. 
I don't know who the potential users are and how to reach them so I am sharing it here to collect opinions from others. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 20:48 ` Casey Chen @ 2025-06-02 21:32 ` Suren Baghdasaryan 2025-06-03 15:00 ` Suren Baghdasaryan 2025-06-03 20:00 ` Casey Chen 2025-06-02 21:52 ` Kent Overstreet 1 sibling, 2 replies; 20+ messages in thread From: Suren Baghdasaryan @ 2025-06-02 21:32 UTC (permalink / raw) To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > For example, the resulting file looks like below. > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > ... > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > memory used. Any /proc/allocinfo parser can understand it and make > > > correct calculations. > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > though, a single character should suffice (we already have a header that > > can explain what it is) - if you're growing the width we don't want to > > overflow. > > > > Does it have a header ? Yes. See print_allocinfo_header(). > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > of interest to people looking at optimizing allocations to make sure > > > > they're on the right numa node? > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > optimizations. I can make it a kconfig. 
Does anybody else have any > > > opinion about this feature ? Thanks! > > > > I would like to see some other opinions from potential users, have you > > been circulating it? > > We have been using it internally for a while. I don't know who the > potential users are and how to reach them so I am sharing it here to > collect opinions from others. Should definitely have a separate Kconfig option. Have you measured the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 21:32 ` Suren Baghdasaryan @ 2025-06-03 15:00 ` Suren Baghdasaryan 2025-06-03 17:34 ` Kent Overstreet 2025-06-04 0:55 ` Casey Chen 2025-06-03 20:00 ` Casey Chen 1 sibling, 2 replies; 20+ messages in thread From: Suren Baghdasaryan @ 2025-06-03 15:00 UTC (permalink / raw) To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > ... > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > correct calculations. > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > though, a single character should suffice (we already have a header that > > > can explain what it is) - if you're growing the width we don't want to > > > overflow. > > > > > > > Does it have a header ? > > Yes. See print_allocinfo_header(). I was thinking if instead of changing /proc/allocinfo format to contain both total and per-node information we can keep it as is (containing only totals) while exposing per-node information inside new /sys/devices/system/node/node<node_no>/allocinfo files. That seems cleaner to me. I'm also not a fan of "percpu y" tags as that requires the reader to know how many CPUs were in the system to make the calculation (you might get the allocinfo content from a system you have no access to and no additional information). Maybe we can have "per-cpu bytes" and "total bytes" columns instead? For per-cpu allocations these will be different, for all other allocations these two columns will contain the same number. 
> > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > of interest to people looking at optimizing allocations to make sure > > > > > they're on the right numa node? > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > opinion about this feature ? Thanks! > > > > > > I would like to see some other opinions from potential users, have you > > > been circulating it? > > > > We have been using it internally for a while. I don't know who the > > potential users are and how to reach them so I am sharing it here to > > collect opinions from others. > > Should definitely have a separate Kconfig option. Have you measured > the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-03 15:00 ` Suren Baghdasaryan @ 2025-06-03 17:34 ` Kent Overstreet 2025-06-04 0:55 ` Casey Chen 1 sibling, 0 replies; 20+ messages in thread From: Kent Overstreet @ 2025-06-03 17:34 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: Casey Chen, linux-mm, yzhong On Tue, Jun 03, 2025 at 08:00:59AM -0700, Suren Baghdasaryan wrote: > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > ... > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > correct calculations. > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > though, a single character should suffice (we already have a header that > > > > can explain what it is) - if you're growing the width we don't want to > > > > overflow. > > > > > > > > > > Does it have a header ? > > > > Yes. See print_allocinfo_header(). > > I was thinking if instead of changing /proc/allocinfo format to > contain both total and per-node information we can keep it as is > (containing only totals) while exposing per-node information inside > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems > cleaner to me. > > I'm also not a fan of "percpu y" tags as that requires the reader to > know how many CPUs were in the system to make the calculation (you > might get the allocinfo content from a system you have no access to > and no additional information). Maybe we can have "per-cpu bytes" and > "total bytes" columns instead? 
For per-cpu allocations these will be > different, for all other allocations these two columns will contain > the same number. Maybe we can just report a single byte count, and multiply it by the number of CPUs for percpu allocations? Do we really need to know if a given allocation is percpu that often? ^ permalink raw reply [flat|nested] 20+ messages in thread
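A rough sketch of what Kent's suggestion could look like in
alloc_tag_to_text(), reusing CODETAG_PERCPU_ALLOC and alloc_tag_read() from
the patch; this only illustrates the idea and is not an agreed design:

static void alloc_tag_to_text_scaled(struct seq_buf *out, struct codetag *ct)
{
	struct alloc_tag *tag = ct_to_alloc_tag(ct);
	struct alloc_tag_counters counter = alloc_tag_read(tag);
	u64 bytes = counter.bytes;

	/* percpu allocations record only the per-CPU size, so scale it up */
	if (ct->flags & CODETAG_PERCPU_ALLOC)
		bytes *= num_possible_cpus();

	seq_buf_printf(out, "%12llu %8llu ", bytes, counter.calls);
	codetag_to_text(out, ct);
	seq_buf_putc(out, '\n');
}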
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-03 15:00 ` Suren Baghdasaryan 2025-06-03 17:34 ` Kent Overstreet @ 2025-06-04 0:55 ` Casey Chen 2025-06-04 15:21 ` Suren Baghdasaryan 1 sibling, 1 reply; 20+ messages in thread From: Casey Chen @ 2025-06-04 0:55 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > ... > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > correct calculations. > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > though, a single character should suffice (we already have a header that > > > > can explain what it is) - if you're growing the width we don't want to > > > > overflow. > > > > > > > > > > Does it have a header ? > > > > Yes. See print_allocinfo_header(). > > I was thinking if instead of changing /proc/allocinfo format to > contain both total and per-node information we can keep it as is > (containing only totals) while exposing per-node information inside > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems > cleaner to me. > The output of /sys/devices/system/node/node<node_no>/allocinfo is strictly limited to a single PAGE_SIZE and it cannot display stats for all tags. 
> I'm also not a fan of "percpu y" tags as that requires the reader to > know how many CPUs were in the system to make the calculation (you > might get the allocinfo content from a system you have no access to > and no additional information). Maybe we can have "per-cpu bytes" and > "total bytes" columns instead? For per-cpu allocations these will be > different, for all other allocations these two columns will contain > the same number. I plan to remove 'percpu y/n' from this patch and implement it later. > > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > they're on the right numa node? > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > opinion about this feature ? Thanks! > > > > > > > > I would like to see some other opinions from potential users, have you > > > > been circulating it? > > > > > > We have been using it internally for a while. I don't know who the > > > potential users are and how to reach them so I am sharing it here to > > > collect opinions from others. > > > > Should definitely have a separate Kconfig option. Have you measured > > the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-04 0:55 ` Casey Chen @ 2025-06-04 15:21 ` Suren Baghdasaryan 2025-06-04 15:50 ` Kent Overstreet 2025-06-10 0:21 ` Casey Chen 0 siblings, 2 replies; 20+ messages in thread From: Suren Baghdasaryan @ 2025-06-04 15:21 UTC (permalink / raw) To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote: > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > > ... > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > > correct calculations. > > > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > > though, a single character should suffice (we already have a header that > > > > > can explain what it is) - if you're growing the width we don't want to > > > > > overflow. > > > > > > > > > > > > > Does it have a header ? > > > > > > Yes. See print_allocinfo_header(). > > > > I was thinking if instead of changing /proc/allocinfo format to > > contain both total and per-node information we can keep it as is > > (containing only totals) while exposing per-node information inside > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems > > cleaner to me. 
> > > > The output of /sys/devices/system/node/node<node_no>/allocinfo is > strictly limited to a single PAGE_SIZE and it cannot display stats for > all tags. Ugh, that's a pity. Another option would be to add "nid" column like this when this config is specified: nid bytes calls 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc 1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc ... It bloats the file size but looks more structured to me. > > > I'm also not a fan of "percpu y" tags as that requires the reader to > > know how many CPUs were in the system to make the calculation (you > > might get the allocinfo content from a system you have no access to > > and no additional information). Maybe we can have "per-cpu bytes" and > > "total bytes" columns instead? For per-cpu allocations these will be > > different, for all other allocations these two columns will contain > > the same number. > > I plan to remove 'percpu y/n' from this patch and implement it later. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > > they're on the right numa node? > > > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > > opinion about this feature ? Thanks! > > > > > > > > > > I would like to see some other opinions from potential users, have you > > > > > been circulating it? > > > > > > > > We have been using it internally for a while. I don't know who the > > > > potential users are and how to reach them so I am sharing it here to > > > > collect opinions from others. > > > > > > Should definitely have a separate Kconfig option. Have you measured > > > the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
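One way the per-node rows Suren sketches could be emitted, reusing
alloc_tag_read_nid() and num_numa_nodes from the patch; again just a sketch of
the output format, with the function name made up:

static void alloc_tag_to_text_per_nid(struct seq_buf *out, struct codetag *ct)
{
	struct alloc_tag *tag = ct_to_alloc_tag(ct);
	int nid;

	/* one line per NUMA node: nid, bytes, calls, then the call site */
	for (nid = 0; nid < num_numa_nodes; nid++) {
		struct alloc_tag_counters c = alloc_tag_read_nid(tag, nid);

		seq_buf_printf(out, "%3d %12llu %8llu ", nid, c.bytes, c.calls);
		codetag_to_text(out, ct);
		seq_buf_putc(out, '\n');
	}
}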
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-04 15:21 ` Suren Baghdasaryan @ 2025-06-04 15:50 ` Kent Overstreet 2025-06-10 0:21 ` Casey Chen 1 sibling, 0 replies; 20+ messages in thread From: Kent Overstreet @ 2025-06-04 15:50 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: Casey Chen, linux-mm, yzhong On Wed, Jun 04, 2025 at 08:21:48AM -0700, Suren Baghdasaryan wrote: > On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote: > > > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > > > ... > > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > > > correct calculations. > > > > > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > > > though, a single character should suffice (we already have a header that > > > > > > can explain what it is) - if you're growing the width we don't want to > > > > > > overflow. > > > > > > > > > > > > > > > > Does it have a header ? > > > > > > > > Yes. See print_allocinfo_header(). > > > > > > I was thinking if instead of changing /proc/allocinfo format to > > > contain both total and per-node information we can keep it as is > > > (containing only totals) while exposing per-node information inside > > > new /sys/devices/system/node/node<node_no>/allocinfo files. 
That seems > > > cleaner to me. > > > > > > > The output of /sys/devices/system/node/node<node_no>/allocinfo is > > strictly limited to a single PAGE_SIZE and it cannot display stats for > > all tags. > > Ugh, that's a pity. Another option would be to add "nid" column like > this when this config is specified: > > nid bytes calls > 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc > 1 0 0 kernel/irq/irqdesc.c:425 > func:alloc_desc > ... > > It bloats the file size but looks more structured to me. Debugfs is also an option. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-04 15:21 ` Suren Baghdasaryan 2025-06-04 15:50 ` Kent Overstreet @ 2025-06-10 0:21 ` Casey Chen 2025-06-10 15:56 ` Suren Baghdasaryan 1 sibling, 1 reply; 20+ messages in thread From: Casey Chen @ 2025-06-10 0:21 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong On Wed, Jun 4, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote: > > > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > > > ... > > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > > > correct calculations. > > > > > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > > > though, a single character should suffice (we already have a header that > > > > > > can explain what it is) - if you're growing the width we don't want to > > > > > > overflow. > > > > > > > > > > > > > > > > Does it have a header ? > > > > > > > > Yes. See print_allocinfo_header(). 
> > > > > > I was thinking if instead of changing /proc/allocinfo format to > > > contain both total and per-node information we can keep it as is > > > (containing only totals) while exposing per-node information inside > > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems > > > cleaner to me. > > > > > > > The output of /sys/devices/system/node/node<node_no>/allocinfo is > > strictly limited to a single PAGE_SIZE and it cannot display stats for > > all tags. > > Ugh, that's a pity. Another option would be to add "nid" column like > this when this config is specified: > > nid bytes calls > 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc > 1 0 0 kernel/irq/irqdesc.c:425 > func:alloc_desc > ... > > It bloats the file size but looks more structured to me. > How about this format ? With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y, /proc/allocinfo looks like: allocinfo - version: 1.0 <nid> <size> <calls> <tag info> 0 0 init/main.c:1310 func:do_initcalls 0 0 0 1 0 0 ... 776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq 0 348672 681 1 428032 836 6144 6 kernel/workqueue.c:4133 func:get_unbound_pool 0 4096 4 1 2048 2 With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, /proc/allocinfo stays same as before: allocinfo - version: 1.0 <nid> <size> <calls> <tag info> 0 0 init/main.c:1310 func:do_initcalls 0 0 init/do_mounts.c:350 func:mount_nodev_root 0 0 init/do_mounts.c:187 func:mount_root_generic ... > > > > > I'm also not a fan of "percpu y" tags as that requires the reader to > > > know how many CPUs were in the system to make the calculation (you > > > might get the allocinfo content from a system you have no access to > > > and no additional information). Maybe we can have "per-cpu bytes" and > > > "total bytes" columns instead? For per-cpu allocations these will be > > > different, for all other allocations these two columns will contain > > > the same number. > > > > I plan to remove 'percpu y/n' from this patch and implement it later. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > > > they're on the right numa node? > > > > > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > > > opinion about this feature ? Thanks! > > > > > > > > > > > > I would like to see some other opinions from potential users, have you > > > > > > been circulating it? > > > > > > > > > > We have been using it internally for a while. I don't know who the > > > > > potential users are and how to reach them so I am sharing it here to > > > > > collect opinions from others. > > > > > > > > Should definitely have a separate Kconfig option. 
Have you measured > > > > the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
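For illustration only: a minimal sketch of what the kernel-side printing for the per-node layout proposed above could look like when CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y. It reuses alloc_tag_read()/alloc_tag_read_nid() and num_numa_nodes from the posted patch, but the function name, column widths, and "nid" row format are assumptions for the sketch, not code from the patch.

```
/* Illustrative sketch, not the posted patch. */
#include <linux/alloc_tag.h>
#include <linux/seq_buf.h>

/* Print one tag's totals, then one indented row per NUMA node. */
static void sketch_print_tag(struct seq_buf *out, struct alloc_tag *tag)
{
	struct alloc_tag_counters total = alloc_tag_read(tag);
	int nid;

	seq_buf_printf(out, "%12llu %8llu ", total.bytes, total.calls);
	codetag_to_text(out, &tag->ct);
	seq_buf_putc(out, '\n');

	if (!IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS))
		return;

	for (nid = 0; nid < num_numa_nodes; nid++) {
		struct alloc_tag_counters nct = alloc_tag_read_nid(tag, nid);

		seq_buf_printf(out, "  nid%-2d %12llu %8llu\n",
			       nid, nct.bytes, nct.calls);
	}
}
```

With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n the per-node loop is compiled out and the output stays in the existing one-row-per-tag format.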
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-10 0:21 ` Casey Chen @ 2025-06-10 15:56 ` Suren Baghdasaryan 0 siblings, 0 replies; 20+ messages in thread From: Suren Baghdasaryan @ 2025-06-10 15:56 UTC (permalink / raw) To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong On Mon, Jun 9, 2025 at 5:22 PM Casey Chen <cachen@purestorage.com> wrote: > > On Wed, Jun 4, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > > > > ... > > > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > > > > correct calculations. > > > > > > > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > > > > though, a single character should suffice (we already have a header that > > > > > > > can explain what it is) - if you're growing the width we don't want to > > > > > > > overflow. > > > > > > > > > > > > > > > > > > > Does it have a header ? > > > > > > > > > > Yes. See print_allocinfo_header(). 
> > > > > > > > I was thinking if instead of changing /proc/allocinfo format to > > > > contain both total and per-node information we can keep it as is > > > > (containing only totals) while exposing per-node information inside > > > > new /sys/devices/system/node/node<node_no>/allocinfo files. That seems > > > > cleaner to me. > > > > > > > > > > The output of /sys/devices/system/node/node<node_no>/allocinfo is > > > strictly limited to a single PAGE_SIZE and it cannot display stats for > > > all tags. > > > > Ugh, that's a pity. Another option would be to add "nid" column like > > this when this config is specified: > > > > nid bytes calls > > 0 8588 2147 kernel/irq/irqdesc.c:425 func:alloc_desc > > 1 0 0 kernel/irq/irqdesc.c:425 > > func:alloc_desc > > ... > > > > It bloats the file size but looks more structured to me. > > > > How about this format ? > > With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y, /proc/allocinfo looks like: > allocinfo - version: 1.0 > <nid> <size> <calls> <tag info> > 0 0 init/main.c:1310 func:do_initcalls > 0 0 0 > 1 0 0 If we go that way then why not: allocinfo - version: 2.0 <size> <calls> <tag info> 776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq nid0 348672 681 nid1 428032 836 6144 6 kernel/workqueue.c:4133 func:get_unbound_pool nid0 4096 4 nid1 2048 2 ... If CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n the file format will not change. > ... > 776704 1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq > 0 348672 681 > 1 428032 836 > 6144 6 kernel/workqueue.c:4133 func:get_unbound_pool > 0 4096 4 > 1 2048 2 > > With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, /proc/allocinfo > stays same as before: > allocinfo - version: 1.0 > <nid> <size> <calls> <tag info> > 0 0 init/main.c:1310 func:do_initcalls > 0 0 init/do_mounts.c:350 func:mount_nodev_root > 0 0 init/do_mounts.c:187 func:mount_root_generic > ... > > > > > > > > I'm also not a fan of "percpu y" tags as that requires the reader to > > > > know how many CPUs were in the system to make the calculation (you > > > > might get the allocinfo content from a system you have no access to > > > > and no additional information). Maybe we can have "per-cpu bytes" and > > > > "total bytes" columns instead? For per-cpu allocations these will be > > > > different, for all other allocations these two columns will contain > > > > the same number. > > > > > > I plan to remove 'percpu y/n' from this patch and implement it later. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > > > > they're on the right numa node? > > > > > > > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > > > > optimizations. I can make it a kconfig. 
Does anybody else have any > > > > > > > > opinion about this feature ? Thanks! > > > > > > > > > > > > > > I would like to see some other opinions from potential users, have you > > > > > > > been circulating it? > > > > > > > > > > > > We have been using it internally for a while. I don't know who the > > > > > > potential users are and how to reach them so I am sharing it here to > > > > > > collect opinions from others. > > > > > > > > > > Should definitely have a separate Kconfig option. Have you measured > > > > > the memory and performance overhead of this change? ^ permalink raw reply [flat|nested] 20+ messages in thread
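For illustration, a minimal userspace sketch of how a consumer might parse the "version: 2.0" layout proposed above, treating lines that begin with "nid" as per-NUMA sub-rows of the preceding tag. This is an assumption-laden example against the proposed (not yet merged) format, not an existing tool.

```
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *f = fopen("/proc/allocinfo", "r");
	char line[1024];
	unsigned long long node_bytes[64] = { 0 };
	unsigned long long bytes, calls;
	int nid, max_nid = -1;

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char *p = line;

		/* skip the indentation in front of per-node sub-rows */
		while (*p == ' ' || *p == '\t')
			p++;

		/* "nid0 348672 681" -> node 0, 348672 bytes, 681 calls */
		if (sscanf(p, "nid%d %llu %llu", &nid, &bytes, &calls) == 3 &&
		    nid >= 0 && nid < 64) {
			node_bytes[nid] += bytes;
			if (nid > max_nid)
				max_nid = nid;
		}
	}
	fclose(f);

	for (nid = 0; nid <= max_nid; nid++)
		printf("node %d: %llu bytes\n", nid, node_bytes[nid]);
	return 0;
}
```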
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 21:32 ` Suren Baghdasaryan 2025-06-03 15:00 ` Suren Baghdasaryan @ 2025-06-03 20:00 ` Casey Chen 2025-06-03 20:18 ` Suren Baghdasaryan 1 sibling, 1 reply; 20+ messages in thread From: Casey Chen @ 2025-06-03 20:00 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: Kent Overstreet, linux-mm, yzhong On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > ... > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > correct calculations. > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > though, a single character should suffice (we already have a header that > > > can explain what it is) - if you're growing the width we don't want to > > > overflow. > > > > > > > Does it have a header ? > > Yes. See print_allocinfo_header(). > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > of interest to people looking at optimizing allocations to make sure > > > > > they're on the right numa node? 
> > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > opinion about this feature ? Thanks! > > > > > > I would like to see some other opinions from potential users, have you > > > been circulating it? > > > > We have been using it internally for a while. I don't know who the > > potential users are and how to reach them so I am sharing it here to > > collect opinions from others. > > Should definitely have a separate Kconfig option. Have you measured > the memory and performance overhead of this change? I can make it a Kconfig option. Let's say, CONFIG_MEM_ALLOC_PER_NUMA_STATS=y/n, which controls the number of counter per CPU. If CONFIG_MEM_ALLOC_PER_NUMA_STATS=y, num_counter_percpu = num_possible_nodes(), else num_counter_percpu = 1. There is some memory cost. Additional memory used = Number of additional NUMA nodes * Number of CPUs * Number of tags * Size of each counter For example, in one of my testbeds, additional memory used = 1 (two nodes in total) * 112 (number of CPUs) * 4540 (number of tags) * 16 (size of counter), ~8MiB. This testbed has a total 755 GiB of memory. ^ permalink raw reply [flat|nested] 20+ messages in thread
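For reference, the overhead estimate quoted above written out as a formula; the helper name is hypothetical and the numbers are the ones from the message (1 extra node, 112 CPUs, 4540 tags, 16-byte counter).

```
/* Rough upper bound on the extra per-NUMA counter memory. */
static size_t per_numa_stats_overhead(int extra_nodes, int nr_cpus,
				      int nr_tags)
{
	return (size_t)extra_nodes * nr_cpus * nr_tags *
	       sizeof(struct alloc_tag_counters);
}

/* 1 * 112 * 4540 * 16 bytes = 8,135,680 bytes, i.e. ~7.8 MiB */
```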
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-03 20:00 ` Casey Chen @ 2025-06-03 20:18 ` Suren Baghdasaryan 0 siblings, 0 replies; 20+ messages in thread From: Suren Baghdasaryan @ 2025-06-03 20:18 UTC (permalink / raw) To: Casey Chen; +Cc: Kent Overstreet, linux-mm, yzhong On Tue, Jun 3, 2025 at 1:01 PM Casey Chen <cachen@purestorage.com> wrote: > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote: > > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > ... > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > correct calculations. > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > though, a single character should suffice (we already have a header that > > > > can explain what it is) - if you're growing the width we don't want to > > > > overflow. > > > > > > > > > > Does it have a header ? > > > > Yes. See print_allocinfo_header(). > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. 
> > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > they're on the right numa node? > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > opinion about this feature ? Thanks! > > > > > > > > I would like to see some other opinions from potential users, have you > > > > been circulating it? > > > > > > We have been using it internally for a while. I don't know who the > > > potential users are and how to reach them so I am sharing it here to > > > collect opinions from others. > > > > Should definitely have a separate Kconfig option. Have you measured > > the memory and performance overhead of this change? > > I can make it a Kconfig option. Let's say, > CONFIG_MEM_ALLOC_PER_NUMA_STATS=y/n, which controls the number of > counter per CPU. > If CONFIG_MEM_ALLOC_PER_NUMA_STATS=y, num_counter_percpu = > num_possible_nodes(), else num_counter_percpu = 1. > > There is some memory cost. Additional memory used = Number of > additional NUMA nodes * Number of CPUs * Number of tags * Size of each > counter > For example, in one of my testbeds, additional memory used = 1 (two > nodes in total) * 112 (number of CPUs) * 4540 (number of tags) * 16 > (size of counter), ~8MiB. This testbed has a total 755 GiB of memory. Please add these numbers in the changelog when you post the next version. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 20:48 ` Casey Chen 2025-06-02 21:32 ` Suren Baghdasaryan @ 2025-06-02 21:52 ` Kent Overstreet 2025-06-02 22:08 ` Steven Rostedt 1 sibling, 1 reply; 20+ messages in thread From: Kent Overstreet @ 2025-06-02 21:52 UTC (permalink / raw) To: Casey Chen Cc: linux-mm, surenb, yzhong, Steven Rostedt, Peter Zijlstra, Ingo Molnar +cc Steven, Peter, Ingo On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote: > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > For example, the resulting file looks like below. > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > ... > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > memory used. Any /proc/allocinfo parser can understand it and make > > > correct calculations. > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > though, a single character should suffice (we already have a header that > > can explain what it is) - if you're growing the width we don't want to > > overflow. > > > > Does it have a header ? > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > of interest to people looking at optimizing allocations to make sure > > > > they're on the right numa node? > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > optimizations. I can make it a kconfig. Does anybody else have any > > > opinion about this feature ? Thanks! 
> > > > I would like to see some other opinions from potential users, have you > > been circulating it? > > We have been using it internally for a while. I don't know who the > potential users are and how to reach them so I am sharing it here to > collect opinions from others. I'd ask the tracing and profiling people for their thoughts, and anyone working on tooling that might consume this. I'm wondering if there might be some way of feeding more info into perf, since profiling cache misses is a big thing that it does. It might be a long shot, since we're just accounting usage, or it might spark some useful ideas. Can you share a bit about how you're using this internally? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 21:52 ` Kent Overstreet @ 2025-06-02 22:08 ` Steven Rostedt 2025-06-02 23:35 ` Kent Overstreet 0 siblings, 1 reply; 20+ messages in thread From: Steven Rostedt @ 2025-06-02 22:08 UTC (permalink / raw) To: Kent Overstreet Cc: Casey Chen, linux-mm, surenb, yzhong, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Masami Hiramatsu, Arnaldo Carvalho de Melo, Ian Rogers On Mon, 2 Jun 2025 17:52:49 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > +cc Steven, Peter, Ingo > > On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote: > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > ... > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > correct calculations. > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > though, a single character should suffice (we already have a header that > > > can explain what it is) - if you're growing the width we don't want to > > > overflow. > > > > > > > Does it have a header ? > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. 
> > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > of interest to people looking at optimizing allocations to make sure > > > > > they're on the right numa node? > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > opinion about this feature ? Thanks! > > > > > > I would like to see some other opinions from potential users, have you > > > been circulating it? > > > > We have been using it internally for a while. I don't know who the > > potential users are and how to reach them so I am sharing it here to > > collect opinions from others. > > I'd ask the tracing and profiling people for their thoughts, and anyone > working on tooling that might consume this. > > I'm wondering if there might be some way of feeding more info into perf, > since profiling cache misses is a big thing that it does. > > It might be a long shot, since we're just accounting usage, or it might > spark some useful ideas. > > Can you share a bit about how you're using this internally? I'm guessing this is to show where in the kernel functions are using memory? I added to the Cc people who tend to use perf for analysis then just having those that maintain the kernel side of perf. -- Steve ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 22:08 ` Steven Rostedt @ 2025-06-02 23:35 ` Kent Overstreet 2025-06-03 6:46 ` Ian Rogers 0 siblings, 1 reply; 20+ messages in thread From: Kent Overstreet @ 2025-06-02 23:35 UTC (permalink / raw) To: Steven Rostedt Cc: Casey Chen, linux-mm, surenb, yzhong, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Masami Hiramatsu, Arnaldo Carvalho de Melo, Ian Rogers On Mon, Jun 02, 2025 at 06:08:26PM -0400, Steven Rostedt wrote: > On Mon, 2 Jun 2025 17:52:49 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > +cc Steven, Peter, Ingo > > > > On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote: > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > ... > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > correct calculations. > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > though, a single character should suffice (we already have a header that > > > > can explain what it is) - if you're growing the width we don't want to > > > > overflow. > > > > > > > > > > Does it have a header ? > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. 
> > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > they're on the right numa node? > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > opinion about this feature ? Thanks! > > > > > > > > I would like to see some other opinions from potential users, have you > > > > been circulating it? > > > > > > We have been using it internally for a while. I don't know who the > > > potential users are and how to reach them so I am sharing it here to > > > collect opinions from others. > > > > I'd ask the tracing and profiling people for their thoughts, and anyone > > working on tooling that might consume this. > > > > I'm wondering if there might be some way of feeding more info into perf, > > since profiling cache misses is a big thing that it does. > > > > It might be a long shot, since we're just accounting usage, or it might > > spark some useful ideas. > > > > Can you share a bit about how you're using this internally? > > I'm guessing this is to show where in the kernel functions are using memory? Exactly Now that we've got a mapping from address to source location that owns it, I'm wondering if there's anything else we can do with it. > I added to the Cc people who tend to use perf for analysis then just having > those that maintain the kernel side of perf. Perfect, thanks ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] alloc_tag: add per-numa node stats 2025-06-02 23:35 ` Kent Overstreet @ 2025-06-03 6:46 ` Ian Rogers 0 siblings, 0 replies; 20+ messages in thread From: Ian Rogers @ 2025-06-03 6:46 UTC (permalink / raw) To: Kent Overstreet Cc: Steven Rostedt, Casey Chen, linux-mm, surenb, yzhong, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Masami Hiramatsu, Arnaldo Carvalho de Melo On Mon, Jun 2, 2025 at 4:35 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Mon, Jun 02, 2025 at 06:08:26PM -0400, Steven Rostedt wrote: > > On Mon, 2 Jun 2025 17:52:49 -0400 > > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > +cc Steven, Peter, Ingo > > > > > > On Mon, Jun 02, 2025 at 01:48:43PM -0700, Casey Chen wrote: > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote: > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet > > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote: > > > > > > > > The patch is based 4aab42ee1e4e ("mm/zblock: make active_list rcu_list") > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > > > > > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per-NUMA > > > > > > > > nodes are displayed in a single row for each alloc_tag in /proc/allocinfo. > > > > > > > > Also percpu allocation is marked and its stats is stored on NUMA node 0. > > > > > > > > For example, the resulting file looks like below. > > > > > > > > > > > > > > > > percpu y total 8588 2147 numa0 8588 2147 numa1 0 0 kernel/irq/irqdesc.c:425 func:alloc_desc > > > > > > > > percpu n total 447232 1747 numa0 269568 1053 numa1 177664 694 lib/maple_tree.c:165 func:mt_alloc_bulk > > > > > > > > percpu n total 83200 325 numa0 30976 121 numa1 52224 204 lib/maple_tree.c:160 func:mt_alloc_one > > > > > > > > ... > > > > > > > > percpu n total 364800 5700 numa0 109440 1710 numa1 255360 3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg > > > > > > > > percpu n total 1249280 39040 numa0 374784 11712 numa1 874496 27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box > > > > > > > > > > > > > > Err, what is 'percpu y/n'? > > > > > > > > > > > > > > > > > > > Mark percpu allocation with 'percpu y/n' because for percpu allocation > > > > > > stats, 'bytes' is per-cpu, we have to multiply it by the number of > > > > > > CPUs to get the total bytes. Mark it so we know the exact amount of > > > > > > memory used. Any /proc/allocinfo parser can understand it and make > > > > > > correct calculations. > > > > > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten that > > > > > though, a single character should suffice (we already have a header that > > > > > can explain what it is) - if you're growing the width we don't want to > > > > > overflow. > > > > > > > > > > > > > Does it have a header ? > > > > > > > > > > > > > > > > > > > > > > > > > > To save memory, we dynamically allocate per-NUMA node stats counter once the > > > > > > > > system boots up and knows how many NUMA nodes available. percpu allocators > > > > > > > > are used for memory allocation hence increase PERCPU_DYNAMIC_RESERVE. 
> > > > > > > > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called so the memory for > > > > > > > > these counters are not accounted in profiling stats. > > > > > > > > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and memory is accounted. > > > > > > > > > > > > > > Intruiging, but I'd make it a kconfig option, AFAIK this would mainly be > > > > > > > of interest to people looking at optimizing allocations to make sure > > > > > > > they're on the right numa node? > > > > > > > > > > > > Yes, to help us know if there is an NUMA imbalance issue and make some > > > > > > optimizations. I can make it a kconfig. Does anybody else have any > > > > > > opinion about this feature ? Thanks! > > > > > > > > > > I would like to see some other opinions from potential users, have you > > > > > been circulating it? > > > > > > > > We have been using it internally for a while. I don't know who the > > > > potential users are and how to reach them so I am sharing it here to > > > > collect opinions from others. > > > > > > I'd ask the tracing and profiling people for their thoughts, and anyone > > > working on tooling that might consume this. > > > > > > I'm wondering if there might be some way of feeding more info into perf, > > > since profiling cache misses is a big thing that it does. > > > > > > It might be a long shot, since we're just accounting usage, or it might > > > spark some useful ideas. > > > > > > Can you share a bit about how you're using this internally? > > > > I'm guessing this is to show where in the kernel functions are using memory? > > Exactly > > Now that we've got a mapping from address to source location that owns > it, I'm wondering if there's anything else we can do with it. > > > I added to the Cc people who tend to use perf for analysis then just having > > those that maintain the kernel side of perf. > > Perfect, thanks This looks nice! In the perf tool we already do some /proc processing and map the data into what looks like a perf event: https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/tool_pmu.c?h=perf-tools-next#n261 ``` $ perf stat -e user_time,system_time true Performance counter stats for 'true': 350,000 user_time 2,054,000 system_time 0.002222811 seconds time elapsed 0.000350000 seconds user 0.002054000 seconds sys ``` There's no reason we can't do memory information, the patch series I sent adding DRM information (unmerged) contains it: https://lore.kernel.org/lkml/20250403202439.57791-4-irogers@google.com/ Perf supports per-NUMA node aggregation and even has patches to make it more accurate on Intel for sub-NUMA systems: https://lore.kernel.org/lkml/20250515181417.491401-1-irogers@google.com/ It may be there are advantages to having the perf tool only events be kernel events longer term (supporting sampling being a key one) but making changes in the tool is fast and convenient. It can be nice to do things like dump counts every second: ``` $ perf stat -e temp_cpu,fan1 -I 1000 # time counts unit events 1.001152826 34.00 'C temp_cpu 1.001152826 2,570 rpm fan1 2.008358661 34.00 'C temp_cpu 2.008358661 2,572 rpm fan1 3.015209566 34.00 'C temp_cpu 3.015209566 2,570 rpm fan1 ... ``` Thanks, Ian ^ permalink raw reply [flat|nested] 20+ messages in thread