* [PATCH v2 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering
@ 2026-04-19 15:55 Zhen Ni
  2026-04-19 15:55 ` [PATCH v2 1/3] mm/page_owner: add filter infrastructure Zhen Ni
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Zhen Ni @ 2026-04-19 15:55 UTC
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

This patch series introduces filtering capabilities to the page_owner
feature to address storage and performance challenges in production
environments.

Changes from v1:
- Renamed 'compact' to 'print_mode' with enum type for better clarity
  * PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
  * PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed NUMA filter from single node to nodelist with bitmask support
  * Uses nodelist_parse() to support "0", "0,2", "0-3", "0,2-4,7" formats
  * Uses nodemask_t internally for efficient multi-node filtering
  * Output uses %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
  * Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input

These changes address feedback from v1 review:
- "compact" was too vague → use descriptive enum (PAGE_OWNER_PRINT_*)
- Single node filter was limiting → use nodelist_parse() for multi-node support
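The write limit listed in the v2 changes can be sanity-checked with a
little arithmetic. The following is a minimal userspace sketch, not code
from the series: the function names are hypothetical, and the value 1024
used in the note below is only an assumed MAX_NUMNODES. It compares the
limit against the length of a list that names every node individually
("0,1,2,...").

```c
#include <stddef.h>

/* The limit used by nid_filter_write(): 100 + 6 * MAX_NUMNODES. */
size_t sketch_write_limit(size_t max_nodes)
{
	return 100 + 6 * max_nodes;
}

/*
 * Length in bytes of "0,1,2,...,max_nodes-1" (hypothetical helper):
 * the decimal digits of every node id plus one comma between entries.
 */
size_t sketch_full_list_len(size_t max_nodes)
{
	size_t len = 0;

	for (size_t n = 0; n < max_nodes; n++) {
		size_t digits = 1;

		for (size_t v = n; v >= 10; v /= 10)
			digits++;
		len += digits;
		if (n + 1 < max_nodes)
			len++;	/* separating comma */
	}
	return len;
}
```

Under the assumed MAX_NUMNODES of 1024, the limit is 6244 bytes while the
fully enumerated list needs 4009 bytes, so such input fits with room to
spare.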

Problem Statement
=================

In production environments with large memory configurations (e.g., 250GB+),
collecting page_owner information often results in files ranging from
several gigabytes to over 10GB. This creates significant challenges:

1. Storage pressure on production systems
2. Difficulty transferring large files from production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c

The primary contributor to file size is redundant stack trace
information. Although the kernel already deduplicates stacks via
stackdepot, page_owner prints the full stack trace for every page,
and the traces are only deduplicated again during post-processing.

Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes),
OOM events are often node-specific rather than system-wide.
Currently, page_owner cannot filter by NUMA node, forcing users to
collect and analyze data for all nodes.

Solution
========

This patch series introduces a flexible filter infrastructure with
two initial filters:

1. **Print Mode Filter**: Outputs only stack handles instead of
   full stack traces. The handle-to-stack mapping can be retrieved
   from the existing show_stacks_handles interface. This dramatically
   reduces output size while preserving all allocation metadata.

2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
   using flexible nodelist format, enabling targeted analysis of memory
   issues in NUMA-aware deployments.
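To illustrate the nodelist semantics described above, here is a minimal
userspace sketch. It is not the kernel's nodelist_parse(), which accepts
more syntax than this; all sketch_* names and the 64-node cap are
assumptions made for the example. It parses inputs such as "0,2-4,7"
into a bitmask and treats an empty mask as "no filtering", matching the
behaviour described for the nid filter.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 64	/* hypothetical cap so the mask fits in a uint64_t */

/* Parse one decimal node id at *s and advance *s past it. */
int sketch_parse_id(const char **s, unsigned long *id)
{
	char *end;

	if (**s < '0' || **s > '9')
		return -EINVAL;
	*id = strtoul(*s, &end, 10);
	*s = end;
	return 0;
}

/* Parse "N", "N,M", "N-M" or mixtures of these; one trailing newline is allowed. */
int sketch_nodelist_parse(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	for (;;) {
		unsigned long lo, hi;

		if (sketch_parse_id(&s, &lo))
			return -EINVAL;
		hi = lo;
		if (*s == '-') {
			s++;
			if (sketch_parse_id(&s, &hi))
				return -EINVAL;
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			m |= UINT64_C(1) << n;
		if (*s != ',')
			break;
		s++;
	}
	if (*s == '\n')
		s++;
	if (*s != '\0')
		return -EINVAL;
	*mask = m;
	return 0;
}

/* An empty mask disables filtering; otherwise the node's bit must be set. */
int sketch_node_allowed(uint64_t mask, unsigned int nid)
{
	if (mask == 0)
		return 1;
	return nid < SKETCH_MAX_NODES && ((mask >> nid) & 1);
}
```

With this sketch, "0,2-3" yields the mask 0xd, so a page on node 1 would
be skipped and a page on node 3 would be reported.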

Implementation
==============

The series is structured as follows:

- Patch 1: Add filter infrastructure (data structures and
  debugfs directory)
- Patch 2: Implement print_mode filter
- Patch 3: Implement NUMA node filter with nodelist support

Usage Example
=============

Enable print_mode and filter for NUMA nodes 0,2-3:

    # cd /sys/kernel/debug/page_owner_filter/
    # echo 1 > print_mode
    # echo "0,2-3" > nid
    # cat /sys/kernel/debug/page_owner > page_owner.txt

Sample print_mode output (showing handles only):

    Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
    Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577

    Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
    __GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
    Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577
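Because each record in the sample above ends with a single "handle: N"
line, post-processing can group records by handle without touching any
stack text. The following is a minimal userspace sketch of that idea,
not part of the series or of tools/mm/page_owner_sort.c; the struct and
function names are hypothetical, and the fixed-size linear table is
chosen only to keep the example short.

```c
#include <stdio.h>

/* One tally entry: a stack handle and the number of records that carry it. */
struct handle_count {
	unsigned int handle;
	unsigned long records;
};

/* Return 1 and store the handle if the line is a "handle: N" line, else 0. */
int parse_handle_line(const char *line, unsigned int *handle)
{
	while (*line == ' ' || *line == '\t')
		line++;
	return sscanf(line, "handle: %u", handle) == 1;
}

/*
 * Count one record for the given handle. Returns the new number of
 * entries in use, or -1 if the table is full.
 */
long tally_handle(struct handle_count *tab, long n, long cap,
		  unsigned int handle)
{
	for (long i = 0; i < n; i++) {
		if (tab[i].handle == handle) {
			tab[i].records++;
			return n;
		}
	}
	if (n >= cap)
		return -1;
	tab[n].handle = handle;
	tab[n].records = 1;
	return n + 1;
}

/* Read print_mode output from a stream and tally records per handle. */
long count_handles(FILE *in, struct handle_count *tab, long cap)
{
	char line[1024];
	unsigned int handle;
	long n = 0;

	while (fgets(line, sizeof(line), in)) {
		if (!parse_handle_line(line, &handle))
			continue;
		n = tally_handle(tab, n, cap, handle);
		if (n < 0)
			return -1;
	}
	return n;
}
```

The resulting handles can then be matched against the output of the
show_stacks_handles interface to recover the stack for each group.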

Testing
=======

Tested on a system with multiple NUMA nodes. Verified that:
- Filters work independently and in combination
- print_mode output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- NUMA filter works with single node, multiple nodes, and ranges

Example test session:
    # cat print_mode
    0
    # echo "0,1-2" > nid
    # cat nid
    0-2
    # echo "0,2-3" > nid
    # cat nid
    0,2-3
    # echo 1 > print_mode
    # head -n 100 /sys/kernel/debug/page_owner
    [Shows print_mode output with handles only]

Future Enhancements
===================

The filter infrastructure is designed to be extensible. Potential
future filters could include:
- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>

---

Zhen Ni (3):
  mm/page_owner: add filter infrastructure
  mm/page_owner: add print_mode filter
  mm/page_owner: add NUMA node filter with nodelist support

 mm/page_owner.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 2 deletions(-)

--
2.20.1



* [PATCH v2 1/3] mm/page_owner: add filter infrastructure
From: Zhen Ni @ 2026-04-19 15:55 UTC
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

Add the data structures for page_owner filtering and create a
debugfs directory for the filter controls.

This adds:
- enum page_owner_print_mode with values for full_stack and stack_handle
- struct page_owner_filter with print_mode and nid_mask fields
- Static owner_filter instance initialized with default values
- page_owner_filter debugfs directory

The filter infrastructure will be used to add print_mode and NUMA node
filtering capabilities in subsequent commits.

Link: https://lore.kernel.org/linux-mm/20260417154638.22370-2-zhen.ni@easystack.cn/
Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v2:
- Use enum page_owner_print_mode instead of bool 'compact' for better clarity
- Use nodemask_t instead of int 'nid' to support multi-node filtering
---
 mm/page_owner.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 8178e0be557f..5884d883837e 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -54,6 +54,21 @@ struct stack_print_ctx {
 	u8 flags;
 };
 
+enum page_owner_print_mode {
+	PAGE_OWNER_PRINT_FULL_STACK,
+	PAGE_OWNER_PRINT_STACK_HANDLE,
+};
+
+struct page_owner_filter {
+	enum page_owner_print_mode print_mode;
+	nodemask_t nid_mask;
+};
+
+static struct page_owner_filter owner_filter = {
+	.print_mode = PAGE_OWNER_PRINT_FULL_STACK,
+	.nid_mask = NODE_MASK_NONE,
+};
+
 static bool page_owner_enabled __initdata;
 DEFINE_STATIC_KEY_FALSE(page_owner_inited);
 
@@ -973,7 +988,7 @@ DEFINE_SIMPLE_ATTRIBUTE(page_owner_threshold_fops, &page_owner_threshold_get,
 
 static int __init pageowner_init(void)
 {
-	struct dentry *dir;
+	struct dentry *dir, *filter_dir;
 
 	if (!static_branch_unlikely(&page_owner_inited)) {
 		pr_info("page_owner is disabled\n");
@@ -981,6 +996,9 @@ static int __init pageowner_init(void)
 	}
 
 	debugfs_create_file("page_owner", 0400, NULL, NULL, &page_owner_fops);
+
+	filter_dir = debugfs_create_dir("page_owner_filter", NULL);
+
 	dir = debugfs_create_dir("page_owner_stacks", NULL);
 	debugfs_create_file("show_stacks", 0400, dir,
 			    (void *)(STACK_PRINT_FLAG_STACK |
-- 
2.20.1



* [PATCH v2 2/3] mm/page_owner: add print_mode filter
From: Zhen Ni @ 2026-04-19 15:55 UTC
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

Add print_mode functionality to reduce page_owner output size by
printing only the stack handle instead of the full stack trace.

Example output with print_mode enabled:
  Page allocated via order 0, mask 0x42800(GFP_NOWAIT|__GFP_COMP),
  pid 1, tgid 1 (systemd), ts 349667370 ns
  PFN 0xa00a2 type Unmovable Block 1280 type Unmovable
  Flags 0x33fffe0000004124(referenced|lru|active|private|node=3|zone=0|lastcpupid=0x1ffff)
  handle: 17432583
  Charged to memcg /

Print mode significantly reduces output size while preserving all
other page allocation information. The correspondence between handles
and stack traces can be obtained through the show_stacks_handles interface.

Link: https://lore.kernel.org/linux-mm/20260417154638.22370-3-zhen.ni@easystack.cn/
Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v2:
- Renamed from 'compact mode' to 'print_mode' for better clarity
- Use enum values (0=full_stack, 1=stack_handle) instead of boolean
- Update debugfs filename from 'compact' to 'print_mode'
---
 mm/page_owner.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5884d883837e..6d87b6948cfa 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -590,7 +590,13 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
+	/* Print mode: full stack or stack handle */
+	if (READ_ONCE(owner_filter.print_mode) == PAGE_OWNER_PRINT_STACK_HANDLE) {
+		ret += scnprintf(kbuf + ret, count - ret,
+				"handle: %u\n", handle);
+	} else {
+		ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
+	}
 	if (ret >= count)
 		goto err;
 
@@ -985,6 +991,24 @@ static int page_owner_threshold_set(void *data, u64 val)
 DEFINE_SIMPLE_ATTRIBUTE(page_owner_threshold_fops, &page_owner_threshold_get,
 			&page_owner_threshold_set, "%llu");
 
+static int page_owner_print_mode_get(void *data, u64 *val)
+{
+	*val = READ_ONCE(owner_filter.print_mode);
+	return 0;
+}
+
+static int page_owner_print_mode_set(void *data, u64 val)
+{
+	if (val > PAGE_OWNER_PRINT_STACK_HANDLE)
+		return -EINVAL;
+	WRITE_ONCE(owner_filter.print_mode, val);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(page_owner_print_mode_fops,
+			&page_owner_print_mode_get,
+			&page_owner_print_mode_set, "%llu");
+
 
 static int __init pageowner_init(void)
 {
@@ -998,6 +1022,8 @@ static int __init pageowner_init(void)
 	debugfs_create_file("page_owner", 0400, NULL, NULL, &page_owner_fops);
 
 	filter_dir = debugfs_create_dir("page_owner_filter", NULL);
+	debugfs_create_file("print_mode", 0600, filter_dir, NULL,
+			    &page_owner_print_mode_fops);
 
 	dir = debugfs_create_dir("page_owner_stacks", NULL);
 	debugfs_create_file("show_stacks", 0400, dir,
-- 
2.20.1



* [PATCH v2 3/3] mm/page_owner: add NUMA node filter with nodelist support
From: Zhen Ni @ 2026-04-19 15:55 UTC
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

Add NUMA node filtering functionality to page_owner to allow
filtering pages by specific NUMA node(s) using nodelist format.

The filter allows users to focus on pages from specific NUMA nodes,
which is useful for NUMA-aware memory allocation analysis and debugging.

Supported input formats:
- Single node: echo "2" > nid
- Multiple nodes: echo "0,2,3" > nid
- Node range: echo "0-3" > nid
- Mixed format: echo "0,2-4,7" > nid
- Disable filter: echo "-1" > nid

Link: https://lore.kernel.org/linux-mm/20260417154638.22370-4-zhen.ni@easystack.cn/
Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v2:
- Use nodemask_t instead of int to support multiple nodes
- Use nodelist_parse() to support flexible input formats
  * Single node: "0", "2"
  * Multiple nodes: "0,2,3"
  * Ranges: "0-3"
  * Mixed: "0,2-4,7"
- Use %*pbl format for output (e.g., "0-2", "0,2-4,7")
- Use dynamic memory allocation (kmalloc) to handle variable-length input
- Follow cpuset's max_write_len pattern: (100 + 6 * MAX_NUMNODES)
---
 mm/page_owner.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 6d87b6948cfa..8c13bb3798d8 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -707,6 +707,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 		 * user through copy_to_user() or GFP_KERNEL allocations.
 		 */
 		struct page_owner page_owner_tmp;
+		nodemask_t mask;
 
 		/*
 		 * If the new page is in a new MAX_ORDER_NR_PAGES area,
@@ -730,6 +731,15 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 		if (unlikely(!page_ext))
 			continue;
 
+		/* NUMA node filter using bitmask */
+		mask = READ_ONCE(owner_filter.nid_mask);
+		if (!nodes_empty(mask)) {
+			int nid = page_to_nid(page);
+
+			if (!node_isset(nid, mask))
+				goto ext_put_continue;
+		}
+
 		/*
 		 * Some pages could be missed by concurrent allocation or free,
 		 * because we don't hold the zone lock.
@@ -1009,6 +1019,70 @@ DEFINE_SIMPLE_ATTRIBUTE(page_owner_print_mode_fops,
 			&page_owner_print_mode_get,
 			&page_owner_print_mode_set, "%lld");
 
+static ssize_t nid_filter_write(struct file *file,
+				 const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	nodemask_t mask;
+	int ret;
+
+	/* Limit input size to handle worst-case nodelist (all nodes) */
+	if (count > (100 + 6 * MAX_NUMNODES))
+		return -EINVAL;
+
+	kbuf = kmalloc(count + 1, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	if (copy_from_user(kbuf, buf, count)) {
+		ret = -EFAULT;
+		goto out_free;
+	}
+	kbuf[count] = '\0';
+
+	/* Support: "-1" to clear, or nodelist format like "0", "0,2", "0-3" */
+	if (strcmp(kbuf, "-1\n") == 0 || strcmp(kbuf, "-1") == 0)
+		nodes_clear(mask);
+	else if (nodelist_parse(kbuf, mask)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	WRITE_ONCE(owner_filter.nid_mask, mask);
+	ret = count;
+
+out_free:
+	kfree(kbuf);
+	return ret;
+}
+
+static int nid_filter_show(struct seq_file *m, void *v)
+{
+	nodemask_t mask = READ_ONCE(owner_filter.nid_mask);
+
+	if (nodes_empty(mask))
+		seq_puts(m, "-1\n");
+	else
+		seq_printf(m, "%*pbl\n", nodemask_pr_args(&mask));
+
+	return 0;
+}
+
+static int nid_filter_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nid_filter_show, NULL);
+}
+
+static const struct file_operations nid_filter_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nid_filter_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.write		= nid_filter_write,
+	.release	= single_release,
+};
+
 
 static int __init pageowner_init(void)
 {
@@ -1024,6 +1098,8 @@ static int __init pageowner_init(void)
 	filter_dir = debugfs_create_dir("page_owner_filter", NULL);
 	debugfs_create_file("print_mode", 0600, filter_dir, NULL,
 			    &page_owner_print_mode_fops);
+	debugfs_create_file("nid", 0600, filter_dir, NULL,
+			    &nid_filter_fops);
 
 	dir = debugfs_create_dir("page_owner_stacks", NULL);
 	debugfs_create_file("show_stacks", 0400, dir,
-- 
2.20.1

