Linux-mm Archive on lore.kernel.org
* [PATCH v6 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering
@ 2026-05-11  3:30 Zhen Ni
  2026-05-11  3:30 ` [PATCH v6 1/3] mm/page_owner: add print_mode filter Zhen Ni
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Zhen Ni @ 2026-05-11  3:30 UTC (permalink / raw)
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

This patch series introduces filtering capabilities to the page_owner
feature to address storage and performance challenges in production
environments.

Changes from v5:
- Address SeongJae Park's review comments for patch 1/3:
  * Remove unnecessary braces in if/else statement
  * Use stack array instead of kmalloc for input buffer
- Address SeongJae Park's review comments for patch 2/3:
  * Add node validity check using nodes_subset() to reject non-existent nodes
  * Separate variable declaration and statement
  * Use kmalloc_objs() for consistency with kernel patterns
  * Remove the extra 100 bytes of slack from the input length limit
- Add lore links to all previous versions

Changes from v4:
- Optimize nodes_empty() check in page iteration loop
- Add __data_racy qualifier to nid_mask field

Changes from v3:
- Change print_mode from numeric (0/1) to string-based interface
  * Use "full_stack"/"stack_handle" strings instead of numbers
  * Display current mode with bracket notation: "[full_stack] stack_handle"
- Remove "-1" support from NUMA filter
  * Use empty string to clear filter (echo > nid)
- Use strncpy_from_user() instead of copy_from_user()
- Rename nid_filter_fops to page_owner_nid_filter_fops for consistency
- Merge patch 1 (infrastructure) and patch 2 (print_mode) from v3
- Update documentation to match new interface
  * String-based examples
  * Tab indentation in code blocks

Changes from v2:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
  * nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
  * Direct assignment is safe for this use case
- Add comment explaining input length calculation formula
  * 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify "-1" check using kstrtoint() instead of dual strcmp()
- Move nodemask_t mask read outside PFN iteration loop for performance
  * Avoids 128-byte structure copy on each iteration
- Add documentation for filter features (patch 3/3)

Changes from v1:
- Renamed 'compact' to 'print_mode' with enum type for better clarity
  * PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
  * PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed NUMA filter from single node to nodelist with bitmask support
  * Uses nodelist_parse() to support "0", "0,2", "0-3", "0,2-4,7" formats
  * Uses nodemask_t internally for efficient multi-node filtering
  * Output uses %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
  * Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input


Problem Statement
=================

In production environments with large memory configurations (e.g., 250GB+),
collecting page_owner information often results in files ranging from
several gigabytes to over 10GB. This creates significant challenges:

1. Storage pressure on production systems
2. Difficulty transferring large files from production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c

The primary contributor to file size is redundant stack trace
information. While the kernel already deduplicates stacks via
stackdepot, page_owner retrieves and prints the full stack trace for
every page, and the output is then deduplicated again during
post-processing.

Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes),
OOM events are often node-specific rather than system-wide.
Currently, page_owner cannot filter by NUMA node, forcing users to
collect and analyze data for all nodes.

Solution
========

This patch series introduces a flexible filter infrastructure with
two initial filters:

1. **Print Mode Filter**: Outputs only stack handles instead of
   full stack traces. The handle-to-stack mapping can be retrieved
   from the existing show_stacks_handles interface. This dramatically
   reduces output size while preserving all allocation metadata.

2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
   using flexible nodelist format, enabling targeted analysis of memory
   issues in NUMA-aware deployments.

Implementation
==============

The series is structured as follows:

- Patch 1: Implement print_mode filter with string-based interface
  (merges infrastructure + print_mode from v3)
- Patch 2: Implement NUMA node filter with nodelist support
  * v6: Add node validity check to reject non-existent nodes
- Patch 3: Document filter features

Usage Example
=============

Select the stack_handle print mode and restrict output to NUMA nodes 0,2-3:

    # cd /sys/kernel/debug/page_owner_filter/
    # echo stack_handle > print_mode
    # echo "0,2-3" > nid
    # cat /sys/kernel/debug/page_owner > page_owner.txt

Sample print_mode output (showing handles only):

    Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
    Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577

    Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
    __GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
    Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577
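
For illustration only, here is a minimal userspace sketch of the kind of
post-processing that stack_handle mode makes cheap. It tallies how many
records reference each handle, so a full trace only has to be looked up
once per handle in show_stacks_handles. The names count_handles and
struct handle_count are hypothetical, and the parser assumes one
"handle: N" line per record, as in the sample above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for this sketch: one entry per distinct handle. */
struct handle_count {
	unsigned int handle;
	unsigned long records;
};

/*
 * Scan page_owner text for "handle: N" lines and tally how many records
 * reference each handle. Returns the number of distinct handles stored
 * in out[]; handles that do not fit in the table are ignored.
 */
static size_t count_handles(FILE *in, struct handle_count *out, size_t cap)
{
	char line[512];
	size_t n = 0;

	assert(in && out && cap > 0);
	while (fgets(line, sizeof(line), in)) {
		unsigned int h;
		size_t i;

		/* Leading whitespace is skipped; other lines do not match. */
		if (sscanf(line, " handle: %u", &h) != 1)
			continue;

		/* Linear search keeps the sketch short. */
		for (i = 0; i < n && out[i].handle != h; i++)
			;
		if (i == n) {
			if (n == cap)
				continue;	/* table full */
			out[n].handle = h;
			out[n].records = 0;
			n++;
		}
		out[i].records++;
	}
	return n;
}
```

A real tool would more likely use a hash table and weight each record by
its allocation order; the linear table here is only meant to show the idea.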

Testing
=======

Tested on a system with multiple NUMA nodes. Verified that:
- Filters work independently and in combination
- Print_mode output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- NUMA filter works with single node, multiple nodes, and ranges
- String-based interface works correctly ("full_stack"/"stack_handle")
- Empty string clears NUMA filter
- Node validity check correctly rejects non-existent nodes
- Code compiles without warnings or errors (allmodconfig tested)

Example test session:
    # cat print_mode
    [full_stack] stack_handle
    # echo stack_handle > print_mode
    # cat print_mode
    full_stack [stack_handle]
    # echo "0,1-2" > nid
    # cat nid
    0-2
    # echo "0,2-3" > nid
    # cat nid
    0,2-3
    # echo "10" > nid
    -bash: echo: write error: Invalid argument
    # echo > nid
    # cat nid

    (empty - filter cleared)

Future Enhancements
===================

The filter infrastructure is designed to be extensible. Potential
future filters could include:
- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering

v5: https://lore.kernel.org/linux-mm/20260507064643.179187-1-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-1-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-1-zhen.ni@easystack.cn/
v2: https://lore.kernel.org/linux-mm/20260419155540.376847-1-zhen.ni@easystack.cn/
v1: https://lore.kernel.org/linux-mm/20260417154638.22370-1-zhen.ni@easystack.cn/

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Zhen Ni (3):
  mm/page_owner: add print_mode filter
  mm/page_owner: add NUMA node filter with nodelist support
  mm/page_owner: document page_owner filter features

 Documentation/mm/page_owner.rst |  61 ++++++++++-
 mm/page_owner.c                 | 174 +++++++++++++++++++++++++++++++-
 2 files changed, 232 insertions(+), 3 deletions(-)

--
2.20.1



* [PATCH v6 1/3] mm/page_owner: add print_mode filter
  2026-05-11  3:30 [PATCH v6 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering Zhen Ni
@ 2026-05-11  3:30 ` Zhen Ni
  2026-05-11  8:29   ` Oscar Salvador
  2026-05-11  3:30 ` [PATCH v6 2/3] mm/page_owner: add NUMA node filter with nodelist support Zhen Ni
  2026-05-11  3:30 ` [PATCH v6 3/3] mm/page_owner: document page_owner filter features Zhen Ni
  2 siblings, 1 reply; 7+ messages in thread
From: Zhen Ni @ 2026-05-11  3:30 UTC (permalink / raw)
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni, SeongJae Park

Add a print_mode filter to page_owner that allows users to choose between
printing full stack traces or only stack handles, significantly reducing
output size for debugging and analysis.

The filter provides a string-based interface under
/sys/kernel/debug/page_owner_filter/:
- Reading shows the current mode with [] brackets around the active option
- Writing accepts "full_stack" or "stack_handle" strings

The default full_stack mode maintains backward compatibility with existing
usage, displaying complete stack traces for each page allocation.

The stack_handle mode dramatically reduces log size by showing only
the handle number instead of the full stack trace. The mapping from
handles to actual stack traces can be obtained via the
show_stacks_handles interface.
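
As a rough userspace sketch of the interface described above (not the
kernel code itself), the following shows one way to match a written
string against the mode table while tolerating the trailing newline that
echo appends, and to render the bracket notation on read. The names
match_mode, format_modes and the MODE_* constants are hypothetical; the
patch relies on sysfs_match_string() for the matching step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum print_mode { MODE_FULL_STACK, MODE_STACK_HANDLE, MODE_COUNT };

static const char *const mode_names[MODE_COUNT] = {
	[MODE_FULL_STACK]   = "full_stack",
	[MODE_STACK_HANDLE] = "stack_handle",
};

/* Return the mode index for input, ignoring one trailing newline; -1 if unknown. */
static int match_mode(const char *input)
{
	size_t len = strlen(input);

	if (len > 0 && input[len - 1] == '\n')
		len--;
	for (int i = 0; i < MODE_COUNT; i++)
		if (strlen(mode_names[i]) == len &&
		    memcmp(mode_names[i], input, len) == 0)
			return i;
	return -1;
}

/*
 * Write all mode names into buf, with brackets around the active one.
 * Returns the length written, or -1 if the output did not fit.
 */
static int format_modes(char *buf, size_t size, int active)
{
	size_t off = 0;

	assert(active >= 0 && active < MODE_COUNT);
	for (int i = 0; i < MODE_COUNT; i++) {
		int n = snprintf(buf + off, size - off,
				 i == active ? "%s[%s]" : "%s%s",
				 i ? " " : "", mode_names[i]);

		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

Driving both directions from one table means a future mode only needs a
new table entry, which is the main reason to prefer it over hard-coded
output strings.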

Example usage:
  # echo stack_handle > /sys/kernel/debug/page_owner_filter/print_mode
  # cat /sys/kernel/debug/page_owner_filter/print_mode
  full_stack [stack_handle]
  # cat /sys/kernel/debug/page_owner
  Page allocated via order 0, migratetype Unmovable, gfp_mask 0x1100ca,
  pid 1, tgid 1 (systemd), ts 123456789 ns
  PFN 0x1000 type Unmovable Block 1 type Unmovable
  Flags 0x3fffe800000084(referenced|lru|active|private|node=0|zone=1)
  handle: 17432583
  ...

Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v6:
- Remove unnecessary braces in if/else statement (coding style)
- Use stack array (char kbuf[33]) instead of kmalloc for input buffer

Changes in v5:
- No code changes

Changes in v4:
- Change from numeric (0/1) to string-based interface ("full_stack"/"stack_handle")
- Merge infrastructure patch into this patch

Changes in v3:
- No code changes

Changes in v2:
- Renamed from 'compact mode' to 'print_mode' for better clarity
- Use enum values (0=full_stack, 1=stack_handle) instead of boolean
- Update debugfs filename from 'compact' to 'print_mode'

v5: https://lore.kernel.org/linux-mm/20260507064643.179187-2-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-2-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-2-zhen.ni@easystack.cn/
    https://lore.kernel.org/linux-mm/20260428071112.1420380-3-zhen.ni@easystack.cn/
v2: https://lore.kernel.org/linux-mm/20260419155540.376847-2-zhen.ni@easystack.cn/
    https://lore.kernel.org/linux-mm/20260419155540.376847-3-zhen.ni@easystack.cn/
v1: https://lore.kernel.org/linux-mm/20260417154638.22370-2-zhen.ni@easystack.cn/
    https://lore.kernel.org/linux-mm/20260417154638.22370-3-zhen.ni@easystack.cn/
---
 mm/page_owner.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 8178e0be557f..27a412c52d41 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/debugfs.h>
+#include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
@@ -54,6 +55,24 @@ struct stack_print_ctx {
 	u8 flags;
 };
 
+enum page_owner_print_mode {
+	PAGE_OWNER_PRINT_FULL_STACK,
+	PAGE_OWNER_PRINT_STACK_HANDLE,
+};
+
+static const char * const page_owner_print_mode_strings[] = {
+	[PAGE_OWNER_PRINT_FULL_STACK]	= "full_stack",
+	[PAGE_OWNER_PRINT_STACK_HANDLE]	= "stack_handle",
+};
+
+struct page_owner_filter {
+	enum page_owner_print_mode print_mode;
+};
+
+static struct page_owner_filter owner_filter = {
+	.print_mode = PAGE_OWNER_PRINT_FULL_STACK,
+};
+
 static bool page_owner_enabled __initdata;
 DEFINE_STATIC_KEY_FALSE(page_owner_inited);
 
@@ -575,7 +594,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
+	if (READ_ONCE(owner_filter.print_mode) == PAGE_OWNER_PRINT_STACK_HANDLE)
+		ret += scnprintf(kbuf + ret, count - ret,
+				"handle: %d\n", handle);
+	else
+		ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
 	if (ret >= count)
 		goto err;
 
@@ -970,10 +993,60 @@ static int page_owner_threshold_set(void *data, u64 val)
 DEFINE_SIMPLE_ATTRIBUTE(page_owner_threshold_fops, &page_owner_threshold_get,
 			&page_owner_threshold_set, "%llu");
 
+static ssize_t print_mode_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	const char *output;
+	int mode;
+
+	mode = READ_ONCE(owner_filter.print_mode);
+
+	if (mode == PAGE_OWNER_PRINT_FULL_STACK)
+		output = "[full_stack] stack_handle\n";
+	else
+		output = "full_stack [stack_handle]\n";
+
+	return simple_read_from_buffer(buf, count, ppos, output, strlen(output));
+}
+
+static ssize_t print_mode_write(struct file *file,
+				 const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	char kbuf[32 + 1];
+	int mode;
+
+	/*
+	 * Limit input size. Maximum valid input is "stack_handle" (12 chars)
+	 * plus newline and null terminator. Use 32 bytes as a reasonable limit.
+	 */
+	if (count > 32)
+		return -EINVAL;
+
+	if (strncpy_from_user(kbuf, buf, count) < 0)
+		return -EFAULT;
+	kbuf[count] = '\0';
+
+	mode = sysfs_match_string(page_owner_print_mode_strings, kbuf);
+	if (mode < 0)
+		return -EINVAL;
+
+	WRITE_ONCE(owner_filter.print_mode, mode);
+
+	return count;
+}
+
+static const struct file_operations page_owner_print_mode_fops = {
+	.owner = THIS_MODULE,
+	.read = print_mode_read,
+	.write = print_mode_write,
+	.llseek = default_llseek,
+};
+
 
 static int __init pageowner_init(void)
 {
-	struct dentry *dir;
+	struct dentry *dir, *filter_dir;
 
 	if (!static_branch_unlikely(&page_owner_inited)) {
 		pr_info("page_owner is disabled\n");
@@ -981,6 +1054,11 @@ static int __init pageowner_init(void)
 	}
 
 	debugfs_create_file("page_owner", 0400, NULL, NULL, &page_owner_fops);
+
+	filter_dir = debugfs_create_dir("page_owner_filter", NULL);
+	debugfs_create_file("print_mode", 0600, filter_dir, NULL,
+			    &page_owner_print_mode_fops);
+
 	dir = debugfs_create_dir("page_owner_stacks", NULL);
 	debugfs_create_file("show_stacks", 0400, dir,
 			    (void *)(STACK_PRINT_FLAG_STACK |
-- 
2.20.1




* [PATCH v6 2/3] mm/page_owner: add NUMA node filter with nodelist support
  2026-05-11  3:30 [PATCH v6 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering Zhen Ni
  2026-05-11  3:30 ` [PATCH v6 1/3] mm/page_owner: add print_mode filter Zhen Ni
@ 2026-05-11  3:30 ` Zhen Ni
  2026-05-11  8:54   ` Oscar Salvador
  2026-05-11  3:30 ` [PATCH v6 3/3] mm/page_owner: document page_owner filter features Zhen Ni
  2 siblings, 1 reply; 7+ messages in thread
From: Zhen Ni @ 2026-05-11  3:30 UTC (permalink / raw)
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni

Add NUMA node filtering functionality to page_owner to allow filtering
pages by specific NUMA node(s). This is useful for NUMA-aware memory
allocation analysis and debugging.

The filter supports flexible nodelist input formats:
- Single node: echo "0" > nid
- Multiple nodes: echo "0,2,3" > nid
- Node range: echo "0-3" > nid
- Mixed format: echo "0,2-4,7" > nid
- Clear filter: echo > nid (empty string)

The implementation uses nodemask_t for efficient multi-node filtering
and nodelist_parse() for flexible input parsing. Empty input clears
the filter.
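
To make the accepted formats concrete, here is a simplified userspace
sketch of nodelist parsing and range-list formatting. It is an
assumption-laden stand-in, not the kernel implementation: it uses a
64-bit mask instead of nodemask_t, supports only the formats listed
above, and the names parse_nodelist, format_nodelist and
SKETCH_MAX_NODES are hypothetical. The kernel parser accepts more syntax
than this.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_NODES 64	/* hypothetical; the kernel uses MAX_NUMNODES */

/* Parse "0", "0,2", "0-3", "0,2-4,7" or an empty string into a bitmask. */
static int parse_nodelist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	assert(s && mask);
	while (isspace((unsigned char)*s))
		s++;
	while (*s) {
		unsigned long lo, hi;
		char *end;

		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			m |= UINT64_C(1) << n;

		s = end;
		while (isspace((unsigned char)*s))
			s++;
		if (*s == ',') {
			s++;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;	/* dangling comma */
		} else if (*s) {
			return -EINVAL;		/* unexpected character */
		}
	}
	*mask = m;	/* an empty input leaves the mask empty (filter cleared) */
	return 0;
}

/* Render a mask as a range list such as "0,2-4,7"; -1 if it does not fit. */
static int format_nodelist(uint64_t mask, char *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	while (n < SKETCH_MAX_NODES) {
		int first, last, len;

		if (!((mask >> n) & 1)) {
			n++;
			continue;
		}
		first = n;
		while (n + 1 < SKETCH_MAX_NODES && ((mask >> (n + 1)) & 1))
			n++;
		last = n++;
		if (first == last)
			len = snprintf(buf + off, size - off, "%s%d",
				       off ? "," : "", first);
		else
			len = snprintf(buf + off, size - off, "%s%d-%d",
				       off ? "," : "", first, last);
		if (len < 0 || (size_t)len >= size - off)
			return -1;
		off += (size_t)len;
	}
	return (int)off;
}
```

Because the mask is normalized on output, writing "0,1-2" and reading the
file back yields "0-2", which matches the behaviour shown in the cover
letter's test session.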

Note: nid_mask is read and written with plain loads and stores, without
locking, because nodemask_t is too large (128 bytes) for
READ_ONCE()/WRITE_ONCE(). This is acceptable for a debug interface: the
mask changes rarely, and a torn read can only cause a transient
inconsistency in the debug output.

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v6:
- Add node validity check using nodes_subset
  to reject invalid node numbers that don't exist in the system
- Move bool filter_by_nid declaration to top of block
- Use kmalloc_objs instead of kmalloc
- Remove the extra 100 bytes of slack from the input length limit

Changes in v5:
- Optimize nodes_empty() check in page iteration loop
- Add __data_racy qualifier to nid_mask field

Changes in v4:
- Remove "-1" support, use empty string to clear filter
- Use strncpy_from_user() instead of copy_from_user()
- Add concurrency safety documentation for nid_mask access
- Rename fops to page_owner_nid_filter_fops for consistency

Changes in v3:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
  * nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
  * Direct assignment is safe for this use case
- Add comment explaining input length calculation formula
  * 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify "-1" check using kstrtoint() instead of dual strcmp()
- Move nodemask_t mask read outside PFN iteration loop for performance
  * Avoids 128-byte structure copy on each iteration

Changes in v2:
- Use nodemask_t instead of int to support multiple nodes
- Implement nodelist_parse() to support flexible input formats
  * Single node: "0", "2"
  * Multiple nodes: "0,2,3"
  * Ranges: "0-3"
  * Mixed: "0,2-4,7"
- Use %*pbl format for output (e.g., "0-2", "0,2-4,7")
- Use dynamic memory allocation (kmalloc) to handle variable-length input
- Follow cpuset's max_write_len pattern: (100 + 6 * MAX_NUMNODES)

v5: https://lore.kernel.org/linux-mm/20260507064643.179187-3-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-3-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-4-zhen.ni@easystack.cn/
v2: https://lore.kernel.org/linux-mm/20260419155540.376847-4-zhen.ni@easystack.cn/
v1: https://lore.kernel.org/linux-mm/20260417154638.22370-4-zhen.ni@easystack.cn/
---
 mm/page_owner.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 27a412c52d41..8a38005539ff 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -67,10 +67,16 @@ static const char * const page_owner_print_mode_strings[] = {
 
 struct page_owner_filter {
 	enum page_owner_print_mode print_mode;
+	/*
+	 * Lockless access: nodemask_t exceeds READ_ONCE/WRITE_ONCE size limit.
+	 * Torn reads acceptable for debug interface with infrequent writes.
+	 */
+	nodemask_t __data_racy nid_mask;
 };
 
 static struct page_owner_filter owner_filter = {
 	.print_mode = PAGE_OWNER_PRINT_FULL_STACK,
+	.nid_mask = NODE_MASK_NONE,
 };
 
 static bool page_owner_enabled __initdata;
@@ -687,6 +693,8 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 	depot_stack_handle_t handle;
+	nodemask_t mask;
+	bool filter_by_nid;
 
 	if (!static_branch_unlikely(&page_owner_inited))
 		return -EINVAL;
@@ -700,6 +708,9 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
 		pfn++;
 
+	mask = owner_filter.nid_mask;
+	filter_by_nid = !nodes_empty(mask);
+
 	/* Find an allocated page */
 	for (; pfn < max_pfn; pfn++) {
 		/*
@@ -732,6 +743,14 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 		if (unlikely(!page_ext))
 			continue;
 
+		/* NUMA node filter using bitmask */
+		if (filter_by_nid) {
+			int nid = page_to_nid(page);
+
+			if (!node_isset(nid, mask))
+				goto ext_put_continue;
+		}
+
 		/*
 		 * Some pages could be missed by concurrent allocation or free,
 		 * because we don't hold the zone lock.
@@ -1043,6 +1062,77 @@ static const struct file_operations page_owner_print_mode_fops = {
 	.llseek = default_llseek,
 };
 
+static ssize_t nid_filter_write(struct file *file,
+				 const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	nodemask_t mask;
+	int ret;
+
+	/*
+	 * Limit input size to handle worst-case nodelist (all nodes).
+	 * Worst case per node: ",NNNNN" (comma + 5-digit node number) = 6 bytes.
+	 */
+	if (count > (6 * MAX_NUMNODES))
+		return -EINVAL;
+
+	kbuf = kmalloc_objs(*kbuf, count + 1);
+	if (!kbuf)
+		return -ENOMEM;
+
+	if (strncpy_from_user(kbuf, buf, count) < 0) {
+		ret = -EFAULT;
+		goto out_free;
+	}
+	kbuf[count] = '\0';
+
+	/* Support nodelist format like "0", "0,2", "0-3", or empty to clear */
+	if (nodelist_parse(kbuf, mask)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	/* Validate that all specified nodes actually exist in the system */
+	if (!nodes_subset(mask, node_states[N_MEMORY])) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	owner_filter.nid_mask = mask;
+	ret = count;
+
+out_free:
+	kfree(kbuf);
+	return ret;
+}
+
+static int nid_filter_show(struct seq_file *m, void *v)
+{
+	nodemask_t mask = owner_filter.nid_mask;
+
+	if (nodes_empty(mask))
+		seq_puts(m, "\n");
+	else
+		seq_printf(m, "%*pbl\n", nodemask_pr_args(&mask));
+
+	return 0;
+}
+
+static int nid_filter_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nid_filter_show, NULL);
+}
+
+static const struct file_operations page_owner_nid_filter_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nid_filter_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.write		= nid_filter_write,
+	.release	= single_release,
+};
+
 
 static int __init pageowner_init(void)
 {
@@ -1058,6 +1148,8 @@ static int __init pageowner_init(void)
 	filter_dir = debugfs_create_dir("page_owner_filter", NULL);
 	debugfs_create_file("print_mode", 0600, filter_dir, NULL,
 			    &page_owner_print_mode_fops);
+	debugfs_create_file("nid", 0600, filter_dir, NULL,
+			    &page_owner_nid_filter_fops);
 
 	dir = debugfs_create_dir("page_owner_stacks", NULL);
 	debugfs_create_file("show_stacks", 0400, dir,
-- 
2.20.1




* [PATCH v6 3/3] mm/page_owner: document page_owner filter features
  2026-05-11  3:30 [PATCH v6 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering Zhen Ni
  2026-05-11  3:30 ` [PATCH v6 1/3] mm/page_owner: add print_mode filter Zhen Ni
  2026-05-11  3:30 ` [PATCH v6 2/3] mm/page_owner: add NUMA node filter with nodelist support Zhen Ni
@ 2026-05-11  3:30 ` Zhen Ni
  2026-05-11  8:33   ` Oscar Salvador
  2 siblings, 1 reply; 7+ messages in thread
From: Zhen Ni @ 2026-05-11  3:30 UTC (permalink / raw)
  To: akpm, vbabka
  Cc: surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel,
	Zhen Ni, SeongJae Park

Add documentation for the page_owner filter functionality, including:
- Print mode filter (full stack vs stack handle)
- NUMA node filter (single node, multiple nodes, ranges)
- Usage examples for both filters

Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Changes in v6:
- No code changes

Changes in v5:
- No code changes

Changes in v4:
- Update print_mode documentation to reflect string-based interface
  * Change from "0/1" to "full_stack"/"stack_handle"
  * Add bracket notation example: "[full_stack] stack_handle"
- Update NUMA filter documentation
  * Remove "-1" example
  * Add empty string as clear method
- Fix indentation: use tabs instead of spaces in code examples

Changes in v3:
- New patch to document filter features as requested by Andrew Morton

v5: https://lore.kernel.org/linux-mm/20260507064643.179187-4-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-4-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-5-zhen.ni@easystack.cn/
---
 Documentation/mm/page_owner.rst | 61 ++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst
index 6b12f3b007ec..178bacfbb3fd 100644
--- a/Documentation/mm/page_owner.rst
+++ b/Documentation/mm/page_owner.rst
@@ -74,7 +74,17 @@ Usage
 
 3) Do the job that you want to debug.
 
-4) Analyze information from page owner::
+4) (Optional) Use filters to focus on specific memory allocations::
+
+	cd /sys/kernel/debug/page_owner_filter
+
+	# Print only stack handles instead of full traces
+	echo stack_handle > print_mode
+
+	# Filter by NUMA nodes
+	echo "0,2-3" > nid
+
+5) Analyze information from page owner::
 
 	cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt
 	cat stacks.txt
@@ -238,6 +248,55 @@ Usage
 				./page_owner_sort <input> <output> --tgid=1,2,3
 				./page_owner_sort <input> <output> --name name1,name2
 
+Page Owner Filters
+==================
+
+The page_owner feature provides filtering capabilities to focus on specific
+memory allocations (e.g., by NUMA node). Filters are controlled through debugfs
+files in ``/sys/kernel/debug/page_owner_filter/``.
+
+Print Mode Filter
+-----------------
+
+The ``print_mode`` file controls the level of detail in stack trace output.
+
+Available modes:
+
+- ``full_stack`` (default): Print full stack traces
+- ``stack_handle``: Print only stack handles
+
+Reading the file shows the current mode with brackets around the active option::
+
+	cat /sys/kernel/debug/page_owner_filter/print_mode
+	[full_stack] stack_handle
+
+The ``stack_handle`` mode significantly reduces output size. Instead of full
+stack traces, it prints only the handle number::
+
+	Page allocated via order 0, mask 0x42800(GFP_NOWAIT|__GFP_COMP),
+	pid 1, tgid 1 (systemd), ts 349667370 ns
+	PFN 0xa00a2 type Unmovable Block 1280 type Unmovable
+	Flags 0x33fffe0000004124(...)
+	handle: 17432583
+
+To retrieve the full stack trace for a handle, use::
+
+	cat /sys/kernel/debug/page_owner_stacks/show_stacks_handles
+
+NUMA Node Filter
+----------------
+
+The ``nid`` file filters pages by NUMA node. This is useful for NUMA-aware
+environments to analyze node-specific memory allocation.
+
+Supported input formats:
+
+- Single node: ``echo "2" > nid``
+- Multiple nodes: ``echo "0,2,3" > nid``
+- Node range: ``echo "0-3" > nid``
+- Mixed format: ``echo "0,2-4,7" > nid``
+- Clear filter: ``echo > nid`` (empty string)
+
 STANDARD FORMAT SPECIFIERS
 ==========================
 ::
-- 
2.20.1




* Re: [PATCH v6 1/3] mm/page_owner: add print_mode filter
  2026-05-11  3:30 ` [PATCH v6 1/3] mm/page_owner: add print_mode filter Zhen Ni
@ 2026-05-11  8:29   ` Oscar Salvador
  0 siblings, 0 replies; 7+ messages in thread
From: Oscar Salvador @ 2026-05-11  8:29 UTC (permalink / raw)
  To: Zhen Ni
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	linux-kernel, SeongJae Park

On Mon, May 11, 2026 at 11:30:15AM +0800, Zhen Ni wrote:
> Add a print_mode filter to page_owner that allows users to choose between
> printing full stack traces or only stack handles, significantly reducing
> output size for debugging and analysis.
> 
> The filter provides a string-based interface under
> /sys/kernel/debug/page_owner_filter/:
> - Reading shows the current mode with [] brackets around active option
> - Writing accepts "full_stack" or "stack_handle" strings
> 
> The default full_stack mode maintains backward compatibility with existing
> usage, displaying complete stack traces for each page allocation.
> 
> The stack_handle mode dramatically reduces log size by showing only
> the handle number instead of the full stack trace. The mapping from
> handles to actual stack traces can be obtained via the
> show_stacks_handles interface.
> 
> Example usage:
>   # echo stack_handle > /sys/kernel/debug/page_owner_filter/print_mode
>   # cat /sys/kernel/debug/page_owner_filter/print_mode
>   full_stack [stack_handle]
>   # cat /sys/kernel/debug/page_owner
>   Page allocated via order 0, migratetype Unmovable, gfp_mask 0x1100ca,
>   pid 1, tgid 1 (systemd), ts 123456789 ns
>   PFN 0x1000 type Unmovable Block 1 type Unmovable
>   Flags 0x3fffe800000084(referenced|lru|active|private|node=0|zone=1)
>   handle: 17432583
>   ...
> 
> Reviewed-by: SeongJae Park <sj@kernel.org>
> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>

Overall looks good to me, one comment below

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
...
> ---
>  mm/page_owner.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 80 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 8178e0be557f..27a412c52d41 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/debugfs.h>
> +#include <linux/fs.h>

Why do we need this?


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v6 3/3] mm/page_owner: document page_owner filter features
  2026-05-11  3:30 ` [PATCH v6 3/3] mm/page_owner: document page_owner filter features Zhen Ni
@ 2026-05-11  8:33   ` Oscar Salvador
  0 siblings, 0 replies; 7+ messages in thread
From: Oscar Salvador @ 2026-05-11  8:33 UTC (permalink / raw)
  To: Zhen Ni
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	linux-kernel, SeongJae Park

On Mon, May 11, 2026 at 11:30:17AM +0800, Zhen Ni wrote:
> Add documentation for the page_owner filter functionality, including:
> - Print mode filter (full stack vs stack handle)
> - NUMA node filter (single node, multiple nodes, ranges)
> - Usage examples for both filters
> 
> Reviewed-by: SeongJae Park <sj@kernel.org>
> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v6 2/3] mm/page_owner: add NUMA node filter with nodelist support
  2026-05-11  3:30 ` [PATCH v6 2/3] mm/page_owner: add NUMA node filter with nodelist support Zhen Ni
@ 2026-05-11  8:54   ` Oscar Salvador
  0 siblings, 0 replies; 7+ messages in thread
From: Oscar Salvador @ 2026-05-11  8:54 UTC (permalink / raw)
  To: Zhen Ni
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	linux-kernel

On Mon, May 11, 2026 at 11:30:16AM +0800, Zhen Ni wrote:
> Add NUMA node filtering functionality to page_owner to allow filtering
> pages by specific NUMA node(s). This is useful for NUMA-aware memory
> allocation analysis and debugging.
> 
> The filter supports flexible nodelist input formats:
> - Single node: echo "0" > nid
> - Multiple nodes: echo "0,2,3" > nid
> - Node range: echo "0-3" > nid
> - Mixed format: echo "0,2-4,7" > nid
> - Clear filter: echo > nid (empty string)
> 
> The implementation uses nodemask_t for efficient multi-node filtering
> and nodelist_parse() for flexible input parsing. Empty input clears
> the filter.
> 
> Note: Access to nid_mask uses plain load/store without locking because
> nodemask_t is too large (128 bytes) for READ_ONCE/WRITE_ONCE. This is
> safe for debug use: low-frequency changes and torn reads would only
> cause temporary inconsistency in debug output.
> 
> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
> ---
...
> ---
>  mm/page_owner.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 27a412c52d41..8a38005539ff 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
...
> @@ -700,6 +708,9 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  	while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
>  		pfn++;
>  
> +	mask = owner_filter.nid_mask;
> +	filter_by_nid = !nodes_empty(mask);
> +
>  	/* Find an allocated page */
>  	for (; pfn < max_pfn; pfn++) {
>  		/*
> @@ -732,6 +743,14 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  		if (unlikely(!page_ext))
>  			continue;
>  
> +		/* NUMA node filter using bitmask */
> +		if (filter_by_nid) {

This comment is kinda pointless because it explains something that the
code already makes quite clear.
Either drop it, or just go with "NUMA node filter"; "using bitmask"
does not really add much.


> +			int nid = page_to_nid(page);
> +
> +			if (!node_isset(nid, mask))
> +				goto ext_put_continue;
> +		}
> +
>  		/*
>  		 * Some pages could be missed by concurrent allocation or free,
>  		 * because we don't hold the zone lock.
> @@ -1043,6 +1062,77 @@ static const struct file_operations page_owner_print_mode_fops = {
>  	.llseek = default_llseek,
>  };
>  
> +static ssize_t nid_filter_write(struct file *file,
> +				 const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	char *kbuf;
> +	nodemask_t mask;
> +	int ret;
> +
> +	/*
> +	 * Limit input size to handle worst-case nodelist (all nodes).
> +	 * Worst case per node: ",NNNNN" (comma + 5-digit node number) = 6 bytes.
> +	 */
> +	if (count > (6 * MAX_NUMNODES))
> +		return -EINVAL;
> +
> +	kbuf = kmalloc_objs(*kbuf, count + 1);
> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	if (strncpy_from_user(kbuf, buf, count) < 0) {
> +		ret = -EFAULT;
> +		goto out_free;
> +	}
> +	kbuf[count] = '\0';
> +
> +	/* Support nodelist format like "0", "0,2", "0-3", or empty to clear */
> +	if (nodelist_parse(kbuf, mask)) {
> +		ret = -EINVAL;
> +		goto out_free;
> +	}

nodelist_parse() can also return other return values besides EINVAL.
Something like

 ret = nodelist_parse(...)
 if (ret < 0)
    return ret

might be cleaner.
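
A userspace sketch of that pattern, under the assumption that the buffer
allocated earlier in the function still has to be freed: a bare return
at that point would skip the cleanup, so the sketch jumps to the existing
label instead while keeping the parser's own error code. parse_input and
store_value are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser that can fail with more than one error code. */
static int parse_input(const char *s, unsigned long *out)
{
	char *end;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	*out = strtoul(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end != '\0')
		return -EINVAL;
	return 0;
}

/* Copy the input, parse it, and store the result; one exit path frees the copy. */
static long store_value(const char *input, size_t count, unsigned long *dst)
{
	unsigned long val;
	char *kbuf;
	long ret;

	assert(input && dst);
	kbuf = malloc(count + 1);
	if (!kbuf)
		return -ENOMEM;
	memcpy(kbuf, input, count);
	kbuf[count] = '\0';

	ret = parse_input(kbuf, &val);
	if (ret < 0)
		goto out_free;	/* propagate the parser's error unchanged */

	*dst = val;
	ret = (long)count;

out_free:
	free(kbuf);
	return ret;
}
```

The caller then sees the specific failure (for example a range error)
rather than a blanket -EINVAL, and the destination is left untouched on
any error.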

> +
> +	/* Validate that all specified nodes actually exist in the system */
> +	if (!nodes_subset(mask, node_states[N_MEMORY])) {
> +		ret = -EINVAL;
> +		goto out_free;
> +	}

Ok, I get that since you want to filter allocations by numa nodes, you
want to make sure that those nodes have memory.
Although that might change due to concurrent memory-hotplug operations,
but that is a different story.

I do not like the comment though, because other nodes can exist in the
system with no memory (e.g. memoryless nodes that only have CPUs, or
neither), so I would make that clearer:

"
  /* 
   * We want to filter memory allocations by numa nodes, so make sure
   * that the specified nodes have memory.
   */
"

or something along those lines.


> +
> +	owner_filter.nid_mask = mask;
> +	ret = count;
> +
> +out_free:
> +	kfree(kbuf);
> +	return ret;
> +}
> +
> +static int nid_filter_show(struct seq_file *m, void *v)
> +{
> +	nodemask_t mask = owner_filter.nid_mask;
> +
> +	if (nodes_empty(mask))
> +		seq_puts(m, "\n");
> +	else
> +		seq_printf(m, "%*pbl\n", nodemask_pr_args(&mask));

Isn't nodemask_pr_args() clever enough to not print anything, or to
print "0", if the mask is NODE_MASK_NONE?


-- 
Oscar Salvador
SUSE Labs


