From: Zhen Ni <zhen.ni@easystack.cn>
To: akpm@linux-foundation.org, vbabka@kernel.org
Cc: surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Zhen Ni <zhen.ni@easystack.cn>
Subject: [PATCH v6 0/3] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering
Date: Mon, 11 May 2026 11:30:14 +0800
Message-ID: <20260511033017.747781-1-zhen.ni@easystack.cn>

This patch series introduces filtering capabilities to the page_owner
feature to address storage and performance challenges in production
environments.

Changes from v5:
- Address SeongJae Park's review comments for patch 1/3:
  * Remove unnecessary braces in if/else statement
  * Use stack array instead of kmalloc for input buffer
- Address SeongJae Park's review comments for patch 2/3:
  * Add node validity check using nodes_subset() to reject non-existent nodes
  * Separate variable declaration and statement
  * Use kmalloc_objs() for consistency with kernel patterns
  * Remove the 100-byte overhead from the input length limit
- Add lore links to all previous versions

Changes from v4:
- Optimize nodes_empty() check in page iteration loop
- Add __data_racy qualifier to nid_mask field

Changes from v3:
- Change print_mode from numeric (0/1) to string-based interface
  * Use "full_stack"/"stack_handle" strings instead of numbers
  * Display current mode with bracket notation: "[full_stack] stack_handle"
- Remove "-1" support from NUMA filter
  * Use empty string to clear filter (echo > nid)
- Use strncpy_from_user() instead of copy_from_user()
- Rename nid_filter_fops to page_owner_nid_filter_fops for consistency
- Merge patch 1 (infrastructure) and patch 2 (print_mode) from v3
- Update documentation to match new interface
  * String-based examples
  * Tab indentation in code blocks

Changes from v2:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
  * nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
  * Direct assignment is safe for this use case
- Add comment explaining input length calculation formula
  * 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify "-1" check using kstrtoint() instead of dual strcmp()
- Move nodemask_t mask read outside PFN iteration loop for performance
  * Avoids 128-byte structure copy on each iteration
- Add documentation for filter features (patch 3/3)
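
The ",NNNNN" accounting above can be sanity-checked with a small userspace
calculation. This is only a sketch and not code from the series: the
function names are made up for the example, and the value 1024 used in the
note below is an assumption derived from the 128-byte nodemask_t mentioned
above (128 * 8 bits). It compares the length of the longest list of distinct
nodes written without ranges ("0,1,2,...") against the 6-bytes-per-node
bound.

```c
#include <stddef.h>

/*
 * Length in bytes (without the trailing NUL) of the list
 * "0,1,2,...,(max_nodes - 1)", i.e. every node spelled out individually.
 */
size_t sketch_full_nodelist_len(unsigned int max_nodes)
{
	size_t len = 0;

	for (unsigned int n = 0; n < max_nodes; n++) {
		unsigned int v = n;

		if (n)
			len++;		/* comma separating this entry */
		do {			/* one byte per decimal digit */
			len++;
			v /= 10;
		} while (v);
	}
	return len;
}

/* The bound from the changelog: 6 bytes (",NNNNN") per possible node. */
size_t sketch_nodelist_bound(unsigned int max_nodes)
{
	return 6 * (size_t)max_nodes;
}
```

With 1024 possible nodes the fully spelled-out list needs 4009 bytes, while
the bound reserves 6144, so the bound holds with room to spare for node
numbers of up to five digits.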

Changes from v1:
- Renamed 'compact' to 'print_mode' with enum type for better clarity
  * PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
  * PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed NUMA filter from single node to nodelist with bitmask support
  * Uses nodelist_parse() to support "0", "0,2", "0-3", "0,2-4,7" formats
  * Uses nodemask_t internally for efficient multi-node filtering
  * Output uses %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
  * Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input
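
For readers unfamiliar with the range-list output mentioned above, the
following userspace sketch shows how a bitmask maps to strings such as "0-2"
or "0,2-4,7". It is an illustration only and does not reproduce the kernel's
%*pbl implementation; the function name and the 64-bit mask (instead of
nodemask_t) are assumptions made for the example.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format the set bits of 'mask' as a comma-separated list of ranges.
 * Returns the string length, or -1 if 'buf' is too small.
 */
int sketch_format_nodelist(uint64_t mask, char *buf, size_t size)
{
	size_t off = 0;
	int bit = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	while (bit < 64) {
		int start, end, n;

		if (!((mask >> bit) & 1)) {
			bit++;
			continue;
		}
		/* Found the start of a run; advance to one past its end. */
		start = bit;
		while (bit < 64 && ((mask >> bit) & 1))
			bit++;
		end = bit - 1;

		if (start == end)
			n = snprintf(buf + off, size - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, size - off, "%s%d-%d",
				     off ? "," : "", start, end);
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

A mask with bits 0, 1 and 2 set comes out as "0-2", which matches what the
nid file reports after writing "0,1-2" in the test session further down.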

Problem Statement
=================

In production environments with large memory configurations (e.g., 250GB+),
collecting page_owner information often results in files ranging from
several gigabytes to over 10GB. This creates significant challenges:

1. Storage pressure on production systems
2. Difficulty transferring large files from production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c

The primary contributor to file size is redundant stack trace
information. While the kernel already deduplicates stacks via
stackdepot, page_owner expands the full stack trace into its output
for every page record, only for the traces to be deduplicated again
during post-processing.

Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes),
OOM events are often node-specific rather than system-wide.
Currently, page_owner cannot filter by NUMA node, forcing users to
collect and analyze data for all nodes.

Solution
========

This patch series introduces a flexible filter infrastructure with
two initial filters:

1. **Print Mode Filter**: Outputs only stack handles instead of
   full stack traces. The handle-to-stack mapping can be retrieved
   from the existing show_stacks_handles interface. This dramatically
   reduces output size while preserving all allocation metadata.

2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
   using a flexible nodelist format, enabling targeted analysis of memory
   issues in NUMA-aware deployments.
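
To make the nodelist format concrete, here is a simplified userspace sketch
of turning a string such as "0,2-4,7" into a bitmask. It is not the kernel's
nodelist_parse(), which accepts more syntax than shown here; the function
name, the 64-node limit and the plain 64-bit mask are assumptions made for
the example.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 64	/* assumption: the mask fits one 64-bit word */

/*
 * Parse "N", "N,M", "N-M" or combinations such as "0,2-4,7" into a bitmask.
 * An empty string (or a lone newline, as written by "echo >") yields an
 * empty mask. Returns 0 on success or -EINVAL on malformed input.
 */
int sketch_nodelist_parse(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	if (s[0] == '\0' || (s[0] == '\n' && s[1] == '\0')) {
		*mask = 0;
		return 0;
	}

	for (;;) {
		char *end;
		unsigned long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		lo = strtoul(s, &end, 10);
		hi = lo;
		s = end;

		if (*s == '-') {		/* range "lo-hi" */
			s++;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;
			hi = strtoul(s, &end, 10);
			s = end;
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			m |= UINT64_C(1) << n;

		if (*s != ',')
			break;
		s++;				/* next list entry */
	}

	if (*s == '\n')				/* tolerate echo's newline */
		s++;
	if (*s != '\0')
		return -EINVAL;

	*mask = m;
	return 0;
}
```

For example, "0,2-4,7" sets bits 0, 2, 3, 4 and 7 (0x9D), and a reversed
range such as "3-1" is rejected.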

Implementation
==============

The series is structured as follows:

- Patch 1: Implement print_mode filter with string-based interface
  (merges infrastructure + print_mode from v3)
- Patch 2: Implement NUMA node filter with nodelist support
  * v6: Add node validity check to reject non-existent nodes
- Patch 3: Document filter features
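
The idea behind the v6 validity check in patch 2 can be sketched in
userspace as a subset test on bitmasks. This is an illustration under
assumptions, not the patch itself: the function names are invented, a 64-bit
word stands in for nodemask_t, and which node set the kernel code compares
against is not shown here, so 'existing' is simply a parameter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* True when every bit set in 'requested' is also set in 'allowed'. */
bool sketch_mask_subset(uint64_t requested, uint64_t allowed)
{
	return (requested & ~allowed) == 0;
}

/*
 * Accept a requested filter only if it names existing nodes.
 * Returns 0 to accept or -EINVAL to reject. An empty request is a
 * subset of anything, so clearing the filter is always accepted.
 */
int sketch_validate_nid_filter(uint64_t requested, uint64_t existing)
{
	return sketch_mask_subset(requested, existing) ? 0 : -EINVAL;
}
```

On a machine with nodes 0-3 (mask 0xF), a request for node 10 fails the
subset test, which corresponds to the "Invalid argument" write error in the
test session below.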

Usage Example
=============

Switch print_mode to stack_handle and restrict output to NUMA nodes 0,2-3:

  # cd /sys/kernel/debug/page_owner_filter/
  # echo stack_handle > print_mode
  # echo "0,2-3" > nid
  # cat /sys/kernel/debug/page_owner > page_owner.txt
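
A collection tool could apply the same settings programmatically. The sketch
below is a userspace illustration only; the helper names are invented, and
the directory is passed in as a parameter so nothing beyond the print_mode
and nid file names shown above is assumed about the interface.

```c
#include <errno.h>
#include <stdio.h>

/* Write 'value' plus a newline to 'path', like "echo value > path". */
int sketch_write_setting(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fprintf(f, "%s\n", value) < 0)
		err = -EIO;
	/* Buffered data is flushed here, so a rejected write shows up now. */
	if (fclose(f) != 0 && err == 0)
		err = -errno;
	return err;
}

/* Build "dir/name" into 'out'; fails if the result would be truncated. */
static int sketch_join(char *out, size_t size, const char *dir,
		       const char *name)
{
	int n = snprintf(out, size, "%s/%s", dir, name);

	return (n < 0 || (size_t)n >= size) ? -ENAMETOOLONG : 0;
}

/* Set both filters under 'dir'; returns 0 or a negative errno value. */
int sketch_configure_filters(const char *dir, const char *mode,
			     const char *nodes)
{
	char path[4096];
	int err;

	err = sketch_join(path, sizeof(path), dir, "print_mode");
	if (!err)
		err = sketch_write_setting(path, mode);
	if (!err)
		err = sketch_join(path, sizeof(path), dir, "nid");
	if (!err)
		err = sketch_write_setting(path, nodes);
	return err;
}
```

Calling sketch_configure_filters("/sys/kernel/debug/page_owner_filter",
"stack_handle", "0,2-3") would mirror the two echo commands above.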

Sample print_mode output (showing handles only):

  Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
  ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
  Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
  handle: 1048577

  Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
  __GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
  ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
  Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
  handle: 1048577
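
Because each record now ends in a single "handle:" line, post-processing can
group records by handle without comparing stack text. The following
userspace sketch tallies records per handle; it is an illustration only, and
the struct, the function name and the fixed-size table are assumptions made
for the example rather than part of tools/mm/page_owner_sort.c.

```c
#include <stdlib.h>
#include <string.h>

struct sketch_handle_count {
	unsigned long handle;	/* value from a "handle: N" line */
	unsigned long records;	/* number of records with that handle */
};

/*
 * Scan 'text' line by line and count "handle: N" lines per handle.
 * Returns the number of distinct handles, or -1 if 'tab' is too small.
 */
int sketch_count_handles(const char *text, struct sketch_handle_count *tab,
			 int cap)
{
	static const char key[] = "handle:";
	const char *line = text;
	int used = 0;

	while (line && *line) {
		const char *nl = strchr(line, '\n');
		const char *p = line;

		while (*p == ' ' || *p == '\t')	/* allow indented output */
			p++;
		if (strncmp(p, key, sizeof(key) - 1) == 0) {
			unsigned long h = strtoul(p + sizeof(key) - 1, NULL, 10);
			int i;

			for (i = 0; i < used && tab[i].handle != h; i++)
				;
			if (i == used) {	/* first time we see this handle */
				if (used == cap)
					return -1;
				tab[used].handle = h;
				tab[used].records = 0;
				used++;
			}
			tab[i].records++;
		}
		line = nl ? nl + 1 : NULL;
	}
	return used;
}
```

Run over the two sample records above, this reports one distinct handle
(1048577) with two records.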

Testing
=======

Tested on a system with multiple NUMA nodes. Verified that:

- Filters work independently and in combination
- print_mode output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- NUMA filter works with single node, multiple nodes, and ranges
- String-based interface works correctly ("full_stack"/"stack_handle")
- Empty string clears NUMA filter
- Node validity check correctly rejects non-existent nodes
- Code compiles without warnings or errors (allmodconfig tested)
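
The "filters disabled" and "empty string clears NUMA filter" cases both rely
on an empty mask meaning "show everything". A userspace sketch of that
behaviour follows, including the single mask read outside the page loop
described in the changelog. It is an illustration only: the function name,
the array of per-page node ids and the 64-bit mask are assumptions and do
not reflect the actual page iteration code in mm/page_owner.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the pages that would be printed for a given node filter.
 * 'page_nid[i]' is the (hypothetical) node id of page i.
 */
size_t sketch_count_matching(const int *page_nid, size_t npages,
			     const uint64_t *shared_mask)
{
	/* Read the shared filter once, before the loop, not per page. */
	const uint64_t mask = *shared_mask;
	/* An empty mask means the filter is off: every page passes. */
	const bool filter_on = mask != 0;
	size_t shown = 0;

	for (size_t i = 0; i < npages; i++) {
		int nid = page_nid[i];

		if (filter_on) {
			if (nid < 0 || nid >= 64)
				continue;	/* outside the sketch's mask */
			if (!((mask >> nid) & 1))
				continue;	/* node not selected */
		}
		shown++;
	}
	return shown;
}
```

With pages on nodes {0, 1, 2, 3, 0, 2}, an empty mask counts all six pages,
while a mask selecting nodes 0,2-3 counts five.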

Example test session:

  # cat print_mode
  [full_stack] stack_handle
  # echo stack_handle > print_mode
  # cat print_mode
  full_stack [stack_handle]
  # echo "0,1-2" > nid
  # cat nid
  0-2
  # echo "0,2-3" > nid
  # cat nid
  0,2-3
  # echo "10" > nid
  -bash: echo: write error: Invalid argument
  # echo > nid
  # cat nid
  (empty - filter cleared)
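
The print_mode part of the session above can be modelled in userspace as
follows. This is a sketch for illustration, not the code in patch 1/3: the
enum, table and function names are invented, and only the two mode strings
and the bracket notation are taken from this cover letter.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum sketch_print_mode {
	SKETCH_FULL_STACK,
	SKETCH_STACK_HANDLE,
	SKETCH_NR_MODES
};

static const char *const sketch_mode_names[SKETCH_NR_MODES] = {
	"full_stack",
	"stack_handle",
};

/* Accept exactly one of the mode names, with an optional newline. */
int sketch_print_mode_parse(const char *input, enum sketch_print_mode *mode)
{
	size_t len = strlen(input);

	if (len && input[len - 1] == '\n')
		len--;
	for (int i = 0; i < SKETCH_NR_MODES; i++) {
		if (strlen(sketch_mode_names[i]) == len &&
		    strncmp(input, sketch_mode_names[i], len) == 0) {
			*mode = (enum sketch_print_mode)i;
			return 0;
		}
	}
	return -EINVAL;		/* unknown names leave *mode untouched */
}

/* List all modes, with the active one in brackets. */
int sketch_print_mode_show(enum sketch_print_mode mode, char *buf,
			   size_t size)
{
	size_t off = 0;

	for (int i = 0; i < SKETCH_NR_MODES; i++) {
		int active = (i == (int)mode);
		int n = snprintf(buf + off, size - off, "%s%s%s%s",
				 i ? " " : "", active ? "[" : "",
				 sketch_mode_names[i], active ? "]" : "");

		if (n < 0 || (size_t)n >= size - off)
			return -ENOSPC;
		off += (size_t)n;
	}
	return (int)off;
}
```

Starting from SKETCH_FULL_STACK, showing the mode gives
"[full_stack] stack_handle"; after parsing "stack_handle" it gives
"full_stack [stack_handle]", and a numeric input such as "1" is rejected.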

Future Enhancements
===================

The filter infrastructure is designed to be extensible. Potential
future filters could include:

- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering

v5: https://lore.kernel.org/linux-mm/20260507064643.179187-1-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-1-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-1-zhen.ni@easystack.cn/
v2: https://lore.kernel.org/linux-mm/20260419155540.376847-1-zhen.ni@easystack.cn/
v1: https://lore.kernel.org/linux-mm/20260417154638.22370-1-zhen.ni@easystack.cn/

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---
Zhen Ni (3):
  mm/page_owner: add print_mode filter
  mm/page_owner: add NUMA node filter with nodelist support
  mm/page_owner: document page_owner filter features

 Documentation/mm/page_owner.rst |  61 ++++++++++-
 mm/page_owner.c                 | 174 +++++++++++++++++++++++++++++++-
 2 files changed, 232 insertions(+), 3 deletions(-)

--
2.20.1