From: Zhen Ni
To: akpm@linux-foundation.org, vbabka@kernel.org
Cc: surenb@google.com, mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zhen Ni
Subject: [PATCH v3 0/4] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering
Date: Tue, 28 Apr 2026 15:11:08 +0800
Message-Id: <20260428071112.1420380-1-zhen.ni@easystack.cn>

This patch series adds filtering capabilities to the page_owner feature
to address storage and performance challenges in production environments.

Changes from v2:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
  * nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
  * Direct assignment is safe for this use case
- Add a comment explaining the input length calculation formula
  * 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify the "-1" check by using kstrtoint() instead of two strcmp() calls
- Move the nodemask_t mask read outside the PFN iteration loop for performance
  * Avoids a 128-byte structure copy on each iteration
- Add documentation for the filter features (patch 4/4)

These changes address feedback from the v2 review:
- The AI review tool (sashiko.dev) identified the READ_ONCE/WRITE_ONCE issue with nodemask_t
- Andrew Morton requested documentation for the filter features
- Justification of the input length calculation
- Code simplification using kstrtoint()
- Performance optimization for the mask read

Changes from v1:
- Renamed 'compact' to 'print_mode' with an enum type for better clarity
  * PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
  * PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed the NUMA filter from a single node to a nodelist with bitmask support
  * Uses nodelist_parse() to support "0", "0,2", "0-3" and "0,2-4,7" formats
  * Uses nodemask_t internally for efficient multi-node filtering
  * Output uses the %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
  * Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input

Problem Statement
=================

In production environments with large memory configurations (e.g.,
250GB+), collecting page_owner information often produces files ranging
from several gigabytes to over 10GB. This creates significant
challenges:

1. Storage pressure on production systems
2. Difficulty transferring large files out of production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c

The primary contributor to file size is redundant stack trace
information. Although the kernel already deduplicates stacks via
stackdepot, page_owner retrieves and prints the full stack trace for
every page, only for those traces to be deduplicated again during
post-processing.

Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes), OOM
events are often node-specific rather than system-wide. Currently,
page_owner cannot filter by NUMA node, forcing users to collect and
analyze data for all nodes.

Solution
========

This patch series introduces a flexible filter infrastructure with two
initial filters:

1. **Print Mode Filter**: Outputs only stack handles instead of full
   stack traces. The handle-to-stack mapping can be retrieved from the
   existing show_stacks_handles interface. This significantly reduces
   output size while preserving all allocation metadata.

2. **NUMA Node Filter**: Filters pages by one or more NUMA nodes given
   in the flexible nodelist format, enabling targeted analysis of
   memory issues in NUMA-aware deployments.
Implementation
==============

The series is structured as follows:

- Patch 1: Add filter infrastructure (data structures and debugfs directory)
- Patch 2: Implement the print_mode filter
- Patch 3: Implement the NUMA node filter with nodelist support
- Patch 4: Document the filter features

Usage Example
=============

Enable print_mode and filter for NUMA nodes 0,2-3:

  # cd /sys/kernel/debug/page_owner_filter/
  # echo 1 > print_mode
  # echo "0,2-3" > nid
  # cat /sys/kernel/debug/page_owner > page_owner.txt

Sample print_mode output (showing handles only):

  Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper), ts 0 ns
  PFN 0x40000 type Unmovable Block 512 type Unmovable Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
  handle: 1048577

  Page allocated via order 0, mask 0x252000(__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper), ts 0 ns
  PFN 0x40002 type Unmovable Block 512 type Unmovable Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
  handle: 1048577

Testing
=======

Tested on a system with multiple NUMA nodes. Verified that:
- The filters work independently and in combination
- print_mode output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- The NUMA filter works with single nodes, multiple nodes, and ranges
- The code compiles without warnings or errors (allmodconfig tested)

Example test session:

  # cat print_mode
  0
  # echo "0,1-2" > nid
  # cat nid
  0-2
  # echo "0,2-3" > nid
  # cat nid
  0,2-3
  # echo 1 > print_mode
  # head -n 100 /sys/kernel/debug/page_owner
  [Shows print_mode output with handles only]

Future Enhancements
===================

The filter infrastructure is designed to be extensible.
Potential future filters could include:
- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering

Signed-off-by: Zhen Ni

---
Zhen Ni (4):
  mm/page_owner: add filter infrastructure
  mm/page_owner: add print_mode filter
  mm/page_owner: add NUMA node filter with nodelist support
  mm/page_owner: document page_owner filter features

 Documentation/mm/page_owner.rst |  55 +++++++++++++-
 mm/page_owner.c                 | 130 +++++++++++++++++++++++++++++++-
 2 files changed, 182 insertions(+), 3 deletions(-)

-- 
2.20.1