KSM: performance optimizations for rmap_walk

* KSM: performance optimizations for rmap_walk_ksm
@ 2026-06-29  9:41 xu.xin16
  2026-06-29  9:42 ` [PATCH v10 1/3] ksm: add linear_page_index into ksm_rmap_item xu.xin16
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: xu.xin16 @ 2026-06-29  9:41 UTC (permalink / raw)
  To: akpm, david
  Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
	xu.xin16

From: xu xin <xu.xin16@zte.com.cn>

This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when many unrelated VMAs share a
single anon_vma.

Two key highlights:

1. Lock hold time drops from >500ms to <2ms
   - In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
     anon_vma lock hold time during KSM rmap walk went from 705ms
     down to 1.67ms (max) and 1.44ms (avg).

2. Real user impact
   - The anon_vma lock is also acquired by page faults, reclaim,
     migration, compaction, mlock, exit_mmap, and cgroup accounting.

   - A long hold due to inefficient rmap walks stalls application
     threads, causing latency spikes, reduced throughput, or even
     container timeouts.

   - The problem occurs even without fork() – VMA splitting (e.g.,
     via mprotect or madvise over time) can create tens of thousands
     of VMAs all attached to the same anon_vma.

Real-world examples:

 - JVM / Go runtime: These use mmap for heap regions and later call
mprotect(PROT_NONE) for garbage collection barriers or guard pages,
splitting the original VMA into thousands of small pieces over time.

 - Database engines (MySQL, PostgreSQL): Large shared memory buffers
or anonymous mappings are managed with madvise(MADV_DONTNEED) to release
specific pages, which also splits VMAs.

* Why the benchmark numbers are realistic: We observed ~20,000 VMAs sharing
one anon_vma on a production system running a Java application with KSM
enabled. The lock hold time before the patch was measured at 228 ms (max)
during rmap walks triggered by memory compaction and page migration.
The benchmark reproduces that VMA count and lock‑hold behavior in a
controlled environment.

For systems that do not have thousands of VMAs per anon_vma, the
patch adds negligible overhead (a single pgoff comparison). For systems
that do suffer from this issue, the improvement is dramatic:
1) Worst-case anon_vma lock hold time drops from hundreds of milliseconds
   to under 2 ms.
2) This directly reduces blocking of parallel operations that need the
   same lock: page faults, reclaim, migration, compaction, mlock, and
   exit_mmap.

End‑users will see lower tail latency (fewer application stalls),
higher throughput under memory pressure, and no more spurious
lockup warnings or container timeouts caused by excessive lock hold
times.

In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock
hold times, which translates directly into a more responsive system.

Change Log
==========
Changes in v10:
Only update patch 3/3 according to the suggestion:
https://lore.kernel.org/all/68da8183-dbe7-40d3-b6e6-43c3f12767e2@kernel.org/


Changes in v9:
1. For patch 1 & 2, some update of commit description and code comments.
2. For patch 3, fix according to the suggestion:
https://lore.kernel.org/all/eff26d08-d824-4459-858e-277760fcb4a2@kernel.org/


Changes in v8:
1. Suggested by David:

  * Drop the tracepoint and testbench patches and leave them as OOT patches.

  * Rename pgoff into linear_page_index and update the corresponding commit
  description.

2. Fix an issue flagged by AI review: use the process's own KSM counters instead of the global KSM counters.
https://sashiko.dev/#/patchset/20260530165907829ZSDzDdMc110MnOflRzf9P@zte.com.cn


Changes in v7:
Mainly to fix some issues AI review points out at:
https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn

We have corrected all of those possible flaws according to the AI's useful suggestions.
- Patch 2: There are mainly 3 changes as follows.
	(1) Use COMM-PID filtering during trace parsing to precisely match the right
	    events.
	(2) Graceful handling of a single NUMA node. trigger_rmap_walk() no longer calls
	    exit(1) when no other NUMA node is available. It returns an error, allowing
	    the caller to clean up (disable tracepoints, restore KSM config) before exiting.
	(3) Fair comparison for anonymous / file tests with KSM. Anonymous and file-backed tests now
	    use fork() to create thousands of child processes, each sharing the same physical
	    page via copy-on-write (or MAP_SHARED). This ensures that for all three page types
	    the latency measurement is based on a single physical page mapped by many VMAs (≈ NR_SHARERS).

- Patch 6: There are mainly 3 changes as follows.
	(1) Fix mapping size tracking after mremap and protect the original pointer on failure.
	(2) Use baseline delta comparison to eliminate interference from global KSM counters.
	(3) Fix error-code confusion caused by pread/close interactions.



Changes in v6:
- Patch 1: Defining a single event class once and instantiating the individual
	   tracepoints with DEFINE_EVENT, as AI said:
	https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@zte.com.cn

- Patch 2: Suggested-by AI below, three useful changes are done:
	(1) Safe event pairing – Now stores folio and rwc addresses for rmap_walk_start
	    and matches with the same addresses in rmap_walk_end, eliminating
	    cross‑thread interference.

	(2) KSM configuration preservation – Saves original KSM settings and restores
	    them after the KSM test, avoiding persistent changes to system behaviour.

	(3) unlink in advance to prevent a potential file leak – unlink(filename) called
	    immediately after mkstemp, so the temporary file is automatically removed
	    even if the program crashes early.

 - Patch 3: a separate, standalone patch to update the MAINTAINERS file.

Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
           when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
           linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
           description (real workloads, lock contention examples,
           VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
           of multi-process), as suggested by David.

Changes in v4:
 - Add a tracepoint for rmap_walk
 - Provide a testbench for rmap_walk
 - Add vm_pgoff field in ksm_rmap_item
 - use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)

Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' by 'pgoff'.

Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)

xu xin (3):
  ksm: add linear_page_index into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable page index
  ksm: add mremap selftests for ksm_rmap_walk

 mm/ksm.c                          |  61 ++++++++++++++---
 tools/testing/selftests/mm/rmap.c | 107 ++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+), 8 deletions(-)

-- 
2.25.1



* [PATCH v10 1/3] ksm: add linear_page_index into ksm_rmap_item
  2026-06-29  9:41 KSM: performance optimizations for rmap_walk_ksm xu.xin16
@ 2026-06-29  9:42 ` xu.xin16
  2026-06-29  9:43 ` [PATCH v10 2/3] ksm: Optimize rmap_walk_ksm by passing a suitable page index xu.xin16
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-06-29  9:42 UTC (permalink / raw)
  To: akpm
  Cc: david, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
	ljs, xu.xin16

From: xu xin <xu.xin16@zte.com.cn>

As preparation for KSM rmap optimizations, let's track the original
linear_page_index() of a de-duplicated page in its ksm_rmap_item, so we can
efficiently search for the page in an address space, avoiding scanning the
entire address space. This was previously discussed in [1, 2].

To avoid growing ksm_rmap_item, let's squeeze it into the existing
structure by overlaying some members (oldchecksum, age, remaining_skips)
that are only relevant while on the unstable tree. The new entry will
only be relevant for entries in the stable tree.
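
To illustrate the overlay described above, here is a minimal userspace
sketch, not the kernel code. The names unstable_fields, rmap_item_tail,
model_make_stable() and model_leave_stable() are hypothetical; only the
member names and the 8-bit rmap_age_t come from mm/ksm.c. It is meant to
show why the union does not grow the structure, and why a stored index
has to be reset before the unstable-tree members are used again.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uint8_t rmap_age_t;	/* assumption: mirrors the kernel's u8 typedef */

/* Hypothetical: the three members as they were laid out before the patch. */
struct unstable_fields {
	unsigned int oldchecksum;
	rmap_age_t age;
	rmap_age_t remaining_skips;
};

/* Hypothetical userspace mirror of the overlaid part of ksm_rmap_item. */
struct rmap_item_tail {
	union {
		struct {
			unsigned int oldchecksum;
			rmap_age_t age;
			rmap_age_t remaining_skips;
		};				/* meaningful while unstable */
		unsigned long linear_page_index;	/* meaningful while stable */
	};
};

/* On common ABIs the union is no larger than the members it replaces. */
_Static_assert(sizeof(struct rmap_item_tail) == sizeof(struct unstable_fields),
	       "overlay should not grow the structure");

/* Entering the stable tree: the index overwrites the unstable members. */
static void model_make_stable(struct rmap_item_tail *item, unsigned long index)
{
	item->linear_page_index = index;
}

/* Leaving the stable tree: clear the index so no stale bits are read back. */
static void model_leave_stable(struct rmap_item_tail *item)
{
	item->linear_page_index = 0;
}
```

Storing an index clobbers oldchecksum (and, on 64-bit, the age fields),
which is why the sketch resets the index when the item leaves the stable
state, in the same spirit as the resets added by this patch.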

However, with the smart-scanning approach, should_skip_rmap_item() reads
the age information even while we have an entry in the stable tree if the
page has changed in the meantime (it is no longer a KSM page, for example
due to COW), so we have to change the handling there a bit.

We'll calculate the linear page index in try_to_merge_with_ksm_page(), when
adding the item to the stable tree, and reset the index (to reset overlaid
data) when removing an item from the stable tree, in
remove_rmap_item_from_tree(), remove_node_from_stable_tree() and
break_cow().
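
As a rough illustration of what gets stored, the sketch below models the
arithmetic of linear_page_index() in userspace and shows that the value
for a given page survives a VMA split, while vm_pgoff of the new VMA does
not stay the same. The type model_vma, the helpers model_linear_page_index()
and model_split_vma(), and the 4 KiB page size are assumptions made for
the example; this is not kernel code.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the few vm_area_struct fields that matter. */
struct model_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page index that corresponds to vm_start */
};

/* Page index of @address: offset into the VMA in pages, plus vm_pgoff. */
static unsigned long model_linear_page_index(const struct model_vma *vma,
					     unsigned long address)
{
	return ((address - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Split @vma at @addr: @vma keeps the low part, @high receives the rest.
 * The high part gets an adjusted vm_pgoff, so a stored vm_pgoff would no
 * longer describe the VMA that maps the page.
 */
static void model_split_vma(struct model_vma *vma, struct model_vma *high,
			    unsigned long addr)
{
	high->vm_start = addr;
	high->vm_end = vma->vm_end;
	high->vm_pgoff = vma->vm_pgoff +
			 ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
	vma->vm_end = addr;
}
```

In this model, the index computed for an address before the split equals
the index computed against the new high VMA afterwards, which is the
property that makes it usable as an interval tree lookup key.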

To clarify, the reason for resetting the stored index in break_cow()
is:

- When a page successfully becomes a KSM page (i.e., after
  stable_tree_append() sets STABLE_FLAG), both anon_vma and the index are
  stored and remain valid.

- However, during the merging process, there are several failure paths
  where we already prepared an rmap item to be added to the stable tree,
  but must revert that as some part of the merge process failed. Examples
  include:
    1. The second call to try_to_merge_with_ksm_page() fails in
       try_to_merge_two_pages().
    2. stable_tree_insert() fails in cmp_and_merge_page().
  In such cases, break_cow() is invoked to break the COW mapping and
  discard the KSM state.

Currently, break_cow() already contains a put_anon_vma(rmap_item->anon_vma)
to release the reference taken during the aborted merge. Because the index
is logically paired with anon_vma (both are only meaningful when the
rmap_item is in a stable state), it must also be cleared (or reset) in
break_cow() to avoid leaving stale linear_page_index values that could
confuse subsequent rmap walks or scanning logic.

[1] https://lore.kernel.org/all/adTPQSb-qSSHviJN@lucifer/
[2] https://lore.kernel.org/all/202604091806051535BJWZ_FTtdIm3Snk24ei_@zte.com.cn/

Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 mm/ksm.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7d5b76478f0b..60c6f959d81a 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -195,22 +195,28 @@ struct ksm_stable_node {
  * @node: rb node of this rmap_item in the unstable tree
  * @head: pointer to stable_node heading this list in the stable tree
  * @hlist: link into hlist of rmap_items hanging off that stable_node
- * @age: number of scan iterations since creation
- * @remaining_skips: how many scans to skip
+ * @age: number of scan iterations since creation (unstable node)
+ * @remaining_skips: how many scans to skip (unstable node)
+ * @linear_page_index: the original page's index before merged by KSM (stable node)
  */
 struct ksm_rmap_item {
 	struct ksm_rmap_item *rmap_list;
 	union {
-		struct anon_vma *anon_vma;	/* when stable */
+		struct anon_vma *anon_vma;	/* for reverse mapping, when stable */
 #ifdef CONFIG_NUMA
 		int nid;		/* when node of unstable tree */
 #endif
 	};
 	struct mm_struct *mm;
 	unsigned long address;		/* + low bits used for flags below */
-	unsigned int oldchecksum;	/* when unstable */
-	rmap_age_t age;
-	rmap_age_t remaining_skips;
+	union {
+		struct {
+			unsigned int oldchecksum;
+			rmap_age_t age;
+			rmap_age_t remaining_skips;
+		};			/* when unstable */
+		unsigned long linear_page_index;    /* for reverse mapping, when stable */
+	};
 	union {
 		struct rb_node node;	/* when node of unstable tree */
 		struct {		/* when listed from stable tree */
@@ -776,6 +782,11 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 	return vma;
 }

+/*
+ * break_cow: actively break COW, replacing the KSM page by a fresh anonymous
+ * page. This is called when rmap_item has not yet become stable, but page
+ * has been merged.
+ */
 static void break_cow(struct ksm_rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
@@ -787,6 +798,11 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
 	 * to undo, we also need to drop a reference to the anon_vma.
 	 */
 	put_anon_vma(rmap_item->anon_vma);
+	/*
+	 * Reset linear_page_index that might overlay age-related
+	 * information. (it's still unstable node)
+	 */
+	rmap_item->linear_page_index = 0;

 	mmap_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
@@ -899,6 +915,8 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
 		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
 		stable_node->rmap_hlist_len--;
 		put_anon_vma(rmap_item->anon_vma);
+		/* Reset linear_page_index that might overlay age-related information. */
+		rmap_item->linear_page_index = 0;
 		rmap_item->address &= PAGE_MASK;
 		cond_resched();
 	}
@@ -1052,6 +1070,8 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 		stable_node->rmap_hlist_len--;

 		put_anon_vma(rmap_item->anon_vma);
+		/* Reset linear_page_index that might overlay age-related information. */
+		rmap_item->linear_page_index = 0;
 		rmap_item->head = NULL;
 		rmap_item->address &= PAGE_MASK;

@@ -1598,8 +1618,15 @@ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item,
 	/* Unstable nid is in union with stable anon_vma: remove first */
 	remove_rmap_item_from_tree(rmap_item);

-	/* Must get reference to anon_vma while still holding mmap_lock */
+	/*
+	 * We can consider the VMA only while still holding the mmap
+	 * lock, so reference the anon_vma and calculate the linear
+	 * page index early, before stable_tree_append(). If anything goes
+	 * wrong that prevents the rmap_item from being added to the
+	 * stable_tree, break_cow() will clean it up.
+	 */
 	rmap_item->anon_vma = vma->anon_vma;
+	rmap_item->linear_page_index = linear_page_index(vma, rmap_item->address);
 	get_anon_vma(vma->anon_vma);
 out:
 	mmap_read_unlock(mm);
@@ -2458,6 +2485,13 @@ static bool should_skip_rmap_item(struct folio *folio,
 	if (folio_test_ksm(folio))
 		return false;

+	/*
+	 * There is no age information in stable-tree nodes. We might end up
+	 * here without a KSM page for example after COW.
+	 */
+	if (rmap_item->address & STABLE_FLAG)
+		return false;
+
 	age = rmap_item->age;
 	if (age != U8_MAX)
 		rmap_item->age++;
-- 
2.25.1



* [PATCH v10 2/3] ksm: Optimize rmap_walk_ksm by passing a suitable page index
  2026-06-29  9:41 KSM: performance optimizations for rmap_walk_ksm xu.xin16
  2026-06-29  9:42 ` [PATCH v10 1/3] ksm: add linear_page_index into ksm_rmap_item xu.xin16
@ 2026-06-29  9:43 ` xu.xin16
  2026-06-29  9:44 ` [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
  2026-06-29 10:00 ` KSM: performance optimizations for rmap_walk_ksm Lorenzo Stoakes
  3 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-06-29  9:43 UTC (permalink / raw)
  To: akpm
  Cc: david, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
	ljs, xu.xin16

From: xu xin <xu.xin16@zte.com.cn>

User impact / Why this matters to Linux users
=============================================
When a system runs with KSM enabled and memory becomes tight, KSM pages
may be swapped out or migrated. The kernel then performs a reverse map
walk via rmap_walk_ksm() to locate all page table entries that reference
these pages. If a large number of unrelated VMAs are attached to the
single anon_vma associated with such a KSM page, the rmap walk can become
a severe performance bottleneck. In our embedded test environment, we
observed ~20,000 VMAs sharing one anon_vma without any fork, purely from
VMA splits, which caused rmap_walk_ksm() to take 200~700 ms.

When one of those VMAs maps a KSM page, the reverse-map walk for that
KSM page becomes a bottleneck because it holds the anon_vma lock for a
long time. The anon_vma lock is not only used by KSM; it is a core lock
protecting the VMA interval tree and is acquired by many critical memory
operations:

  • Page faults: do_anonymous_page(), do_wp_page() (during COW)
  • Memory reclaim: try_to_unmap()
  • Page migration & compaction: migrate_pages(), compact_zone()
  • mlock / munlock: mlock_fixup()
  • Process exit: exit_mmap() (tearing down VMAs)
  • Cgroup memory accounting: mem_cgroup_move_charge()

If one thread holds the anon_vma lock for hundreds of milliseconds
because of an inefficient KSM rmap walk, any other thread that
tries to acquire the same lock (e.g., an application taking a page
fault, kswapd reclaiming pages, or a migration thread) will block.
This leads to stalled application threads, increased latency
spikes, and in extreme cases container timeouts or watchdog
triggers.

This patch reduces the worst-case anon_vma lock hold time during
rmap_walk_ksm() from >500 ms to <2 ms, thereby almost eliminating
this source of lock contention and improving system responsiveness
under memory pressure.

Real-world examples:
====================
 - JVM / Go runtime: These use mmap for heap regions and later call
mprotect(PROT_NONE) for garbage collection barriers or guard pages,
splitting the original VMA into thousands of small pieces over time.

 - Database engines (MySQL, PostgreSQL): Large shared memory buffers
or anonymous mappings are managed with madvise(MADV_DONTNEED) to
release specific pages, which also splits VMAs.

Root Cause
==========
Through local debugging trace analysis, we found that most of the
latency of rmap_walk_ksm occurs within anon_vma_interval_tree_foreach,
leading to an excessively long hold time on the anon_vma lock (even
reaching 500ms or more), which in turn causes upper-layer applications
(waiting for the anon_vma lock) to be blocked for extended periods.

Further investigation revealed that 99.9% of iterations inside the
anon_vma_interval_tree_foreach loop are skipped by the first check,
"if (addr < vma->vm_start || addr >= vma->vm_end)", indicating that
almost all loop iterations do no useful work. This inefficiency
arises because the start and end page index parameters passed to
anon_vma_interval_tree_foreach span the entire address space from 0
to ULONG_MAX, so every VMA attached to the anon_vma is visited.

Solution
========
The anon_vma alone is not enough to efficiently locate all PTEs mapping
this page; we also need the original page's linear_page_index. This
follows from how anon_vma_interval_tree_foreach works: it essentially
iterates over the VMAs whose page index range contains the provided
page index, that is:

vm_pgoff <= original linear page offset <= (vm_pgoff + vma_pages(v) - 1)

Fortunately, an earlier commit introduced the linear_page_index to struct
ksm_rmap_item, allowing for optimizing the RMAP walk.
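
To make the effect of narrowing the query concrete, the following
userspace sketch counts how many VMAs a walk would have to visit for a
given page index range. It is a simplified model under stated
assumptions: model_vma_range, model_fill_split_vmas() and
model_count_candidates() are hypothetical names, the VMAs are single-page
fragments as left behind by repeated splitting, and a linear scan stands
in for the interval tree, which finds the same set of overlapping VMAs
without visiting the others.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the page index range covered by one VMA. */
struct model_vma_range {
	unsigned long vm_pgoff;	/* first page index covered */
	unsigned long nr_pages;	/* number of pages covered, at least 1 */
};

/* Lay out @n single-page VMAs back to back, starting at @base_pgoff. */
static void model_fill_split_vmas(struct model_vma_range *vmas, size_t n,
				  unsigned long base_pgoff)
{
	size_t i;

	for (i = 0; i < n; i++) {
		vmas[i].vm_pgoff = base_pgoff + i;
		vmas[i].nr_pages = 1;
	}
}

/*
 * Count the VMAs whose range [vm_pgoff, vm_pgoff + nr_pages - 1] overlaps
 * the queried range [first, last]. Each match is one loop iteration in
 * the rmap walk, executed while the anon_vma lock is held.
 */
static size_t model_count_candidates(const struct model_vma_range *vmas,
				     size_t n, unsigned long first,
				     unsigned long last)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		unsigned long vma_first = vmas[i].vm_pgoff;
		unsigned long vma_last = vma_first + vmas[i].nr_pages - 1;

		if (vma_first <= last && vma_last >= first)
			hits++;
	}
	return hits;
}
```

With 20,000 such VMAs, querying 0..ULONG_MAX matches all of them, while
querying index..index matches only the VMA that covers that page index.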

Test results
============
An rmap testbench can be obtained with two Out-Of-Tree patches at [1][2].
After applying the OOT patches and building rmap_benchmark from
tools/testing/rmap/rmap_benchmark.c, we can start the performance test.

The testing result in QEMU is shown as follows:

KSM rmapping	Maximum duration		Average duration

Before:		705.12 ms (705119858 ns)	532.04 ms (532041586 ns)
After:		1.67 ms (1665917 ns)		1.44 ms (1443784 ns)

The benchmark numbers are realistic, since we observed ~20,000 VMAs
sharing one anon_vma on a production system running a Java application
with KSM enabled. The lock hold time before the patch was measured at
228 ms (max) during rmap walks triggered by memory compaction and page
migration. The benchmark reproduces that VMA count and lock‑hold
behavior in a controlled environment.

[1] https://lore.kernel.org/all/202605301703094695zmVgcSC27BNR0rH0N8_x@zte.com.cn
[2] https://lore.kernel.org/all/20260530170404509QpJmBtpSjn3uQHeVKA2iA@zte.com.cn/

Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 mm/ksm.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 60c6f959d81a..454ba2eb46e9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3207,6 +3207,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		/* Ignore the stable/unstable/sqnr flags */
 		const unsigned long addr = rmap_item->address & PAGE_MASK;
+		const unsigned long index = rmap_item->linear_page_index;
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
@@ -3220,8 +3221,18 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 			anon_vma_lock_read(anon_vma);
 		}

+		/*
+		 * Currently, KSM folios are always small folios, so it's
+		 * sufficient to search for a single page. We can simply use
+		 * the linear_page_index of the original de-duplicate
+		 * anonymous page that we remembered in the rmap_item while
+		 * de-duplicating. Note that mremap() always de-duplicates KSM
+		 * folios: so if there was mremap() in our parent or our child,
+		 * we wouldn't have the KSM folio mapped in these processes
+		 * anymore.
+		 */
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
-					       0, ULONG_MAX) {
+					       index, index) {

 			cond_resched();
 			vma = vmac->vma;
-- 
2.25.1


* [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk
  2026-06-29  9:41 KSM: performance optimizations for rmap_walk_ksm xu.xin16
  2026-06-29  9:42 ` [PATCH v10 1/3] ksm: add linear_page_index into ksm_rmap_item xu.xin16
  2026-06-29  9:43 ` [PATCH v10 2/3] ksm: Optimize rmap_walk_ksm by passing a suitable page index xu.xin16
@ 2026-06-29  9:44 ` xu.xin16
  2026-06-29 11:59   ` David Hildenbrand (Arm)
  2026-06-29 10:00 ` KSM: performance optimizations for rmap_walk_ksm Lorenzo Stoakes
  3 siblings, 1 reply; 7+ messages in thread
From: xu.xin16 @ 2026-06-29  9:44 UTC (permalink / raw)
  To: akpm
  Cc: david, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
	ljs, xu.xin16

From: xu xin <xu.xin16@zte.com.cn>

The existing tools/testing/selftests/mm/rmap.c already has one test case
for the KSM rmap walk, TEST_F(migrate, ksm), which migrates a page from
one NUMA node to another. However, it does not cover the scenario of
mremapped VMAs.

We add a call to mremap() and then trigger KSM to merge the pages before
migrating them, specifically to test the optimization introduced by the
previous patch ("ksm: Optimize rmap_walk_ksm by passing a suitable page
index").

This test can reproduce the issue that Hugh points out at
https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 tools/testing/selftests/mm/rmap.c | 107 ++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/tools/testing/selftests/mm/rmap.c b/tools/testing/selftests/mm/rmap.c
index 53f2058b0ef2..ad731bdff4fa 100644
--- a/tools/testing/selftests/mm/rmap.c
+++ b/tools/testing/selftests/mm/rmap.c
@@ -430,4 +430,111 @@ TEST_F(migrate, ksm)
 	propagate_children(_metadata, data);
 }

+/*
+ * Check if All PFNs of the region are the same to the input pfn.
+ *
+ * @pagemap_fd: file descriptor of /proc/pid/pagemap.
+ * @region: the start address of the associated region.
+ * @nr_pages: the number of pages that the region contains.
+ * @pfn: the referenced PFN.
+ */
+static bool merged_to_pfn(int pagemap_fd, void *region, int nr_pages,
+		unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		if (pagemap_get_pfn(pagemap_fd, region + i * getpagesize()) != pfn)
+			return false;
+
+	return true;
+}
+
+static int mremap_merge_and_migrate(struct global_data *data)
+{
+	int ret, pagemap_fd;
+	void *old_region;
+	void *new_region;
+	int nr_pages = 32;
+	unsigned long old_pfn;
+
+	/* Allocate exactly pages for the test */
+	data->mapsize = nr_pages * getpagesize();
+	data->region = mmap(NULL, data->mapsize, PROT_READ | PROT_WRITE,
+			    MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (data->region == MAP_FAILED)
+		ksft_exit_fail_perror("mmap failed");
+	memset(data->region, 0x77, data->mapsize);
+
+	/*
+	 * Mremap the second half region to the first half location (FIXED).
+	 */
+	old_region = data->region;
+	new_region = mremap(old_region + data->mapsize / 2, data->mapsize / 2,
+			    data->mapsize / 2, MREMAP_MAYMOVE | MREMAP_FIXED,
+			    old_region);
+	if (new_region == MAP_FAILED) {
+		ksft_print_msg("mremap failed: %s\n", strerror(errno));
+		return FAIL_ON_CHECK;
+	}
+	data->region = new_region;
+	data->mapsize /= 2;
+
+	/* madvise MADV_MERGABLE and merge these pages */
+	madvise(data->region, data->mapsize, MADV_MERGEABLE);
+	if (ksm_start() < 0)
+		return FAIL_ON_WORK;
+
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd == -1)
+		return FAIL_ON_CHECK;
+
+	*data->expected_pfn = pagemap_get_pfn(pagemap_fd, data->region);
+	if (*data->expected_pfn == -1ul)
+		return FAIL_ON_CHECK;
+	old_pfn = *data->expected_pfn;
+
+	/* Before migrating, check if All pages's PFN are the same */
+	if (!merged_to_pfn(pagemap_fd, data->region, nr_pages / 2,
+		 *data->expected_pfn)) {
+		ksft_print_msg("After KSM merging, PFNs are not the same\n");
+		return FAIL_ON_CHECK;
+	}
+
+	/* Attempt to migrate the merged KSM page */
+	ret = try_to_move_page(data->region);
+	if (ret) {
+		ksft_print_msg("migration of KSM page after mremap failed\n");
+		return FAIL_ON_CHECK;
+	}
+
+	/* After migrating, check if all PFN aren't the old */
+	*data->expected_pfn = pagemap_get_pfn(pagemap_fd, data->region);
+	if (*data->expected_pfn == -1ul || *data->expected_pfn == old_pfn)
+		return FAIL_ON_CHECK;
+
+	if (*data->expected_pfn == old_pfn ||
+		!merged_to_pfn(pagemap_fd, data->region, nr_pages / 2,
+		*data->expected_pfn)) {
+		ksft_print_msg("Bug migration: still old PFN or PFNs are not expected\n");
+		return FAIL_ON_CHECK;
+	}
+
+	return 0;
+}
+
+
+TEST_F(migrate, ksm_and_mremap)
+{
+	struct global_data *data = &self->data;
+
+	/* Skip if KSM is not available */
+	if (ksm_stop() < 0)
+		SKIP(return, "accessing \"/sys/kernel/mm/ksm/run\" failed");
+	if (ksm_get_full_scans() < 0)
+		SKIP(return, "accessing \"/sys/kernel/mm/ksm/full_scan\" failed");
+
+	ASSERT_EQ(mremap_merge_and_migrate(data), 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.25.1



* Re: [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk
  2026-06-29  9:44 ` [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
@ 2026-06-29 11:59   ` David Hildenbrand (Arm)
  2026-06-29 14:43     ` xu.xin16
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 11:59 UTC (permalink / raw)
  To: xu.xin16, akpm
  Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs

On 6/29/26 11:44, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> The existing tools/testing/selftests/mm/rmap.c already has one test case
> for the KSM rmap walk, TEST_F(migrate, ksm), which migrates a page from
> one NUMA node to another. However, it does not cover the scenario of
> mremapped VMAs.
> 
> We add a call to mremap() and then trigger KSM to merge the pages before
> migrating them, specifically to test the optimization introduced by the
> previous patch ("ksm: Optimize rmap_walk_ksm by passing a suitable page
> index").
> 
> This test can reproduce the issue that Hugh points out at
> https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
> 
> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> ---
>  tools/testing/selftests/mm/rmap.c | 107 ++++++++++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
> 
> diff --git a/tools/testing/selftests/mm/rmap.c b/tools/testing/selftests/mm/rmap.c
> index 53f2058b0ef2..ad731bdff4fa 100644
> --- a/tools/testing/selftests/mm/rmap.c
> +++ b/tools/testing/selftests/mm/rmap.c
> @@ -430,4 +430,111 @@ TEST_F(migrate, ksm)
>  	propagate_children(_metadata, data);
>  }
> 
> +/*
> + * Check if All PFNs of the region are the same to the input pfn.
> + *
> + * @pagemap_fd: file descriptor of /proc/pid/pagemap.
> + * @region: the start address of the associated region.
> + * @nr_pages: the number of pages that the region contains.
> + * @pfn: the referenced PFN.
> + */
> +static bool merged_to_pfn(int pagemap_fd, void *region, int nr_pages,
> +		unsigned long pfn)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		if (pagemap_get_pfn(pagemap_fd, region + i * getpagesize()) != pfn)
> +			return false;
> +
> +	return true;
> +}
> +
> +static int mremap_merge_and_migrate(struct global_data *data)
> +{
> +	int ret, pagemap_fd;
> +	void *old_region;
> +	void *new_region;
> +	int nr_pages = 32;
> +	unsigned long old_pfn;
> +
> +	/* Allocate exactly pages for the test */
> +	data->mapsize = nr_pages * getpagesize();
> +	data->region = mmap(NULL, data->mapsize, PROT_READ | PROT_WRITE,
> +			    MAP_PRIVATE | MAP_ANON, -1, 0);
> +	if (data->region == MAP_FAILED)
> +		ksft_exit_fail_perror("mmap failed");
> +	memset(data->region, 0x77, data->mapsize);
> +
> +	/*
> +	 * Mremap the second half region to the first half location (FIXED).
> +	 */
> +	old_region = data->region;
> +	new_region = mremap(old_region + data->mapsize / 2, data->mapsize / 2,
> +			    data->mapsize / 2, MREMAP_MAYMOVE | MREMAP_FIXED,
> +			    old_region);
> +	if (new_region == MAP_FAILED) {
> +		ksft_print_msg("mremap failed: %s\n", strerror(errno));
> +		return FAIL_ON_CHECK;
> +	}
> +	data->region = new_region;
> +	data->mapsize /= 2;
> +
> +	/* madvise MADV_MERGABLE and merge these pages */
> +	madvise(data->region, data->mapsize, MADV_MERGEABLE);
> +	if (ksm_start() < 0)
> +		return FAIL_ON_WORK;
> +
> +	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);

I think there are various improvements and simplifications we can perform. In particular,
I don't think we need the error messages or to use the data-> members.

What about the following simplification, to move this over the finishing line? (untested)

There is a low chance of page compaction migrating the page while we check for it. Not sure
if we should handle it (but it would involve retrying on PFN mismatch).


diff --git a/tools/testing/selftests/mm/rmap.c b/tools/testing/selftests/mm/rmap.c
index 53f2058b0ef2b..1b7ab46a520cf 100644
--- a/tools/testing/selftests/mm/rmap.c
+++ b/tools/testing/selftests/mm/rmap.c
@@ -430,4 +430,68 @@ TEST_F(migrate, ksm)
 	propagate_children(_metadata, data);
 }
 
+static bool range_maps_pfn(int pagemap_fd, void *region, int nr_pages,
+		unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		if (pagemap_get_pfn(pagemap_fd, region + i * getpagesize()) != pfn)
+			return false;
+	return true;
+}
+
+TEST_F(migrate, ksm_and_mremap)
+{
+	unsigned long old_pfn, new_pfn;
+	void *region, *mremap_region;
+	const int nr_pages = 16;
+	size_t mmap_size;
+	int pagemap_fd;
+
+	/* Skip if KSM is not available */
+	if (ksm_stop() < 0)
+		SKIP(return, "accessing \"/sys/kernel/mm/ksm/run\" failed");
+	if (ksm_get_full_scans() < 0)
+		SKIP(return, "accessing \"/sys/kernel/mm/ksm/full_scan\" failed");
+
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd < 0)
+		SKIP(return, "opening pagemap failed");
+
+	/* Allocate and populate twice the anon pages initially. */
+	mmap_size = 2 * nr_pages * getpagesize();
+	region = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+		      MAP_PRIVATE | MAP_ANON, -1, 0);
+	ASSERT_NE(region, MAP_FAILED);
+	memset(region, 0x77, mmap_size);
+
+	/* mremap the second half over the first half, to stress rmap handling */
+	mmap_size /= 2;
+	mremap_region = mremap(region + mmap_size, mmap_size, mmap_size,
+			       MREMAP_MAYMOVE | MREMAP_FIXED, region);
+	ASSERT_EQ(mremap_region, region);
+
+	/* Merge all pages into a single KSM page. */
+	madvise(region, mmap_size, MADV_MERGEABLE);
+	ASSERT_EQ(ksm_start(), 0);
+
+	/* The whole range should map the same KSM page. */
+	old_pfn = pagemap_get_pfn(pagemap_fd, region);
+	if (old_pfn == -1ul)
+		SKIP(return, "Obtaining PFN failed");
+	ASSERT_TRUE(range_maps_pfn(pagemap_fd, region, nr_pages, old_pfn));
+
+	/*
+	 * Migrate the KSM page; the whole range should map the new (migrated)
+	 * KSM page.
+	 */
+	ASSERT_EQ(try_to_move_page(region), 0);
+	new_pfn = pagemap_get_pfn(pagemap_fd, region);
+	if (new_pfn == -1ul)
+		SKIP(return, "Obtaining PFN failed");
+	ASSERT_NE(new_pfn, old_pfn);
+	ASSERT_TRUE(range_maps_pfn(pagemap_fd, region, nr_pages, new_pfn));
+}
+
 TEST_HARNESS_MAIN
-- 
2.43.0

-- 
Cheers,

David



* Re: [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk
  2026-06-29 11:59   ` David Hildenbrand (Arm)
@ 2026-06-29 14:43     ` xu.xin16
  0 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-06-29 14:43 UTC (permalink / raw)
  To: david; +Cc: akpm, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
	ljs

> I think there are various improvements and simplifications we can perform. In particular,
> I don't think we need the error messages or the data-> members.
> 
> What about the following simplification, to move this over the finishing line? (untested)

Agreed. Your version is cleaner and tests successfully.

> 
> There is a small chance of page compaction migrating the page while we check for it. Not sure
> if we should handle it (it would involve retrying on PFN mismatch).

That is indeed possible. To improve robustness, how about retrying once inside range_maps_pfn()
and renaming it to range_maps_the_same_pfn(), which calculates the first PFN itself instead of
taking the PFN as input, as follows:

static bool range_maps_the_same_pfn(int pagemap_fd, void *region, int nr_pages)
{
        unsigned long first_pfn;
        bool retried = false;
        int i;

again:
        /* Use the PFN of the first page as the reference for the whole range. */
        first_pfn = pagemap_get_pfn(pagemap_fd, region);
        for (i = 0; i < nr_pages; i++) {
                if (pagemap_get_pfn(pagemap_fd, region + i * getpagesize()) == first_pfn)
                        continue;
                if (retried)
                        return false;
                /*
                 * There is a small chance of page compaction migrating the
                 * page while we check the PFNs, so retry once.
                 */
                retried = true;
                goto again;
        }

        return true;
}
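
One thing the caller still needs is the PFN itself, so that the value before
migration can be compared with the value after it. A rough, untested sketch of
a variant that retries a bounded number of times and hands the common PFN back
is shown below. Everything in it is an assumption made for illustration:
range_common_pfn(), pfn_reader_t, PFN_INVALID and the fake reader are
hypothetical names, and in the selftest the reader would wrap
pagemap_get_pfn().

```c
#include <stdbool.h>
#include <stddef.h>

#define PFN_INVALID	(~0UL)

/* Hypothetical reader: returns the PFN backing page number page_idx. */
typedef unsigned long (*pfn_reader_t)(void *ctx, size_t page_idx);

/*
 * Return true and store the PFN in *pfn_out if all nr_pages pages map the
 * same PFN. A mismatch restarts the scan, at most max_tries scans in total,
 * to tolerate a migration racing with the check.
 */
static bool range_common_pfn(pfn_reader_t read_pfn, void *ctx, size_t nr_pages,
			     int max_tries, unsigned long *pfn_out)
{
	size_t i;
	int try;

	for (try = 0; try < max_tries; try++) {
		unsigned long first = read_pfn(ctx, 0);

		if (first == PFN_INVALID)
			return false;
		for (i = 1; i < nr_pages; i++)
			if (read_pfn(ctx, i) != first)
				break;
		if (i >= nr_pages) {
			*pfn_out = first;
			return true;
		}
	}
	return false;
}

/* Fake reader for illustration: the PFN changes after switch_at reads. */
struct fake_pagemap {
	int calls;
	int switch_at;
	unsigned long old_pfn, new_pfn;
};

static unsigned long fake_read(void *ctx, size_t page_idx)
{
	struct fake_pagemap *f = ctx;

	(void)page_idx;
	return f->calls++ < f->switch_at ? f->old_pfn : f->new_pfn;
}
```

With the fake reader switching PFNs partway through the first scan, a single
try reports a mismatch while two tries settle on the new PFN. The trade-off is
that a retry can also hide a transient inconsistency, so the number of tries
should stay small.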





* Re: KSM: performance optimizations for rmap_walk_ksm
  2026-06-29  9:41 KSM: performance optimizations for rmap_walk_ksm xu.xin16
                   ` (2 preceding siblings ...)
  2026-06-29  9:44 ` [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
@ 2026-06-29 10:00 ` Lorenzo Stoakes
  3 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2026-06-29 10:00 UTC (permalink / raw)
  To: xu.xin16
  Cc: akpm, david, chengming.zhou, hughd, wang.yaxin, linux-mm,
	linux-kernel

Thanks for the detailed cover letter (will try to have a look through!), one
very small trivial thing - the cover letter is missing the [PATCH vXX 00/yy]
prefix :P

Cheers, Lorenzo




Thread overview: 7+ messages
2026-06-29  9:41 KSM: performance optimizations for rmap_walk_ksm xu.xin16
2026-06-29  9:42 ` [PATCH v10 1/3] ksm: add linear_page_index into ksm_rmap_item xu.xin16
2026-06-29  9:43 ` [PATCH v10 2/3] ksm: Optimize rmap_walk_ksm by passing a suitable page index xu.xin16
2026-06-29  9:44 ` [PATCH v10 3/3] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
2026-06-29 11:59   ` David Hildenbrand (Arm)
2026-06-29 14:43     ` xu.xin16
2026-06-29 10:00 ` KSM: performance optimizations for rmap_walk_ksm Lorenzo Stoakes
