* [PATCH v4 1/5] mm/rmap: add tracepoint for rmap_walk
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
@ 2026-05-03 12:39 ` xu.xin16
2026-05-03 12:42 ` [PATCH v4 2/5] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages xu.xin16
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-05-03 12:39 UTC (permalink / raw)
To: xu.xin16, akpm, david, ljs; +Cc: hughd, linux-mm, linux-kernel, michel
From: xu xin <xu.xin16@zte.com.cn>
Add a new tracepoint rmap_walk in mm/rmap.c to monitor reverse mapping
traversal. The tracepoint records the duration (in nanoseconds), the
type of the folio (KSM, anonymous, or file-backed), and the addresses
of the folio and rmap_walk_control structures. The type determination
is performed inside the tracepoint to keep the function itself
lightweight.
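The tracepoint is enabled through the standard tracefs interface before
reading the trace buffer, e.g. (assuming tracefs is mounted at the default
location, the same path the benchmark in the next patch uses):

# echo 1 > /sys/kernel/tracing/events/rmap/rmap_walk/enable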
# cat /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 408/408   #P:4
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
rmap-215 [001] ..... 692.237079: rmap_walk: folio=000000000029ddcb rwc=00000000dac4cda0 duration_ns=828682 page_type=ksm locked=false
rmap-215 [001] ..... 692.239480: rmap_walk: folio=0000000092a21fd3 rwc=00000000986376ff duration_ns=905966 page_type=ksm locked=false
rmap-230 [003] ..... 692.583619: rmap_walk: folio=0000000037a237b6 rwc=0000000080fbbb0a duration_ns=107892 page_type=ksm locked=false
rmap-230 [003] ..... 692.584104: rmap_walk: folio=0000000031bdbf8b rwc=00000000b39c973a duration_ns=330886 page_type=ksm locked=false
rmap-244 [003] ..... 692.708706: rmap_walk: folio=000000009105fa6b rwc=0000000037d46cd7 duration_ns=987826 page_type=file locked=false
rmap-244 [003] ..... 692.709198: rmap_walk: folio=000000009105fa6b rwc=0000000093942e2c duration_ns=161733 page_type=file locked=false
rmap-244 [003] ..... 692.709606: rmap_walk: folio=000000009105fa6b rwc=0000000037d46cd7 duration_ns=54428 page_type=file locked=false
rmap-244 [003] ..... 692.709658: rmap_walk: folio=000000009105fa6b rwc=0000000093942e2c duration_ns=27192 page_type=file locked=false
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
include/trace/events/rmap.h | 49 +++++++++++++++++++++++++++++++++++++
mm/rmap.c | 14 +++++++++++
2 files changed, 63 insertions(+)
create mode 100644 include/trace/events/rmap.h
diff --git a/include/trace/events/rmap.h b/include/trace/events/rmap.h
new file mode 100644
index 000000000000..987fa204d65d
--- /dev/null
+++ b/include/trace/events/rmap.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rmap
+
+#if !defined(_TRACE_RMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RMAP_H
+
+#include <linux/tracepoint.h>
+#include <linux/rmap.h>
+
+#define GET_RMAP_PAGE_TYPE(folio) (folio_test_ksm(folio) ? "ksm" : \
+ (folio_test_anon(folio) ? "anon" : "file"))
+
+TRACE_EVENT(rmap_walk,
+
+ TP_PROTO(struct folio *folio, struct rmap_walk_control *rwc, u64 duration_ns, bool locked),
+
+ TP_ARGS(folio, rwc, duration_ns, locked),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, folio_addr)
+ __field(unsigned long, rwc_addr)
+ __field(u64, duration_ns)
+ __string(page_type, GET_RMAP_PAGE_TYPE(folio))
+ __field(bool, locked)
+ ),
+
+ TP_fast_assign(
+ __entry->folio_addr = (unsigned long)folio;
+ __entry->rwc_addr = (unsigned long)rwc;
+ __entry->duration_ns = duration_ns;
+ __assign_str(page_type);
+ __entry->locked = locked;
+ ),
+
+ TP_printk("folio=%p rwc=%p duration_ns=%llu page_type=%s locked=%s",
+ (void *)(unsigned long)__entry->folio_addr,
+ (void *)(unsigned long)__entry->rwc_addr,
+ __entry->duration_ns,
+ __get_str(page_type),
+ __entry->locked ? "true" : "false")
+);
+
+
+
+#endif /* _TRACE_RMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..14bf8483f38b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -75,11 +75,13 @@
#include <linux/userfaultfd_k.h>
#include <linux/mm_inline.h>
#include <linux/oom.h>
+#include <linux/sched/clock.h>
#include <asm/tlb.h>
#define CREATE_TRACE_POINTS
#include <trace/events/migrate.h>
+#include <trace/events/rmap.h>
#include "internal.h"
#include "swap.h"
@@ -3098,23 +3100,35 @@ static void rmap_walk_file(struct folio *folio,
void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc)
{
+ u64 ts_start, delta_ns;
+ ts_start = local_clock();
+
if (unlikely(folio_test_ksm(folio)))
rmap_walk_ksm(folio, rwc);
else if (folio_test_anon(folio))
rmap_walk_anon(folio, rwc, false);
else
rmap_walk_file(folio, rwc, false);
+
+ delta_ns = local_clock() - ts_start;
+ trace_rmap_walk(folio, rwc, delta_ns, false);
}
/* Like rmap_walk, but caller holds relevant rmap lock */
void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc)
{
+ u64 ts_start, delta_ns;
+ ts_start = local_clock();
+
/* no ksm support for now */
VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio);
if (folio_test_anon(folio))
rmap_walk_anon(folio, rwc, true);
else
rmap_walk_file(folio, rwc, true);
+
+ delta_ns = local_clock() - ts_start;
+ trace_rmap_walk(folio, rwc, delta_ns, true);
}
#ifdef CONFIG_HUGETLB_PAGE
--
2.25.1
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v4 2/5] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
2026-05-03 12:39 ` [PATCH v4 1/5] mm/rmap: add tracepoint for rmap_walk xu.xin16
@ 2026-05-03 12:42 ` xu.xin16
2026-05-03 12:48 ` [PATCH v4 3/5] ksm: add vm_pgoff into ksm_rmap_item xu.xin16
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-05-03 12:42 UTC (permalink / raw)
To: xu.xin16, akpm, david, ljs; +Cc: hughd, linux-mm, linux-kernel, michel
From: xu xin <xu.xin16@zte.com.cn>
When a physical page is shared among many VMAs (e.g., KSM merging or
fork with COW), reverse mapping traversal (rmap_walk) can become a
performance bottleneck. The cost of walking the stable_node->hlist and
checking each VMA's page tables increases linearly with the number of
sharers. KSM pages that are merged across many processes are especially
affected.
Add a new benchmark that measures rmap_walk latency under controlled
conditions. The test creates a large region (20,000 pages by default),
optionally splits the VMA into many small VMAs by mprotect(PROT_READ)
on every other page, then triggers rmap_walk by migrating a page to
another NUMA node via move_pages(). The existing rmap_walk tracepoint
(events/rmap/rmap_walk) is used to collect duration_ns for events
with page_type=ksm, page_type=anon, and page_type=file.
Three separate test cases are run:

KSM pages: allocate an anonymous region, fill it with identical data,
mark it MADV_MERGEABLE, wait for KSM to merge all pages (by polling
/sys/kernel/mm/ksm/full_scans), then trigger migration.

Anonymous pages: similar, but without KSM merging.

File pages: mmap a temporary file with a shared mapping and fill it with
identical data.
For each test, the program prints the number of captured events and
the maximum / average latency in milliseconds.
This benchmark helps developers evaluate optimizations in the reverse
mapping code, such as limiting max_page_sharing or improving tree
traversal efficiency.
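For instance, the sharing bound can be tuned through the existing KSM sysfs
knob (the value below is illustrative; note the kernel only accepts a new
value while no KSM pages are currently shared):

# echo 256 > /sys/kernel/mm/ksm/max_page_sharing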
Usage (must be run as root):
cd tools/testing/rmap/ && make
sudo ./rmap_bench
=== Testing KSM pages ===
Triggering rmap_walk via move_pages...
KSM rmap_walk latency:
Maximum duration: 705.12 ms (705119858 ns)
Average duration: 532.04 ms (532041586 ns)
Count: 4 events
=== Testing anonymous pages ===
Triggering rmap_walk via move_pages...
Anonymous page rmap_walk latency:
Maximum duration: 0.07 ms (69329 ns)
Average duration: 0.05 ms (48287 ns)
Count: 2 events
=== Testing file pages ===
Triggering rmap_walk via move_pages...
File page rmap_walk latency:
Maximum duration: 0.07 ms (67090 ns)
Average duration: 0.03 ms (30082 ns)
Count: 4 events
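For a quick manual check, the maximum latency for a given page type can also
be pulled straight from the trace buffer; a shell sketch (assumes the
rmap_walk tracepoint from the previous patch is enabled):

# grep 'page_type=ksm' /sys/kernel/tracing/trace | \
	sed 's/.*duration_ns=\([0-9]*\).*/\1/' | sort -n | tail -1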
Also update the REVERSE MAPPING section in MAINTAINERS.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
MAINTAINERS | 3 +
tools/testing/rmap/Makefile | 11 +
tools/testing/rmap/rmap_benchmark.c | 488 ++++++++++++++++++++++++++++
3 files changed, 502 insertions(+)
create mode 100644 tools/testing/rmap/Makefile
create mode 100644 tools/testing/rmap/rmap_benchmark.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 8e7268d2f6ec..01cc34cc83a2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17006,11 +17006,14 @@ R: Liam R. Howlett <liam@infradead.org>
R: Vlastimil Babka <vbabka@kernel.org>
R: Harry Yoo <harry@kernel.org>
R: Jann Horn <jannh@google.com>
+R: Xu Xin <xu.xin16@zte.com.cn>
L: linux-mm@kvack.org
S: Maintained
F: include/linux/rmap.h
+F: include/trace/events/rmap.h
F: mm/page_vma_mapped.c
F: mm/rmap.c
+F: tools/testing/rmap/rmap_benchmark.c
F: tools/testing/selftests/mm/rmap.c
MEMORY MANAGEMENT - SECRETMEM
diff --git a/tools/testing/rmap/Makefile b/tools/testing/rmap/Makefile
new file mode 100644
index 000000000000..200bd364cafb
--- /dev/null
+++ b/tools/testing/rmap/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+CC := $(CROSS_COMPILE)gcc
+
+PROGS := rmap_benchmark
+
+all: $(PROGS)
+
+rmap_benchmark: LDLIBS = -lnuma
+
+clean:
+ rm -fr $(PROGS)
diff --git a/tools/testing/rmap/rmap_benchmark.c b/tools/testing/rmap/rmap_benchmark.c
new file mode 100644
index 000000000000..5bbeb26fb2b8
--- /dev/null
+++ b/tools/testing/rmap/rmap_benchmark.c
@@ -0,0 +1,488 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Reverse mapping latency test for KSM, anonymous and file pages
+ *
+ * This program creates a large number of pages (KSM merged, normal anonymous,
+ * or file mapped), splits the VMA into many small VMAs via mprotect,
+ * triggers rmap_walk by move_pages(), and collects latency data from the
+ * tracepoint 'rmap_walk'.
+ *
+ * Usage: must be run as root (to access tracefs and KSM sysfs).
+ *
+ * Copyright 2026, ZTE Corp.
+ *
+ * Author(s): Xu Xin <xu.xin16@zte.com.cn>
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/mount.h>
+#include <numaif.h>
+#include <numa.h>
+#include <time.h>
+#include <ctype.h>
+
+/* Page size and test parameters */
+#define PAGE_SIZE 4096
+#define NR_PAGES 20000 /* Number of virtual pages */
+#define TEST_PATTERN 0xaa
+
+/* KSM sysfs paths */
+#define KSM_RUN_PATH "/sys/kernel/mm/ksm/run"
+#define KSM_SLEEP_MS_PATH "/sys/kernel/mm/ksm/sleep_millisecs"
+#define KSM_PAGES_TO_SCAN "/sys/kernel/mm/ksm/pages_to_scan"
+#define KSM_FULL_SCANS_PATH "/sys/kernel/mm/ksm/full_scans"
+
+/* Tracepoint control paths */
+#define TRACE_ENABLE "/sys/kernel/tracing/events/rmap/rmap_walk/enable"
+#define TRACE_FILE "/sys/kernel/tracing/trace"
+
+/*
+ * Page types for rmap_walk tracepoint filtering
+ */
+enum page_type {
+ PAGE_TYPE_KSM,
+ PAGE_TYPE_ANON,
+ PAGE_TYPE_FILE,
+};
+
+static const char *page_type_str(enum page_type type)
+{
+ switch (type) {
+ case PAGE_TYPE_KSM: return "ksm";
+ case PAGE_TYPE_ANON: return "anon";
+ case PAGE_TYPE_FILE: return "file";
+ default: return "unknown";
+ }
+}
+
+/*
+ * Write a string to a sysfs file.
+ */
+static int write_sys(const char *path, const char *value)
+{
+ int fd;
+ ssize_t ret;
+
+ fd = open(path, O_WRONLY);
+ if (fd < 0) {
+ fprintf(stderr, "open %s failed: %s\n", path, strerror(errno));
+ return -1;
+ }
+ ret = write(fd, value, strlen(value));
+ close(fd);
+ if (ret != (ssize_t)strlen(value)) {
+ fprintf(stderr, "write %s failed: %s\n", path, strerror(errno));
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * Read an integer from a sysfs file.
+ */
+static int read_sys_int(const char *path)
+{
+ FILE *fp;
+ int val;
+
+ fp = fopen(path, "r");
+ if (!fp) {
+ fprintf(stderr, "fopen %s failed: %s\n", path, strerror(errno));
+ return -1;
+ }
+ if (fscanf(fp, "%d", &val) != 1) {
+ fprintf(stderr, "fscanf %s failed\n", path);
+ fclose(fp);
+ return -1;
+ }
+ fclose(fp);
+ return val;
+}
+
+/*
+ * Get KSM full scan count.
+ */
+static int ksm_get_full_scans(void)
+{
+ return read_sys_int(KSM_FULL_SCANS_PATH);
+}
+
+/*
+ * Wait for KSM to complete at least two full scans, which ensures that
+ * merging has had a chance to happen.
+ */
+static void wait_ksm_merge(void)
+{
+ int start_scans, end_scans;
+ int max_wait = 60;
+ int waited = 0;
+
+ start_scans = ksm_get_full_scans();
+ if (start_scans < 0) {
+ fprintf(stderr, "Failed to read initial full_scans\n");
+ return;
+ }
+
+ /* Make sure KSM is running */
+ if (write_sys(KSM_RUN_PATH, "1") < 0) {
+ fprintf(stderr, "Failed to start KSM\n");
+ return;
+ }
+
+ do {
+ sleep(1);
+ end_scans = ksm_get_full_scans();
+ if (end_scans < 0) {
+ fprintf(stderr, "Failed to read full_scans\n");
+ return;
+ }
+ waited++;
+ if (waited > max_wait) {
+ fprintf(stderr, "Warning: KSM full_scans not increased "
+ "after %d seconds\n", max_wait);
+ break;
+ }
+ } while (end_scans < start_scans + 2);
+}
+
+/*
+ * Enable the rmap_walk tracepoint and clear the trace buffer.
+ */
+static void enable_tracepoint(void)
+{
+ int fd;
+ struct stat st;
+
+ /* Check if tracefs is already accessible */
+ if (stat("/sys/kernel/tracing/trace", &st) != 0) {
+ /* Try to mount tracefs */
+ if (mount("tracefs", "/sys/kernel/tracing", "tracefs", 0, NULL) != 0) {
+ fprintf(stderr, "Warning: Failed to mount tracefs: %s\n",
+ strerror(errno));
+ /* Continue anyway, maybe it's already mounted elsewhere */
+ }
+ }
+
+ if (write_sys(TRACE_ENABLE, "1") < 0)
+ exit(1);
+ /* Truncate the trace file to clear old data */
+ fd = open(TRACE_FILE, O_WRONLY | O_TRUNC);
+ if (fd < 0) {
+ perror("open " TRACE_FILE);
+ exit(1);
+ }
+ close(fd);
+}
+
+/*
+ * Disable the rmap_walk tracepoint.
+ */
+static void disable_tracepoint(void)
+{
+ write_sys(TRACE_ENABLE, "0");
+}
+
+/*
+ * Parse the trace file and collect duration statistics for a given page_type.
+ * Returns 0 on success, -1 if no events found.
+ */
+static int parse_trace_and_print(enum page_type type, unsigned long long *max_ns,
+ unsigned long long *avg_ns, int *count)
+{
+ FILE *fp;
+ char line[1024];
+ unsigned long long duration_ns;
+ unsigned long long max_val = 0;
+ unsigned long long sum = 0;
+ int cnt = 0;
+ const char *type_str = page_type_str(type);
+ char search_str[64];
+
+ snprintf(search_str, sizeof(search_str), "page_type=%s", type_str);
+
+ fp = fopen(TRACE_FILE, "r");
+ if (!fp) {
+ perror("fopen " TRACE_FILE);
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), fp)) {
+ char *dur = strstr(line, "duration_ns=");
+ char *type_match = strstr(line, search_str);
+
+ if (dur && type_match) {
+ char *end;
+
+ dur += 12; /* skip "duration_ns=" */
+ duration_ns = strtoull(dur, &end, 10);
+ if (end != dur) {
+ if (duration_ns > max_val)
+ max_val = duration_ns;
+ sum += duration_ns;
+ cnt++;
+ }
+ }
+ }
+ fclose(fp);
+
+ if (cnt == 0) {
+ printf("No rmap_walk events with page_type=%s found.\n", type_str);
+ return -1;
+ }
+
+ *max_ns = max_val;
+ *avg_ns = sum / cnt;
+ *count = cnt;
+ return 0;
+}
+
+/*
+ * Trigger rmap_walk by moving a single page.
+ * region: pointer to the page (any page in the mapped region).
+ * The function will try to move that page to a different NUMA node.
+ */
+static void trigger_rmap_walk(void *region)
+{
+ int ret, status, cur_node, target_node;
+ void *pages[1];
+ int nodes[1];
+
+ printf("Triggering rmap_walk via move_pages...\n");
+
+ ret = move_pages(0, 1, (void **)&region, NULL, &status, MPOL_MF_MOVE_ALL);
+ if (ret != 0) {
+ perror("Failed to get original numa");
+ exit(1);
+ }
+ cur_node = status;
+
+ for (target_node = 0; target_node <= numa_max_node(); target_node++) {
+ if (numa_bitmask_isbitset(numa_all_nodes_ptr, target_node) &&
+ target_node != cur_node)
+ break;
+ }
+ if (target_node > numa_max_node()) {
+ printf("Couldn't find available numa node for testing\n");
+ exit(1);
+ }
+
+ pages[0] = region;
+ nodes[0] = target_node;
+
+ /*
+ * Note: We ignore per-page errors when ret >= 0, since there is a chance
+ * that ksmd's ksm_get_folio() collides with do_move_page(), which causes
+ * __migrate_folio() to fail the check "folio_ref_count(src) !=
+ * expected_count".
+ */
+ ret = move_pages(0, 1, pages, nodes, &status, MPOL_MF_MOVE_ALL);
+ if (ret < 0)
+ perror("move_pages");
+}
+
+/*
+ * Split a VMA into many small VMAs by changing protection on every other page.
+ * This increases the number of anon_vma_chain entries and makes rmap_walk slower.
+ */
+static void split_vma_with_mprotect(void *addr, size_t size)
+{
+ for (size_t i = 0; i < size / PAGE_SIZE; i++) {
+ if (i % 2 == 0) {
+ if (mprotect(addr + i * PAGE_SIZE, PAGE_SIZE, PROT_READ) < 0) {
+ if (errno != EACCES)
+ perror("mprotect");
+ }
+ }
+ }
+}
+
+/*
+ * Test for KSM pages.
+ */
+static void test_ksm(void)
+{
+ void *region;
+ size_t size = NR_PAGES * PAGE_SIZE;
+ unsigned long long max_ns, avg_ns;
+ int count;
+
+ printf("\n=== Testing KSM pages ===\n");
+
+ /* Stop KSM and set aggressive scan parameters */
+ if (write_sys(KSM_RUN_PATH, "2") < 0)
+ exit(1);
+ if (write_sys(KSM_SLEEP_MS_PATH, "0") < 0 ||
+ write_sys(KSM_PAGES_TO_SCAN, "10000") < 0)
+ exit(1);
+
+ region = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap for KSM");
+ exit(1);
+ }
+ memset(region, TEST_PATTERN, size);
+
+ if (madvise(region, size, MADV_MERGEABLE) != 0) {
+ perror("madvise MADV_MERGEABLE");
+ munmap(region, size);
+ exit(1);
+ }
+
+ /* Start KSM scanner */
+ if (write_sys(KSM_RUN_PATH, "1") < 0) {
+ munmap(region, size);
+ exit(1);
+ }
+
+ split_vma_with_mprotect(region, size);
+
+ /* Wait for full merging */
+ wait_ksm_merge();
+
+ enable_tracepoint();
+ /* Move the page at offset PAGE_SIZE (any page is fine) */
+ trigger_rmap_walk(region + PAGE_SIZE);
+ usleep(100000); /* allow trace to be written */
+ disable_tracepoint();
+
+ if (parse_trace_and_print(PAGE_TYPE_KSM, &max_ns, &avg_ns, &count) == 0) {
+ printf("KSM rmap_walk latency:\n");
+ printf(" Maximum duration: %.2f ms (%.0f ns)\n",
+ max_ns / 1000000.0, (double)max_ns);
+ printf(" Average duration: %.2f ms (%.0f ns)\n",
+ avg_ns / 1000000.0, (double)avg_ns);
+ printf(" Count: %d events\n", count);
+ }
+
+ munmap(region, size);
+ write_sys(KSM_RUN_PATH, "2"); /* stop KSM */
+}
+
+/*
+ * Test for normal anonymous pages.
+ */
+static void test_anon(void)
+{
+ void *region;
+ size_t size = NR_PAGES * PAGE_SIZE;
+ unsigned long long max_ns, avg_ns;
+ int count;
+
+ printf("\n=== Testing anonymous pages ===\n");
+
+ region = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap for anonymous");
+ exit(1);
+ }
+ memset(region, TEST_PATTERN, size);
+
+ split_vma_with_mprotect(region, size);
+
+ enable_tracepoint();
+ trigger_rmap_walk(region + PAGE_SIZE);
+ usleep(100000);
+ disable_tracepoint();
+
+ if (parse_trace_and_print(PAGE_TYPE_ANON, &max_ns, &avg_ns, &count) == 0) {
+ printf("Anonymous page rmap_walk latency:\n");
+ printf(" Maximum duration: %.2f ms (%.0f ns)\n",
+ max_ns / 1000000.0, (double)max_ns);
+ printf(" Average duration: %.2f ms (%.0f ns)\n",
+ avg_ns / 1000000.0, (double)avg_ns);
+ printf(" Count: %d events\n", count);
+ }
+
+ munmap(region, size);
+}
+
+/*
+ * Test for file-backed pages (mmap of a temporary file).
+ */
+static void test_file(void)
+{
+ void *region;
+ size_t size = NR_PAGES * PAGE_SIZE;
+ int fd;
+ char filename[] = "/tmp/rmap_test_file_XXXXXX";
+
+ printf("\n=== Testing file pages ===\n");
+
+ fd = mkstemp(filename);
+ if (fd < 0) {
+ perror("mkstemp");
+ exit(1);
+ }
+ if (ftruncate(fd, size) < 0) {
+ perror("ftruncate");
+ unlink(filename);
+ close(fd);
+ exit(1);
+ }
+
+ region = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap for file");
+ unlink(filename);
+ close(fd);
+ exit(1);
+ }
+ memset(region, TEST_PATTERN, size);
+
+ split_vma_with_mprotect(region, size);
+
+ enable_tracepoint();
+ trigger_rmap_walk(region + PAGE_SIZE);
+ usleep(100000);
+ disable_tracepoint();
+
+ unsigned long long max_ns, avg_ns;
+ int count;
+
+ if (parse_trace_and_print(PAGE_TYPE_FILE, &max_ns, &avg_ns, &count) == 0) {
+ printf("File page rmap_walk latency:\n");
+ printf(" Maximum duration: %.2f ms (%.0f ns)\n",
+ max_ns / 1000000.0, (double)max_ns);
+ printf(" Average duration: %.2f ms (%.0f ns)\n",
+ avg_ns / 1000000.0, (double)avg_ns);
+ printf(" Count: %d events\n", count);
+ }
+
+ munmap(region, size);
+ unlink(filename);
+ close(fd);
+}
+
+int main(void)
+{
+ /* Need root for tracefs and KSM sysfs */
+ if (geteuid() != 0) {
+ fprintf(stderr, "This program must be run as root.\n");
+ exit(1);
+ }
+
+ if (numa_available() < 0)
+ printf("Warning: NUMA not available, move_pages may not work.\n");
+
+ /* Run three tests */
+ test_ksm();
+ test_anon();
+ test_file();
+
+ return 0;
+}
+
--
2.25.1
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v4 3/5] ksm: add vm_pgoff into ksm_rmap_item
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
2026-05-03 12:39 ` [PATCH v4 1/5] mm/rmap: add tracepoint for rmap_walk xu.xin16
2026-05-03 12:42 ` [PATCH v4 2/5] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages xu.xin16
@ 2026-05-03 12:48 ` xu.xin16
2026-05-03 12:50 ` [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-05-03 12:48 UTC (permalink / raw)
To: akpm, david, ljs, hughd; +Cc: hughd, linux-mm, linux-kernel, michel, xu.xin16
From: xu xin <xu.xin16@zte.com.cn>
The reason for adding vm_pgoff to ksm_rmap_item has been discussed in previous
mailing list threads [1][2]. The main purpose is to allow the KSM reverse mapping
to obtain the original VMA's vm_pgoff, so that during anon_vma tree traversal
it can narrow the search to the relevant VMAs instead of scanning the entire
range [0, ULONG_MAX].
To avoid growing ksm_rmap_item when adding vm_pgoff, use the trick David
suggested: a union that groups the members only meaningful on the unstable
tree (oldchecksum and the age information) together with the newly added
vm_pgoff. However, should_skip_rmap_item() in smart scanning needs a slight
modification, since it still reads the age information even when the
rmap_item is stable but the page is no longer a KSM page, a situation that
occurs after COW faults. With the union, the struct stays at 64 bytes.
The setting and resetting of rmap_item->vm_pgoff are similar to rmap_item->anon_vma.
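If desired, the 64-byte claim could also be pinned down at build time; a
hypothetical compile-time check (not part of this patch):

	/* hypothetical: keep ksm_rmap_item within a single cacheline */
	static_assert(sizeof(struct ksm_rmap_item) <= 64,
		      "ksm_rmap_item grew beyond 64 bytes");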
[1] https://lore.kernel.org/all/adTPQSb-qSSHviJN@lucifer/
[2] https://lore.kernel.org/all/202604091806051535BJWZ_FTtdIm3Snk24ei_@zte.com.cn/
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
mm/ksm.c | 41 ++++++++++++++++++++++++++++++++++-------
1 file changed, 34 insertions(+), 7 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 7d5b76478f0b..0299a53ba7c9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -195,22 +195,28 @@ struct ksm_stable_node {
* @node: rb node of this rmap_item in the unstable tree
* @head: pointer to stable_node heading this list in the stable tree
* @hlist: link into hlist of rmap_items hanging off that stable_node
- * @age: number of scan iterations since creation
- * @remaining_skips: how many scans to skip
+ * @age: number of scan iterations since creation (unstable node)
+ * @remaining_skips: how many scans to skip (unstable node)
+ * @vm_pgoff: vm_pgoff of the original VMA where the page is mapped (stable node)
*/
struct ksm_rmap_item {
struct ksm_rmap_item *rmap_list;
union {
- struct anon_vma *anon_vma; /* when stable */
+ struct anon_vma *anon_vma; /* for reverse mapping, when stable */
#ifdef CONFIG_NUMA
int nid; /* when node of unstable tree */
#endif
};
struct mm_struct *mm;
unsigned long address; /* + low bits used for flags below */
- unsigned int oldchecksum; /* when unstable */
- rmap_age_t age;
- rmap_age_t remaining_skips;
+ union {
+ struct {
+ unsigned int oldchecksum;
+ rmap_age_t age;
+ rmap_age_t remaining_skips;
+ }; /* when unstable */
+ unsigned long vm_pgoff; /* for reverse mapping, when stable */
+ };
union {
struct rb_node node; /* when node of unstable tree */
struct { /* when listed from stable tree */
@@ -776,6 +782,10 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
return vma;
}
+/*
+ * break_cow: actively break the write protection of the VMA. This is called
+ * when the rmap_item has not yet become stable, but the page has been merged.
+ */
static void break_cow(struct ksm_rmap_item *rmap_item)
{
struct mm_struct *mm = rmap_item->mm;
@@ -787,6 +797,8 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
* to undo, we also need to drop a reference to the anon_vma.
*/
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that overlays age-related information. (still unstable) */
+ rmap_item->vm_pgoff = 0;
mmap_read_lock(mm);
vma = find_mergeable_vma(mm, addr);
@@ -899,6 +911,8 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
stable_node->rmap_hlist_len--;
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that overlays age-related information. */
+ rmap_item->vm_pgoff = 0;
rmap_item->address &= PAGE_MASK;
cond_resched();
}
@@ -1052,6 +1066,8 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
stable_node->rmap_hlist_len--;
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that overlays age-related information. */
+ rmap_item->vm_pgoff = 0;
rmap_item->head = NULL;
rmap_item->address &= PAGE_MASK;
@@ -1598,8 +1614,15 @@ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item,
/* Unstable nid is in union with stable anon_vma: remove first */
remove_rmap_item_from_tree(rmap_item);
- /* Must get reference to anon_vma while still holding mmap_lock */
+ /*
+ * Must get a reference to the anon_vma while still holding mmap_lock.
+ * We set these two members of the rmap_item here rather than in
+ * stable_tree_append() so that mmap_read_lock does not have to be
+ * taken again: it is already held here, taken for find_mergeable_vma()
+ * before merging.
+ */
rmap_item->anon_vma = vma->anon_vma;
+ rmap_item->vm_pgoff = vma->vm_pgoff;
get_anon_vma(vma->anon_vma);
out:
mmap_read_unlock(mm);
@@ -2458,6 +2481,10 @@ static bool should_skip_rmap_item(struct folio *folio,
if (folio_test_ksm(folio))
return false;
+ /* There is no age information in stable-tree nodes. */
+ if (rmap_item->address & STABLE_FLAG)
+ return false;
+
age = rmap_item->age;
if (age != U8_MAX)
rmap_item->age++;
--
2.25.1
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
` (2 preceding siblings ...)
2026-05-03 12:48 ` [PATCH v4 3/5] ksm: add vm_pgoff into ksm_rmap_item xu.xin16
@ 2026-05-03 12:50 ` xu.xin16
2026-05-03 12:51 ` [PATCH v4 5/5] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
2026-05-03 14:59 ` [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm Andrew Morton
5 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-05-03 12:50 UTC (permalink / raw)
To: akpm, david, ljs, xu.xin16; +Cc: hughd, linux-mm, linux-kernel, michel
From: xu xin <xu.xin16@zte.com.cn>
Problem
=======
When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (even reaching 500ms or
more), which in turn causes upper-layer applications (waiting for the
anon_vma lock) to be blocked for extended periods.
Root Cause
==========
Further investigation revealed that 99.9% of iterations inside the
anon_vma_interval_tree_foreach loop are skipped due to the first check
"if (addr < vma->vm_start || addr >= vma->vm_end)), indicating that a large
number of loop iterations are ineffective. This inefficiency arises because
the pgoff_start and pgoff_end parameters passed to
anon_vma_interval_tree_foreach span the entire address space from 0 to
ULONG_MAX, resulting in very poor loop efficiency.
Solution
========
We cannot rely solely on the anon_vma to locate all PTEs mapping this page;
we also need the original page's pgoff. In fact, the original vma->vm_pgoff
is enough: anon_vma_interval_tree_foreach() essentially iterates to find
candidate VMAs such that the provided pgoff falls within each candidate's
vm_pgoff range:

	vm_pgoff <= pgoff_parameter <= (vm_pgoff + vma_pages(v) - 1)

Fortunately, vm_pgoff was already added to ksm_rmap_item in the previous
patch of this series, so we use it as the pgoff to accelerate the search.
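In other words, a candidate VMA v is visited only when its pgoff interval
overlaps the queried range. Conceptually (a sketch of the interval-tree
condition, not the exact kernel code):

	/* v is visited iff [v->vm_pgoff, v->vm_pgoff + vma_pages(v) - 1]
	 * overlaps the queried range [pgoff_start, pgoff_end] */
	bool visited = v->vm_pgoff <= pgoff_end &&
		       pgoff_start <= v->vm_pgoff + vma_pages(v) - 1;

With pgoff_start == pgoff_end == rmap_item->vm_pgoff, only the handful of
VMAs actually covering the merged page's original offset are walked.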
Performance
===========
In our real embedded Linux environment, the measured metrics were as
follows:
1) Time_ms: Max time for holding anon_vma lock in a single rmap_walk_ksm.
2) Nr_iteration_total: Max number of iterations of the
   anon_vma_interval_tree_foreach loop.
3) Skip_addr_out_of_range: Max number of iterations skipped by the first check
   (vma->vm_start and vma->vm_end) in the anon_vma_interval_tree_foreach loop.
4) Skip_mm_mismatch: Max number of iterations skipped by the second check
   (rmap_item->mm == vma->vm_mm) in the anon_vma_interval_tree_foreach loop.
The result is shown as follows:
Time_ms Nr_iteration_total Skip_addr_out_of_range Skip_mm_mismatch
Before: 228.65 22169 22168 0
After : 0.396 3 0 2
We also provide a rmap testbench: tools/testing/rmap/rmap_benchmark.c
The testing result in QEMU is shown as follows:
Maximum duration Average duration
Before: 705.12 ms (705119858 ns) 532.04 ms (532041586 ns)
After: 1.67 ms (1665917 ns) 1.44 ms (1443784 ns)
Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
mm/ksm.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 0299a53ba7c9..a13184d00759 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3200,6 +3200,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
/* Ignore the stable/unstable/sqnr flags */
const unsigned long addr = rmap_item->address & PAGE_MASK;
+ const unsigned long vm_pgoff = rmap_item->vm_pgoff;
struct anon_vma *anon_vma = rmap_item->anon_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
@@ -3213,8 +3214,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
anon_vma_lock_read(anon_vma);
}
+ /*
+ * Currently KSM folios are order-0 normal pages, so pgoff_end
+ * should be the same as pgoff_start.
+ */
anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
- 0, ULONG_MAX) {
+ vm_pgoff, vm_pgoff) {
cond_resched();
vma = vmac->vma;
--
2.25.1
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v4 5/5] ksm: add mremap selftests for ksm_rmap_walk
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
` (3 preceding siblings ...)
2026-05-03 12:50 ` [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
@ 2026-05-03 12:51 ` xu.xin16
2026-05-03 14:59 ` [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm Andrew Morton
5 siblings, 0 replies; 7+ messages in thread
From: xu.xin16 @ 2026-05-03 12:51 UTC (permalink / raw)
To: david, ljs, akpm, xu.xin16; +Cc: hughd, linux-mm, linux-kernel, michel
From: xu xin <xu.xin16@zte.com.cn>
The existing tools/testing/selftests/mm/rmap.c already has one testcase
for rmap_walk_ksm in TEST_F(migrate, ksm), which makes use of page
migration from one NUMA node to another. However, it lacks the scenario
of mremapped VMAs.

There is one worker process and several checker processes. In the worker,
we call mremap() and then trigger KSM to merge pages before migrating,
which specifically exercises the optimization introduced by the previous
patch ("ksm: Optimize rmap_walk_ksm by passing a suitable address range").
In the checker processes, we just trigger KSM to merge pages in each child
process and check their PFNs.

This test can reproduce the issue that Hugh pointed out at
https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
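A minimal illustration of why the remapped address would be the wrong lookup
key (hypothetical addresses; assumes a 4096-byte page size):

	/* Page B starts at old + 4096, i.e. anon pgoff = vm_pgoff + 1. */
	mremap(old + 4096, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, old);
	/*
	 * B is now mapped at "old", but its anon_vma pgoff is still
	 * vm_pgoff + 1, so keying the interval tree on the current
	 * address >> PAGE_SHIFT would miss it, while the recorded
	 * vma->vm_pgoff still selects the right interval.
	 */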
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
tools/testing/selftests/mm/rmap.c | 79 ++++++++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 38 +++++++++++++
tools/testing/selftests/mm/vm_util.h | 2 +
3 files changed, 119 insertions(+)
diff --git a/tools/testing/selftests/mm/rmap.c b/tools/testing/selftests/mm/rmap.c
index 53f2058b0ef2..fced1f6304ac 100644
--- a/tools/testing/selftests/mm/rmap.c
+++ b/tools/testing/selftests/mm/rmap.c
@@ -430,4 +430,83 @@ TEST_F(migrate, ksm)
propagate_children(_metadata, data);
}
+static void prepare_two_pages(struct global_data *data)
+{
+ /* Allocate exactly 2 pages for the test */
+ data->mapsize = 2 * getpagesize();
+ data->region = mmap(NULL, data->mapsize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (data->region == MAP_FAILED)
+ ksft_exit_fail_perror("mmap failed");
+
+ /* Fill both pages with identical content to encourage KSM merging */
+ memset(data->region, 0x77, data->mapsize);
+}
+
+static int mremap_merge_and_migrate(struct global_data *data)
+{
+ int ret, pagemap_fd;
+ void *old_region = data->region;
+ unsigned long page_sz = getpagesize();
+
+ /*
+ * Mremap the second page to the first page's location (FIXED).
+ * This effectively overwrites the first page, leaving the second page
+ * unmapped. The physical page that originally backed the second page is
+ * now mapped at the first page's virtual address.
+ */
+ data->region = mremap(old_region + page_sz, page_sz, page_sz,
+ MREMAP_MAYMOVE | MREMAP_FIXED, old_region);
+ if (data->region == MAP_FAILED) {
+ ksft_print_msg("mremap failed: %s\n", strerror(errno));
+ return FAIL_ON_CHECK;
+ }
+
+ /* Ensure KSM is active and wait for merging */
+ if (ksm_start() < 0) {
+ ksft_print_msg("KSM start failed\n");
+ return FAIL_ON_CHECK;
+ }
+
+ /* Attempt to migrate the merged KSM page */
+ ret = try_to_move_page(data->region);
+ if (ret != 0) {
+ ksft_print_msg("migration of KSM page after mremap failed\n");
+ return FAIL_ON_CHECK;
+ }
+
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd == -1)
+ return FAIL_ON_WORK;
+ *data->expected_pfn = pagemap_get_pfn(pagemap_fd, data->region);
+ close(pagemap_fd);
+ return 0;
+}
+
+TEST_F(migrate, ksm_and_mremap)
+{
+ struct global_data *data = &self->data;
+ int ret;
+
+ /* Skip if KSM is not available */
+ if (ksm_stop() < 0)
+ SKIP(return, "accessing \"/sys/kernel/mm/ksm/run\" failed");
+ if (ksm_get_full_scans() < 0)
+ SKIP(return, "accessing \"/sys/kernel/mm/ksm/full_scan\" failed");
+
+ ret = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
+ if (ret < 0 && errno == EINVAL)
+ SKIP(return, "PR_SET_MEMORY_MERGE not supported");
+ else if (ret)
+ ksft_exit_fail_perror("PR_SET_MEMORY_MERGE=1 failed");
+
+ /* Assign the three callbacks required by propagate_children */
+ data->do_prepare = prepare_two_pages;
+ data->do_work = mremap_merge_and_migrate;
+ data->do_check = has_same_pfn;
+
+ /* Run the test in a process tree to stress rmap locking */
+ propagate_children(_metadata, data);
+}
+
TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index db94564f4431..a33a4069de7c 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -648,6 +648,44 @@ long ksm_get_self_merging_pages(void)
return strtol(buf, NULL, 10);
}
+long ksm_get_pages_shared(void)
+{
+ int ksm_pages_shared_fd;
+ char buf[10];
+ ssize_t ret;
+
+ ksm_pages_shared_fd = open("/sys/kernel/mm/ksm/pages_shared", O_RDONLY);
+ if (ksm_pages_shared_fd < 0)
+ return -errno;
+
+ ret = pread(ksm_pages_shared_fd, buf, sizeof(buf) - 1, 0);
+ close(ksm_pages_shared_fd);
+ if (ret <= 0)
+ return -errno;
+ buf[ret] = 0;
+
+ return strtol(buf, NULL, 10);
+}
+
+long ksm_get_pages_sharing(void)
+{
+ int ksm_pages_sharing_fd;
+ char buf[10];
+ ssize_t ret;
+
+ ksm_pages_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY);
+ if (ksm_pages_sharing_fd < 0)
+ return -errno;
+
+ ret = pread(ksm_pages_sharing_fd, buf, sizeof(buf) - 1, 0);
+ close(ksm_pages_sharing_fd);
+ if (ret <= 0)
+ return -errno;
+ buf[ret] = 0;
+
+ return strtol(buf, NULL, 10);
+}
+
long ksm_get_full_scans(void)
{
int ksm_full_scans_fd;
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 1a07305ceff4..3b40727c3f1f 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -151,6 +151,8 @@ void *sys_mremap(void *old_address, unsigned long old_size,
long ksm_get_self_zero_pages(void);
long ksm_get_self_merging_pages(void);
+long ksm_get_pages_shared(void);
+long ksm_get_pages_sharing(void);
long ksm_get_full_scans(void);
int ksm_use_zero_pages(void);
int ksm_start(void);
--
2.25.1
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
` (4 preceding siblings ...)
2026-05-03 12:51 ` [PATCH v4 5/5] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
@ 2026-05-03 14:59 ` Andrew Morton
5 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2026-05-03 14:59 UTC (permalink / raw)
To: xu.xin16; +Cc: david, ljs, hughd, linux-mm, linux-kernel, michel
On Sun, 3 May 2026 20:35:38 +0800 (CST) <xu.xin16@zte.com.cn> wrote:
> Deep investigation revealed that rmap_walk_ksm's 99.9% of iterations inside
> the anon_vma_interval_tree_foreach loop are skipped due to the first check
> "if (addr < vma->vm_start || addr >= vma->vm_end)), indicating that a large
> number of loop iterations are ineffective. This inefficiency arises because
> the pgoff_start and pgoff_end parameters passed to
> anon_vma_interval_tree_foreach span the entire address space from 0 to
> ULONG_MAX, resulting in very poor loop efficiency.
>
> An initial immature thought was using the "rmap_item->address >> PAGE_SHIFT"
> to be the pgoff passed into anon_vma_interval_tree_foreach(). But this is
> flawed because when a range has been mremap-moved, its anon folio
> indexes and anon_vma pgoff correspond to the original user address,
> not to the current user address, which was pointed out at:
>
> https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
>
> According to the implementation of anon_vma_interval_tree_foreach —
> it essentially iterates to find a suitable VMA such that the provided pgoff falls
> within the VMA's range [vm_pgoff, vm_pgoff + vma_pages(v) - 1].
>
> So the solution is to add vm_pgoff field in ksm_rmap_item and use vm_pgoff instead of
> address >> PAGE_SHIFT.
Thanks for pushing ahead with this.
Regarding the [4/5] changelog: I don't think I understand how much
effect this change has upon real-world workloads. Are you able to
clarify that? "How useful is this change to Linux users".
AI review had a lot to say:
https://sashiko.dev/#/patchset/20260503203538194jFwVGloy43M1F3sQGaFt7@zte.com.cn
Human review was wondering how much overhead [1/5] would add. I do
note that it adds overhead even when CONFIG_TRACING=n - the rmap.o text
segment gets a few hundred bytes larger and there's additional runtime
overhead.
^ permalink raw reply [flat|nested] 7+ messages in thread