* [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
@ 2026-05-22 2:52 xu.xin16
2026-05-22 2:54 ` [PATCH v6 1/6] mm/rmap: add tracepoint for rmap_walk xu.xin16
` (6 more replies)
0 siblings, 7 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 2:52 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when many unrelated VMAs share a
single anon_vma.
Two key highlights:
1. Lock hold time drops from >500ms to <2ms
- In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
anon_vma lock hold time during KSM rmap walk went from 705ms
down to 1.67ms (max) and 1.44ms (avg).
2. Real user impact
- The anon_vma lock is also acquired by page faults, reclaim,
migration, compaction, mlock, exit_mmap, and cgroup accounting.
- A long hold due to inefficient rmap walks stalls application
threads, causing latency spikes, reduced throughput, or even
container timeouts.
- The problem occurs even without fork() – VMA splitting (e.g.,
via mprotect or madvise over time) can create tens of thousands
of VMAs all attached to the same anon_vma.
Real-world examples:
- JVM / Go runtime: These use mmap for heap regions and later call
mprotect(PROT_NONE) for garbage collection barriers or guard pages,
splitting the original VMA into thousands of small pieces over time.
- Database engines (MySQL, PostgreSQL): Large shared memory buffers
or anonymous mappings are managed with madvise(MADV_DONTNEED) to release
specific pages, which also splits VMAs.
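The splitting effect described above can be observed from userspace. The following is a minimal sketch, not part of the series: it maps an anonymous region, touches it, makes every other page read-only, and reports how many entries /proc/self/maps gained. The helper names count_vmas() and vmas_gained_by_split() are hypothetical, and the sketch only counts VMAs; it does not inspect anon_vma sharing.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the mappings (one line each) listed in /proc/self/maps. */
static long count_vmas(void)
{
	FILE *fp = fopen("/proc/self/maps", "r");
	char line[512];
	long n = 0;

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp))
		if (strchr(line, '\n'))
			n++;
	fclose(fp);
	return n;
}

/*
 * Map nr_pages of anonymous memory, touch it, then make every other
 * page read-only. Returns how many VMAs the process gained, or -1.
 */
static long vmas_gained_by_split(size_t nr_pages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t size = nr_pages * page;
	long before, after;
	char *region;

	region = mmap(NULL, size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return -1;
	memset(region, 0xaa, size);	/* fault the pages in before splitting */

	before = count_vmas();
	for (size_t i = 0; i < nr_pages; i += 2)
		if (mprotect(region + i * page, page, PROT_READ) < 0)
			break;
	after = count_vmas();

	munmap(region, size);
	return (before < 0 || after < 0) ? -1 : after - before;
}
```

Calling vmas_gained_by_split(64) should report a gain of roughly nr_pages - 1 on a typical Linux system, since adjacent pages with different protections cannot stay in one VMA.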
* Why the benchmark numbers are realistic: We observed ~20,000 VMAs sharing
one anon_vma on a production system running a Java application with KSM
enabled. The lock hold time before the patch was measured at 228 ms (max)
during rmap walks triggered by memory compaction and page migration.
The benchmark reproduces that VMA count and lock‑hold behavior in a
controlled environment.
For systems that do not have thousands of VMAs per anon_vma, the
patch adds negligible overhead (a single pgoff comparison). For systems
that do suffer from this issue, the improvement is dramatic:
1) Worst-case anon_vma lock hold time drops from hundreds of milliseconds
to under 2 ms.
2) This directly reduces blocking of parallel operations that need the
same lock: page faults, reclaim, migration, compaction, mlock, and
exit_mmap.
End‑users will see lower tail latency (fewer application stalls),
higher throughput under memory pressure, and no more spurious
lockup warnings or container timeouts caused by excessive lock hold
times.
In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock
hold times, which translates directly into a more responsive system.
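To illustrate why narrowing the lookup range matters, here is a simplified sketch that uses a flat array instead of the kernel's anon_vma interval tree. The type mock_interval and both helpers are hypothetical; the sketch only counts how many VMAs a walk would have to visit for a full-range query versus a single-pgoff query.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA's page-offset interval (inclusive). */
struct mock_interval {
	unsigned long pgoff_first;
	unsigned long pgoff_last;
};

/* Count the intervals that overlap the query range [first, last]. */
static size_t count_overlapping(const struct mock_interval *v, size_t nr,
				unsigned long first, unsigned long last)
{
	size_t hits = 0;

	for (size_t i = 0; i < nr; i++)
		if (v[i].pgoff_first <= last && v[i].pgoff_last >= first)
			hits++;
	return hits;
}

/* Build nr single-page intervals at consecutive offsets from base. */
static void fill_single_page_intervals(struct mock_interval *v, size_t nr,
				       unsigned long base)
{
	for (size_t i = 0; i < nr; i++)
		v[i].pgoff_first = v[i].pgoff_last = base + i;
}
```

With 20,000 single-page intervals, a query over [0, ULONG_MAX] matches all of them, while a query over one pgoff matches exactly one. A real interval tree additionally prunes non-matching subtrees, which this flat scan does not model.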
---
Changes in v6:
- Patch 1: Define a single event class once and instantiate the individual
tracepoints with DEFINE_EVENT, as suggested by an AI review:
https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@zte.com.cn
- Patch 2: Three changes, also suggested by the AI review:
(1) Safe event pairing – store the folio and rwc addresses from rmap_walk_start
and match them against the same addresses in rmap_walk_end, eliminating
cross-thread interference.
(2) KSM configuration preservation – save the original KSM settings and restore
them after the KSM test, avoiding persistent changes to system behaviour.
(3) Early unlink to prevent a potential file leak – unlink(filename) is called
immediately after mkstemp, so the temporary file is automatically removed
even if the program crashes early.
- Patch 3: a separate, standalone patch to update the MAINTAINERS file.
Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
description (real workloads, lock contention examples,
VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
of multi-process), as suggested by David.
Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)
Changes in v3:
- Fix some typos in commit description
- Replace "pgoff_start" and 'pgoff_end' by 'pgoff'.
Changes in v2:
- Use const variable to initialize 'addr' "pgoff_start" and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)
xu xin (6):
mm/rmap: add tracepoint for rmap_walk
tools/testing: add rmap walk latency benchmark for KSM, anonymous and
file pages
MAINTAINERS: add myself as reviewer for rmap section
ksm: add pgoff into ksm_rmap_item
ksm: Optimize rmap_walk_ksm by passing a suitable address range
ksm: add mremap selftests for ksm_rmap_walk
MAINTAINERS | 3 +
include/trace/events/rmap.h | 67 ++++
mm/ksm.c | 48 ++-
mm/rmap.c | 9 +
tools/testing/rmap/Makefile | 11 +
tools/testing/rmap/rmap_benchmark.c | 461 +++++++++++++++++++++++++++
tools/testing/selftests/mm/rmap.c | 76 +++++
tools/testing/selftests/mm/vm_util.c | 38 +++
tools/testing/selftests/mm/vm_util.h | 2 +
9 files changed, 707 insertions(+), 8 deletions(-)
create mode 100644 include/trace/events/rmap.h
create mode 100644 tools/testing/rmap/Makefile
create mode 100644 tools/testing/rmap/rmap_benchmark.c
--
2.25.1
* [PATCH v6 1/6] mm/rmap: add tracepoint for rmap_walk
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
@ 2026-05-22 2:54 ` xu.xin16
2026-05-22 2:56 ` [PATCH v6 2/6] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages xu.xin16
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 2:54 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
Add trace_rmap_walk_start() and trace_rmap_walk_end() to bracket
reverse mapping walks. Unlike manual clock sampling, these
tracepoints record no timestamp; latency can be computed offline
by tools (e.g., perf, trace-cmd) using the event timestamps.
When tracepoints are disabled, the only cost is a static branch
check (no clock read, no duration calculation), making them
suitable for production use.
The information (folio type, locked state) helps diagnose
performance issues in KSM, anonymous, and file-backed rmap walks.
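As an illustration of the offline computation, the sketch below extracts the timestamp from one line of the tracefs "trace" file and lets the caller subtract start from end. It assumes the default ftrace text layout, in which the timestamp token "<sec>.<usec>:" directly precedes the event name; the helper name rmap_event_ts_us() is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return the timestamp of one rmap_walk trace line in microseconds,
 * or 0 if the line does not look like an rmap_walk event.
 * Assumes a "<sec>.<6-digit usec>: rmap_walk_..." layout.
 */
static unsigned long long rmap_event_ts_us(const char *line)
{
	const char *ev = strstr(line, ": rmap_walk_");
	const char *p, *frac;
	unsigned long long sec, usec;
	char *end;

	if (!ev)
		return 0;

	/* Walk back from the colon to the start of the timestamp token. */
	p = ev;
	while (p > line && p[-1] != ' ' && p[-1] != '\t')
		p--;

	sec = strtoull(p, &end, 10);
	if (*end != '.')
		return 0;
	frac = end + 1;
	usec = strtoull(frac, &end, 10);
	if (end != ev || end - frac != 6)
		return 0;

	return sec * 1000000ULL + usec;
}
```

Parsing seconds and microseconds as integers avoids any rounding from a floating-point conversion. The latency of one walk is then rmap_event_ts_us(end_line) - rmap_event_ts_us(start_line).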
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
include/trace/events/rmap.h | 67 +++++++++++++++++++++++++++++++++++++
mm/rmap.c | 9 +++++
2 files changed, 76 insertions(+)
create mode 100644 include/trace/events/rmap.h
diff --git a/include/trace/events/rmap.h b/include/trace/events/rmap.h
new file mode 100644
index 000000000000..55a319ba6235
--- /dev/null
+++ b/include/trace/events/rmap.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rmap
+
+#if !defined(_TRACE_RMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RMAP_H
+
+#include <linux/tracepoint.h>
+#include <linux/rmap.h>
+
+#define GET_RMAP_PAGE_TYPE(folio) (folio_test_ksm(folio) ? "ksm" : \
+ (folio_test_anon(folio) ? "anon" : "file"))
+
+/**
+ * rmap_walk_template - called for start / stop of rmap_walk.
+ */
+DECLARE_EVENT_CLASS(rmap_walk_template,
+
+ TP_PROTO(struct folio *folio, struct rmap_walk_control *rwc, bool locked),
+
+ TP_ARGS(folio, rwc, locked),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, folio_addr)
+ __field(unsigned long, rwc_addr)
+ __string(page_type, GET_RMAP_PAGE_TYPE(folio))
+ __field(bool, locked)
+ ),
+
+ TP_fast_assign(
+ __entry->folio_addr = (unsigned long)folio;
+ __entry->rwc_addr = (unsigned long)rwc;
+ __assign_str(page_type);
+ __entry->locked = locked;
+ ),
+
+ TP_printk("folio=%p rwc=%p page_type=%s locked=%s",
+ (void *)(unsigned long)__entry->folio_addr,
+ (void *)(unsigned long)__entry->rwc_addr,
+ __get_str(page_type),
+ __entry->locked ? "true" : "false")
+
+);
+
+/**
+ * rmap_walk_start - called before a folio is rmapped.
+ */
+DEFINE_EVENT(rmap_walk_template, rmap_walk_start,
+
+ TP_PROTO(struct folio *folio, struct rmap_walk_control *rwc, bool locked),
+
+ TP_ARGS(folio, rwc, locked)
+);
+
+DEFINE_EVENT(rmap_walk_template, rmap_walk_end,
+
+ TP_PROTO(struct folio *folio, struct rmap_walk_control *rwc, bool locked),
+
+ TP_ARGS(folio, rwc, locked)
+);
+
+
+#endif /* _TRACE_RMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..52f795f768e1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -80,6 +80,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/migrate.h>
+#include <trace/events/rmap.h>
#include "internal.h"
#include "swap.h"
@@ -3098,23 +3099,31 @@ static void rmap_walk_file(struct folio *folio,
void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc)
{
+ trace_rmap_walk_start(folio, rwc, false);
+
if (unlikely(folio_test_ksm(folio)))
rmap_walk_ksm(folio, rwc);
else if (folio_test_anon(folio))
rmap_walk_anon(folio, rwc, false);
else
rmap_walk_file(folio, rwc, false);
+
+ trace_rmap_walk_end(folio, rwc, false);
}
/* Like rmap_walk, but caller holds relevant rmap lock */
void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc)
{
+ trace_rmap_walk_start(folio, rwc, true);
+
/* no ksm support for now */
VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio);
if (folio_test_anon(folio))
rmap_walk_anon(folio, rwc, true);
else
rmap_walk_file(folio, rwc, true);
+
+ trace_rmap_walk_end(folio, rwc, true);
}
#ifdef CONFIG_HUGETLB_PAGE
--
2.25.1
* [PATCH v6 2/6] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
2026-05-22 2:54 ` [PATCH v6 1/6] mm/rmap: add tracepoint for rmap_walk xu.xin16
@ 2026-05-22 2:56 ` xu.xin16
2026-05-22 2:57 ` [PATCH v6 3/6] MAINTAINERS: add myself as reviewer for rmap section xu.xin16
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 2:56 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
Add a new benchmark that measures rmap_walk latency under controlled
conditions. The test creates a large region (20,000 pages by default),
optionally splits the VMA into many small VMAs by mprotect(PROT_READ)
on every other page, then triggers rmap_walk via move_pages().
The existing rmap_walk tracepoints (events/rmap/rmap_walk_start and
events/rmap/rmap_walk_end) are used to collect duration for events with
page_type=ksm, page_type=anon, and page_type=file.
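The duration collection can be sketched independently of trace parsing. The example below is an illustrative reduction, not the benchmark's code: it pairs already-parsed start and end events on the (folio, rwc) key and accumulates the maximum, sum, and pair count. The types walk_event and walk_stats and the function pair_walk_events() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 128

struct walk_event {
	bool is_start;			/* true: rmap_walk_start, false: rmap_walk_end */
	unsigned long long ts_us;	/* event timestamp in microseconds */
	unsigned long folio;		/* folio address from the trace line */
	unsigned long rwc;		/* rmap_walk_control address */
};

struct walk_stats {
	unsigned long long max_us;
	unsigned long long sum_us;
	int pairs;
};

/* Pair each end event with the pending start that has the same key. */
static struct walk_stats pair_walk_events(const struct walk_event *ev, size_t nr)
{
	struct walk_event pending[MAX_PENDING];
	size_t npending = 0;
	struct walk_stats st = { 0, 0, 0 };

	for (size_t i = 0; i < nr; i++) {
		if (ev[i].is_start) {
			if (npending < MAX_PENDING)
				pending[npending++] = ev[i];
			continue;
		}
		for (size_t j = 0; j < npending; j++) {
			unsigned long long delta;

			if (pending[j].folio != ev[i].folio ||
			    pending[j].rwc != ev[i].rwc)
				continue;
			delta = ev[i].ts_us - pending[j].ts_us;
			if (delta > st.max_us)
				st.max_us = delta;
			st.sum_us += delta;
			st.pairs++;
			/* Drop the matched entry by moving the last one here. */
			pending[j] = pending[--npending];
			break;
		}
	}
	return st;
}
```

End events without a matching start are ignored, so walks that began before tracing was enabled do not distort the statistics. The average is sum_us / pairs when pairs is non-zero.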
Three separate test cases are run:
- KSM pages: allocate an anonymous region, fill with identical data,
mark MADV_MERGEABLE, wait for KSM to merge all pages (by polling
/sys/kernel/mm/ksm/full_scans), then trigger migration.
- Anonymous pages: similar but without KSM merging.
- File pages: mmap a temporary file with shared mapping and fill with
identical data.
For each test, the program prints the number of captured events and
the maximum / average latency in milliseconds.
This benchmark helps developers evaluate optimizations in the reverse
mapping code, such as limiting max_page_sharing or improving tree
traversal efficiency.
Usage (must be run as root):
cd tools/testing/rmap/ && make
sudo ./rmap_benchmark
=== Testing KSM pages ===
Triggering rmap_walk via move_pages...
KSM rmap_walk latency:
Maximum duration: 705.12 ms (705119 us)
Average duration: 532.04 ms (532041 us)
Count: 4 events
=== Testing anonymous pages ===
Triggering rmap_walk via move_pages...
Anonymous page rmap_walk latency:
Maximum duration: 0.07 ms (69 us)
Average duration: 0.05 ms (48 us)
Count: 2 events
=== Testing file pages ===
Triggering rmap_walk via move_pages...
File page rmap_walk latency:
Maximum duration: 0.07 ms (67 us)
Average duration: 0.03 ms (30 us)
Count: 4 events
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
tools/testing/rmap/Makefile | 11 +
tools/testing/rmap/rmap_benchmark.c | 461 ++++++++++++++++++++++++++++
2 files changed, 472 insertions(+)
create mode 100644 tools/testing/rmap/Makefile
create mode 100644 tools/testing/rmap/rmap_benchmark.c
diff --git a/tools/testing/rmap/Makefile b/tools/testing/rmap/Makefile
new file mode 100644
index 000000000000..200bd364cafb
--- /dev/null
+++ b/tools/testing/rmap/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+CC := $(CROSS_COMPILE)gcc
+
+PROGS := rmap_benchmark
+
+all: $(PROGS)
+
+rmap_benchmark: LDLIBS = -lnuma
+
+clean:
+ rm -fr $(PROGS)
diff --git a/tools/testing/rmap/rmap_benchmark.c b/tools/testing/rmap/rmap_benchmark.c
new file mode 100644
index 000000000000..b163f4d6aec3
--- /dev/null
+++ b/tools/testing/rmap/rmap_benchmark.c
@@ -0,0 +1,461 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Reverse mapping latency test for KSM, anonymous and file pages
+ *
+ * This program creates a large number of pages (KSM merged, normal anonymous,
+ * or file mapped), splits the VMA into many small VMAs via mprotect,
+ * triggers rmap_walk by move_pages(), and collects latency data from the
+ * tracepoints 'rmap_walk_start' and 'rmap_walk_end' (offline timestamp diff).
+ *
+ * Usage: must be run as root (to access tracefs and KSM sysfs).
+ *
+ * Copyright 2026, ZTE Corp.
+ *
+ * Author(s): Xu Xin <xu.xin16@zte.com.cn>
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/mount.h>
+#include <numaif.h>
+#include <numa.h>
+#include <time.h>
+#include <ctype.h>
+
+/* Page size and test parameters */
+int page_size;
+#define NR_PAGES 20000 /* Number of virtual pages */
+#define TEST_PATTERN 0xaa
+
+/* KSM sysfs paths */
+#define KSM_RUN_PATH "/sys/kernel/mm/ksm/run"
+#define KSM_SLEEP_MS_PATH "/sys/kernel/mm/ksm/sleep_millisecs"
+#define KSM_PAGES_TO_SCAN "/sys/kernel/mm/ksm/pages_to_scan"
+#define KSM_FULL_SCANS_PATH "/sys/kernel/mm/ksm/full_scans"
+
+/* Tracepoint control paths - enable all events under rmap */
+#define TRACE_ENABLE "/sys/kernel/tracing/events/rmap/enable"
+#define TRACE_FILE "/sys/kernel/tracing/trace"
+
+enum page_type {
+ PAGE_TYPE_KSM,
+ PAGE_TYPE_ANON,
+ PAGE_TYPE_FILE,
+};
+
+static const char *page_type_str(enum page_type type)
+{
+ switch (type) {
+ case PAGE_TYPE_KSM: return "ksm";
+ case PAGE_TYPE_ANON: return "anon";
+ case PAGE_TYPE_FILE: return "file";
+ default: return "unknown";
+ }
+}
+
+/* Helper: read/write sysfs */
+static int write_sys(const char *path, const char *value)
+{
+ int fd = open(path, O_WRONLY);
+ if (fd < 0) {
+ fprintf(stderr, "open %s failed: %s\n", path, strerror(errno));
+ return -1;
+ }
+ ssize_t ret = write(fd, value, strlen(value));
+ close(fd);
+ if (ret != (ssize_t)strlen(value)) {
+ fprintf(stderr, "write %s failed: %s\n", path, strerror(errno));
+ return -1;
+ }
+ return 0;
+}
+
+static int read_sys_int(const char *path, int *val)
+{
+ FILE *fp = fopen(path, "r");
+ if (!fp)
+ return -1;
+ if (fscanf(fp, "%d", val) != 1) {
+ fclose(fp);
+ return -1;
+ }
+ fclose(fp);
+ return 0;
+}
+
+/* KSM full scan count */
+static int ksm_get_full_scans(void)
+{
+ int val;
+ if (read_sys_int(KSM_FULL_SCANS_PATH, &val))
+ return -1;
+ return val;
+}
+
+/* Wait for KSM full scans */
+static void wait_ksm_merge(void)
+{
+ int start_scans, end_scans;
+ int max_wait = 60, waited = 0;
+
+ start_scans = ksm_get_full_scans();
+ if (start_scans < 0) {
+ fprintf(stderr, "Failed to read initial full_scans\n");
+ return;
+ }
+ if (write_sys(KSM_RUN_PATH, "1") < 0) {
+ fprintf(stderr, "Failed to start KSM\n");
+ return;
+ }
+ do {
+ sleep(1);
+ end_scans = ksm_get_full_scans();
+ if (end_scans < 0)
+ return;
+ waited++;
+ if (waited > max_wait) {
+ fprintf(stderr, "Warning: KSM full_scans not increased after %ds\n", max_wait);
+ break;
+ }
+ } while (end_scans < start_scans + 2);
+}
+
+/* Tracepoint enable/disable */
+static void enable_tracepoint(void)
+{
+ struct stat st;
+ if (stat("/sys/kernel/tracing/trace", &st) != 0) {
+ if (mount("tracefs", "/sys/kernel/tracing", "tracefs", 0, NULL) != 0)
+ fprintf(stderr, "Warning: mount tracefs failed: %s\n", strerror(errno));
+ }
+ if (write_sys(TRACE_ENABLE, "1") < 0)
+ exit(1);
+ int fd = open(TRACE_FILE, O_WRONLY | O_TRUNC);
+ if (fd < 0) {
+ perror("open " TRACE_FILE);
+ exit(1);
+ }
+ close(fd);
+}
+
+static void disable_tracepoint(void)
+{
+ write_sys(TRACE_ENABLE, "0");
+}
+
+/* Timestamp extraction (us) */
+static unsigned long long extract_timestamp_us(const char *line)
+{
+ char time_str[32];
+ double ts_sec = 0.0;
+ if (sscanf(line, "%*s %*s %*s %31s", time_str) == 1) {
+ char *colon = strchr(time_str, ':');
+ if (colon) *colon = '\0';
+ ts_sec = strtod(time_str, NULL);
+ }
+ return (unsigned long long)(ts_sec * 1e6);
+}
+
+/* Safe start/end pairing using folio and rwc addresses */
+struct pending_start {
+ unsigned long long ts;
+ unsigned long folio;
+ unsigned long rwc;
+};
+
+static int parse_trace_and_print(enum page_type type, unsigned long long *max_us,
+ unsigned long long *avg_us, int *count)
+{
+ FILE *fp = fopen(TRACE_FILE, "r");
+ if (!fp) {
+ perror("fopen " TRACE_FILE);
+ return -1;
+ }
+
+ char line[1024];
+ struct pending_start pending[128];
+ int pending_cnt = 0;
+ unsigned long long sum = 0, max_val = 0;
+ int pairs = 0;
+ const char *type_str = page_type_str(type);
+ char type_pattern[64];
+ snprintf(type_pattern, sizeof(type_pattern), "page_type=%s", type_str);
+
+ while (fgets(line, sizeof(line), fp)) {
+ if (!strstr(line, type_pattern))
+ continue;
+
+ /* Extract folio and rwc addresses */
+ unsigned long folio = 0, rwc = 0;
+ char *folio_str = strstr(line, "folio=");
+ char *rwc_str = strstr(line, "rwc=");
+ if (folio_str && rwc_str) {
+ folio = strtoul(folio_str + 6, NULL, 16);
+ rwc = strtoul(rwc_str + 4, NULL, 16);
+ } else {
+ continue;
+ }
+
+ if (strstr(line, "rmap_walk_start:")) {
+ if (pending_cnt < 128) {
+ pending[pending_cnt].ts = extract_timestamp_us(line);
+ pending[pending_cnt].folio = folio;
+ pending[pending_cnt].rwc = rwc;
+ pending_cnt++;
+ }
+ } else if (strstr(line, "rmap_walk_end:")) {
+ unsigned long long end_ts = extract_timestamp_us(line);
+ /* Find matching start event */
+ for (int i = 0; i < pending_cnt; i++) {
+ if (pending[i].folio == folio && pending[i].rwc == rwc) {
+ unsigned long long delta = end_ts - pending[i].ts;
+ if (delta > max_val) max_val = delta;
+ sum += delta;
+ pairs++;
+ /* Remove this pending entry */
+ pending[i] = pending[--pending_cnt];
+ break;
+ }
+ }
+ }
+ }
+ fclose(fp);
+
+ if (pairs == 0) {
+ printf("No rmap_walk events with page_type=%s found.\n", type_str);
+ return -1;
+ }
+
+ *max_us = max_val;
+ *avg_us = sum / pairs;
+ *count = pairs;
+ return 0;
+}
+
+/* Trigger rmap_walk via move_pages */
+static void trigger_rmap_walk(void *region)
+{
+ int ret, status, cur_node, target_node;
+ void *pages[1];
+ int nodes[1];
+
+ ret = move_pages(0, 1, (void **)&region, NULL, &status, MPOL_MF_MOVE_ALL);
+ if (ret != 0) {
+ perror("Failed to get original numa");
+ exit(1);
+ }
+ cur_node = status;
+
+ for (target_node = 0; target_node <= numa_max_node(); target_node++) {
+ if (numa_bitmask_isbitset(numa_all_nodes_ptr, target_node) && target_node != cur_node)
+ break;
+ }
+ if (target_node > numa_max_node()) {
+ fprintf(stderr, "No other NUMA node\n");
+ exit(1);
+ }
+
+ pages[0] = region;
+ nodes[0] = target_node;
+ ret = move_pages(0, 1, pages, nodes, &status, MPOL_MF_MOVE_ALL);
+ if (ret < 0)
+ perror("move_pages");
+}
+
+/* Split VMA with mprotect */
+static void split_vma_with_mprotect(void *addr, size_t size)
+{
+ for (size_t i = 0; i < size / page_size; i++) {
+ if (i % 2 == 0) {
+ if (mprotect(addr + i * page_size, page_size, PROT_READ) < 0 && errno != EACCES)
+ perror("mprotect");
+ }
+ }
+}
+
+/* KSM configuration save/restore */
+static struct ksm_config {
+ int run;
+ int sleep_ms;
+ int pages_to_scan;
+} orig_ksm;
+
+static int save_ksm_config(void)
+{
+ if (read_sys_int(KSM_RUN_PATH, &orig_ksm.run) ||
+ read_sys_int(KSM_SLEEP_MS_PATH, &orig_ksm.sleep_ms) ||
+ read_sys_int(KSM_PAGES_TO_SCAN, &orig_ksm.pages_to_scan)) {
+ fprintf(stderr, "Failed to read KSM config\n");
+ return -1;
+ }
+ return 0;
+}
+
+static void restore_ksm_config(void)
+{
+ char buf[32];
+ snprintf(buf, sizeof(buf), "%d", orig_ksm.run);
+ write_sys(KSM_RUN_PATH, buf);
+ snprintf(buf, sizeof(buf), "%d", orig_ksm.sleep_ms);
+ write_sys(KSM_SLEEP_MS_PATH, buf);
+ snprintf(buf, sizeof(buf), "%d", orig_ksm.pages_to_scan);
+ write_sys(KSM_PAGES_TO_SCAN, buf);
+}
+
+/* KSM test */
+static void test_ksm(void)
+{
+ size_t size = NR_PAGES * page_size;
+ unsigned long long max_us, avg_us;
+ int count;
+
+ if (save_ksm_config() < 0) {
+ printf("KSM not available, skip KSM test.\n");
+ return;
+ }
+
+ if (write_sys(KSM_RUN_PATH, "2") < 0 ||
+ write_sys(KSM_SLEEP_MS_PATH, "0") < 0 ||
+ write_sys(KSM_PAGES_TO_SCAN, "10000") < 0) {
+ fprintf(stderr, "Failed to configure KSM\n");
+ restore_ksm_config();
+ return;
+ }
+
+ void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap for KSM");
+ restore_ksm_config();
+ return;
+ }
+
+ memset(region, TEST_PATTERN, size);
+ if (madvise(region, size, MADV_MERGEABLE) != 0) {
+ perror("madvise MADV_MERGEABLE");
+ munmap(region, size);
+ restore_ksm_config();
+ return;
+ }
+
+ if (write_sys(KSM_RUN_PATH, "1") < 0) {
+ perror("Start KSM");
+ munmap(region, size);
+ restore_ksm_config();
+ return;
+ }
+
+ /* Construct an anon_vma shared by a number of unrelated VMAs */
+ split_vma_with_mprotect(region, size);
+ wait_ksm_merge();
+
+ /* Trigger one page to be rmapped */
+ enable_tracepoint();
+ trigger_rmap_walk(region + page_size);
+ usleep(100000);
+ disable_tracepoint();
+
+ if (parse_trace_and_print(PAGE_TYPE_KSM, &max_us, &avg_us, &count) == 0) {
+ printf("KSM rmap_walk latency:\n");
+ printf(" Max: %.2f ms (%.0f us)\n", max_us/1000.0, (double)max_us);
+ printf(" Avg: %.2f ms (%.0f us)\n", avg_us/1000.0, (double)avg_us);
+ printf(" Count: %d\n", count);
+ }
+ munmap(region, size);
+ restore_ksm_config();
+}
+
+/* Anonymous test */
+static void test_anon(void)
+{
+ size_t size = NR_PAGES * page_size;
+ unsigned long long max_us, avg_us;
+ int count;
+ void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap anon");
+ return;
+ }
+ memset(region, TEST_PATTERN, size);
+ split_vma_with_mprotect(region, size);
+ enable_tracepoint();
+ trigger_rmap_walk(region + page_size);
+ usleep(100000);
+ disable_tracepoint();
+ if (parse_trace_and_print(PAGE_TYPE_ANON, &max_us, &avg_us, &count) == 0) {
+ printf("Anonymous page rmap_walk latency:\n");
+ printf(" Max: %.2f ms (%.0f us)\n", max_us/1000.0, (double)max_us);
+ printf(" Avg: %.2f ms (%.0f us)\n", avg_us/1000.0, (double)avg_us);
+ printf(" Count: %d\n", count);
+ }
+ munmap(region, size);
+}
+
+/* File-backed test (with early unlink) */
+static void test_file(void)
+{
+ size_t size = NR_PAGES * page_size;
+ char filename[] = "/tmp/rmap_test_file_XXXXXX";
+ int fd = mkstemp(filename);
+ if (fd < 0) {
+ perror("mkstemp");
+ return;
+ }
+ unlink(filename); /* file will vanish when fd closed, even on crash */
+ if (ftruncate(fd, size) < 0) {
+ perror("ftruncate");
+ close(fd);
+ return;
+ }
+ void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (region == MAP_FAILED) {
+ perror("mmap file");
+ close(fd);
+ return;
+ }
+ memset(region, TEST_PATTERN, size);
+ split_vma_with_mprotect(region, size);
+ enable_tracepoint();
+ trigger_rmap_walk(region + page_size);
+ usleep(100000);
+ disable_tracepoint();
+
+ unsigned long long max_us, avg_us;
+ int count;
+ if (parse_trace_and_print(PAGE_TYPE_FILE, &max_us, &avg_us, &count) == 0) {
+ printf("File page rmap_walk latency:\n");
+ printf(" Max: %.2f ms (%.0f us)\n", max_us/1000.0, (double)max_us);
+ printf(" Avg: %.2f ms (%.0f us)\n", avg_us/1000.0, (double)avg_us);
+ printf(" Count: %d\n", count);
+ }
+ munmap(region, size);
+ close(fd);
+}
+
+int main(void)
+{
+ page_size = getpagesize();
+
+ if (geteuid() != 0) {
+ fprintf(stderr, "Must be run as root.\n");
+ return 1;
+ }
+ if (numa_available() < 0) {
+ fprintf(stderr, "NUMA not available.\n");
+ return 1;
+ }
+
+ test_ksm();
+ test_anon();
+ test_file();
+ return 0;
+}
--
2.25.1
* [PATCH v6 3/6] MAINTAINERS: add myself as reviewer for rmap section
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
2026-05-22 2:54 ` [PATCH v6 1/6] mm/rmap: add tracepoint for rmap_walk xu.xin16
2026-05-22 2:56 ` [PATCH v6 2/6] tools/testing: add rmap walk latency benchmark for KSM, anonymous and file pages xu.xin16
@ 2026-05-22 2:57 ` xu.xin16
2026-05-22 3:00 ` [PATCH v6 4/6] ksm: add pgoff into ksm_rmap_item xu.xin16
` (3 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 2:57 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
To help review future changes related to rmap tracing and testing,
add myself as a reviewer (R:) for the rmap entry, and also update
the file patterns to include:
- include/trace/events/rmap.h
- tools/testing/rmap/rmap_benchmark.c
Signed-off-by: Xu Xin <xu.xin16@zte.com.cn>
---
MAINTAINERS | 3 +++
1 file changed, 3 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 8e7268d2f6ec..01cc34cc83a2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17006,11 +17006,14 @@ R: Liam R. Howlett <liam@infradead.org>
R: Vlastimil Babka <vbabka@kernel.org>
R: Harry Yoo <harry@kernel.org>
R: Jann Horn <jannh@google.com>
+R: Xu Xin <xu.xin16@zte.com.cn>
L: linux-mm@kvack.org
S: Maintained
F: include/linux/rmap.h
+F: include/trace/events/rmap.h
F: mm/page_vma_mapped.c
F: mm/rmap.c
+F: tools/testing/rmap/rmap_benchmark.c
F: tools/testing/selftests/mm/rmap.c
MEMORY MANAGEMENT - SECRETMEM
--
2.25.1
* [PATCH v6 4/6] ksm: add pgoff into ksm_rmap_item
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
` (2 preceding siblings ...)
2026-05-22 2:57 ` [PATCH v6 3/6] MAINTAINERS: add myself as reviewer for rmap section xu.xin16
@ 2026-05-22 3:00 ` xu.xin16
2026-05-22 3:01 ` [PATCH v6 5/6] ksm: Optimize rmap_walk_ksm by passing a suitable pgoff xu.xin16
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 3:00 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
The reason for adding pgoff to ksm_rmap_item has been discussed in previous
mailing list threads [1][2]. The main purpose is to let the KSM reverse mapping
obtain the original page's linear page index, so that while traversing the
anon_vma interval tree it can locate only the relevant VMAs instead of scanning
the entire address space [0, ULONG_MAX].
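A userspace sketch may help show why a linear page index is a stable key. The struct mock_vma and both helpers below are hypothetical stand-ins, and the 4 KiB page shift is an assumption; the index formula follows the usual "(address - vm_start) >> PAGE_SHIFT, plus vm_pgoff" computation for non-hugetlb VMAs.

```c
#define MOCK_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct mock_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset that vm_start corresponds to */
};

/* Page index of addr within the object the VMA maps. */
static unsigned long mock_linear_page_index(const struct mock_vma *vma,
					    unsigned long addr)
{
	return ((addr - vma->vm_start) >> MOCK_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Split at addr: vma keeps [vm_start, addr), *tail gets [addr, vm_end). */
static void mock_split_vma(struct mock_vma *vma, unsigned long addr,
			   struct mock_vma *tail)
{
	tail->vm_start = addr;
	tail->vm_end = vma->vm_end;
	tail->vm_pgoff = vma->vm_pgoff +
			 ((addr - vma->vm_start) >> MOCK_PAGE_SHIFT);
	vma->vm_end = addr;
}
```

After a split, the tail VMA's vm_pgoff differs from the original one, but the linear page index of any given address is unchanged. This is the property that makes the per-page index usable as a lookup key when VMAs are split over time.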
To minimize the size impact of adding pgoff to ksm_rmap_item, a trick that
David suggested is to use a union that groups the members related to the
unstable tree together with the newly added linear page index. The members
that are valid only in the unstable tree are oldchecksum and the age information.
However, should_skip_rmap_item() in the smart scanning code needs a slight
modification, since it still reads the age information even when the
rmap_item is in a stable state (the page is not KSM), a situation that occurs
during COW faults. With the union, the structure size stays at 64 bytes.
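The size argument can be checked with a reduced userspace model. The two structs below are hypothetical and keep only the members affected by the change; mock_age_t is assumed to be 8 bits wide because the patch compares age against U8_MAX.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t mock_age_t;	/* stand-in for rmap_age_t; width is an assumption */

/* Affected members before the change: three separate fields. */
struct mock_item_before {
	unsigned long address;
	unsigned int oldchecksum;
	mock_age_t age;
	mock_age_t remaining_skips;
};

/* After the change: the same fields overlaid with pgoff. */
struct mock_item_after {
	unsigned long address;
	union {
		struct {
			unsigned int oldchecksum;
			mock_age_t age;
			mock_age_t remaining_skips;
		};			/* when unstable */
		unsigned long pgoff;	/* when stable */
	};
};

/* Fail the build if overlaying pgoff would grow the structure. */
_Static_assert(sizeof(struct mock_item_after) ==
	       sizeof(struct mock_item_before),
	       "union must not grow the item");

static size_t size_before(void) { return sizeof(struct mock_item_before); }
static size_t size_after(void) { return sizeof(struct mock_item_after); }
```

The unstable-only members occupy 6 bytes plus padding, so an unsigned long fits into the same slot. The static assertion turns that claim into a compile-time check for this model.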
pgoff is stored the same way as rmap_item->anon_vma: it is set in
try_to_merge_with_ksm_page() when the page is merged and becomes a KSM page,
and reset in remove_rmap_item_from_tree(), remove_node_from_stable_tree(),
and break_cow().
To clarify, pgoff is reset in break_cow() for the following reason:
- When a page successfully becomes a KSM page (i.e., after stable_tree_append()
sets STABLE_FLAG), both anon_vma and pgoff are stored and remain valid.
- However, during the merging process there are several failure paths where a
page that was temporarily treated as a KSM page must be reverted back to an
anonymous page. Examples include:
* The second call to try_to_merge_with_ksm_page() fails in
try_to_merge_two_pages().
* stable_tree_insert() fails in cmp_and_merge_page().
In such cases, break_cow() is invoked to break the COW mapping and discard
the KSM state.
Currently, break_cow() already contains a put_anon_vma(rmap_item->anon_vma)
to release the reference taken during the aborted merge. Because 'pgoff' is
logically paired with anon_vma (both are only meaningful when the rmap_item
is in a stable state), it must also be cleared (or reset) in break_cow() to
avoid leaving stale pgoff values that could confuse subsequent rmap walks
or scanning logic.
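The effect of a stale value can be demonstrated with a small userspace model of the union. This is a sketch under the assumption of a 64-bit (LP64) target, where pgoff covers all bytes of the unstable-tree members; the union name and the helper are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the union added to ksm_rmap_item. */
union mock_unstable_or_pgoff {
	struct {
		unsigned int oldchecksum;
		uint8_t age;
		uint8_t remaining_skips;
	};			/* when unstable */
	unsigned long pgoff;	/* when stable */
};

/*
 * Simulate an aborted merge: pgoff was stored while the item was
 * treated as stable, then the item goes back to the unstable state.
 * Returns 1 if the unstable-tree members read back as zero.
 */
static int age_fields_clean_after_abort(unsigned long stored_pgoff, int reset)
{
	union mock_unstable_or_pgoff u = { .pgoff = 0 };

	u.pgoff = stored_pgoff;	/* set during the merge attempt */
	if (reset)
		u.pgoff = 0;	/* the reset this patch adds to break_cow() */

	return u.oldchecksum == 0 && u.age == 0 && u.remaining_skips == 0;
}
```

Without the reset, the bytes of the old pgoff would be reinterpreted as checksum and age values on the next scan. With the reset, the item starts again from zeroed unstable-tree state.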
[1] https://lore.kernel.org/all/adTPQSb-qSSHviJN@lucifer/
[2] https://lore.kernel.org/all/202604091806051535BJWZ_FTtdIm3Snk24ei_@zte.com.cn/
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
mm/ksm.c | 41 ++++++++++++++++++++++++++++++++++-------
1 file changed, 34 insertions(+), 7 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 7d5b76478f0b..4761ca3fa984 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -195,22 +195,28 @@ struct ksm_stable_node {
* @node: rb node of this rmap_item in the unstable tree
* @head: pointer to stable_node heading this list in the stable tree
* @hlist: link into hlist of rmap_items hanging off that stable_node
- * @age: number of scan iterations since creation
- * @remaining_skips: how many scans to skip
+ * @age: number of scan iterations since creation (unstable node)
+ * @remaining_skips: how many scans to skip (unstable node)
+ * @pgoff: pgoff into @anon_vma where the page is mapped (stable tree)
*/
struct ksm_rmap_item {
struct ksm_rmap_item *rmap_list;
union {
- struct anon_vma *anon_vma; /* when stable */
+ struct anon_vma *anon_vma; /* for reverse mapping, when stable */
#ifdef CONFIG_NUMA
int nid; /* when node of unstable tree */
#endif
};
struct mm_struct *mm;
unsigned long address; /* + low bits used for flags below */
- unsigned int oldchecksum; /* when unstable */
- rmap_age_t age;
- rmap_age_t remaining_skips;
+ union {
+ struct {
+ unsigned int oldchecksum;
+ rmap_age_t age;
+ rmap_age_t remaining_skips;
+ }; /* when unstable */
+ unsigned long pgoff; /* for reverse mapping, when stable */
+ };
union {
struct rb_node node; /* when node of unstable tree */
struct { /* when listed from stable tree */
@@ -776,6 +782,10 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
return vma;
}
+/*
+ * break_cow: actively break the write-protect of the VMA. This is called when
+ * rmap_item has not yet become stable, but page has been merged.
+ */
static void break_cow(struct ksm_rmap_item *rmap_item)
{
struct mm_struct *mm = rmap_item->mm;
@@ -787,6 +797,8 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
* to undo, we also need to drop a reference to the anon_vma.
*/
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that might overlay age-related information. (still unstable) */
+ rmap_item->pgoff = 0;
mmap_read_lock(mm);
vma = find_mergeable_vma(mm, addr);
@@ -899,6 +911,8 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
stable_node->rmap_hlist_len--;
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that might overlay age-related information. */
+ rmap_item->pgoff = 0;
rmap_item->address &= PAGE_MASK;
cond_resched();
}
@@ -1052,6 +1066,8 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
stable_node->rmap_hlist_len--;
put_anon_vma(rmap_item->anon_vma);
+ /* Reset pgoff that might overlay age-related information. */
+ rmap_item->pgoff = 0;
rmap_item->head = NULL;
rmap_item->address &= PAGE_MASK;
@@ -1598,8 +1614,15 @@ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item,
/* Unstable nid is in union with stable anon_vma: remove first */
remove_rmap_item_from_tree(rmap_item);
- /* Must get reference to anon_vma while still holding mmap_lock */
+ /*
+ * Must get reference to anon_vma while still holding mmap_lock.
+ * These two rmap_item members are set here instead of in
+ * stable_tree_append(), presumably to avoid taking
+ * mmap_read_lock again: it is already held here, having been
+ * taken for find_mergeable_vma() before merging.
+ */
rmap_item->anon_vma = vma->anon_vma;
+ rmap_item->pgoff = linear_page_index(vma, rmap_item->address);
get_anon_vma(vma->anon_vma);
out:
mmap_read_unlock(mm);
@@ -2458,6 +2481,10 @@ static bool should_skip_rmap_item(struct folio *folio,
if (folio_test_ksm(folio))
return false;
+ /* There is no age information in stable-tree nodes. */
+ if (rmap_item->address & STABLE_FLAG)
+ return false;
+
age = rmap_item->age;
if (age != U8_MAX)
rmap_item->age++;
--
2.25.1
* [PATCH v6 5/6] ksm: Optimize rmap_walk_ksm by passing a suitable pgoff
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
` (3 preceding siblings ...)
2026-05-22 3:00 ` [PATCH v6 4/6] ksm: add pgoff into ksm_rmap_item xu.xin16
@ 2026-05-22 3:01 ` xu.xin16
2026-05-22 3:02 ` [PATCH v6 6/6] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
2026-05-23 3:28 ` [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm Andrew Morton
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 3:01 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
User impact / Why this matters to Linux users
=============================================
When a system runs with KSM enabled and memory becomes tight, KSM pages
may be swapped out or migrated. The kernel then performs a reverse-map
walk via rmap_walk_ksm to locate all page table entries that reference
these pages. If a large number of unrelated VMAs are attached to the
anon_vma associated with such a KSM page, rmap_walk_ksm can become a
severe performance bottleneck. In our embedded test environment, we
observed ~20,000 VMAs sharing one anon_vma without any fork, purely from
VMA splits, which caused rmap_walk_ksm to take 200~700 ms.
When one of those VMAs maps a KSM page, the reverse-map walk for that
page becomes a bottleneck because it holds the anon_vma lock for a long
time. The anon_vma lock is not only used by KSM; it is a core lock
protecting the VMA interval tree and is acquired by many critical memory
operations:
• Page faults: do_anonymous_page(), do_wp_page() (especially during COW)
• Memory reclaim: try_to_unmap()
• Page migration & compaction: migrate_pages(), compact_zone()
• mlock / munlock: mlock_fixup()
• Process exit: exit_mmap() (tearing down VMAs)
• Cgroup memory accounting: mem_cgroup_move_charge()
If one thread holds the anon_vma lock for hundreds of milliseconds
because of an inefficient KSM rmap walk, any other thread that tries to
acquire the same lock (e.g., an application taking a page fault, kswapd
reclaiming pages, or a migration thread) will block. This leads to
stalled application threads, increased latency spikes, and in extreme
cases container timeouts or watchdog triggers.
This patch reduces the worst-case anon_vma lock hold time during KSM
rmap walk from >500 ms to <2 ms, thereby almost eliminating this
source of lock contention and improving system responsiveness under
memory pressure.
Real-world examples:
====================
- JVM / Go runtime: These use mmap for heap regions and later call
mprotect(PROT_NONE) for garbage collection barriers or guard pages,
splitting the original VMA into thousands of small pieces over time.
- Database engines (MySQL, PostgreSQL): Large shared memory buffers
or anonymous mappings are managed with madvise(MADV_DONTNEED) to release
specific pages, which also splits VMAs.
* Why the benchmark numbers are realistic: We observed ~20,000 VMAs sharing
one anon_vma on a production system running a Java application with KSM
enabled. The lock hold time before the patch was measured at 228 ms (max)
during rmap walks triggered by memory compaction and page migration.
The benchmark reproduces that VMA count and lock‑hold behavior in a
controlled environment.
Root Cause
==========
Through local trace analysis, we found that most of the latency of
rmap_walk_ksm occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (500 ms or more), which in
turn blocks upper-layer applications waiting for the anon_vma lock for
extended periods.
Further investigation revealed that 99.9% of iterations inside the
anon_vma_interval_tree_foreach loop are skipped by the first check,
"if (addr < vma->vm_start || addr >= vma->vm_end)", so almost all loop
iterations do no useful work. This inefficiency arises because the
pgoff_start and pgoff_end parameters passed to
anon_vma_interval_tree_foreach span the entire range from 0 to ULONG_MAX,
so every VMA attached to the anon_vma is visited.
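The effect of the query range can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct toy_vma,
toy_layout_init() and count_visited() are hypothetical names, the layout of
equally sized back-to-back VMAs is an assumption, and a linear scan stands in
for the real interval tree. It merely counts how many VMAs satisfy the
overlap condition for a given [first, last] range.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the VMA fields that matter to the lookup. */
struct toy_vma {
	unsigned long vm_pgoff;	/* first page offset covered by the VMA */
	unsigned long nr_pages;	/* length of the VMA in pages */
};

/* Lay out n VMAs of pages_per_vma pages each, back to back from pgoff 0. */
static void toy_layout_init(struct toy_vma *vmas, unsigned long n,
			    unsigned long pages_per_vma)
{
	unsigned long i;

	assert(pages_per_vma > 0);
	for (i = 0; i < n; i++) {
		vmas[i].vm_pgoff = i * pages_per_vma;
		vmas[i].nr_pages = pages_per_vma;
	}
}

/*
 * Count the VMAs whose page-offset range overlaps [first, last]. These are
 * the entries an interval-tree walk over that range would have to visit.
 */
static unsigned long count_visited(const struct toy_vma *vmas, unsigned long n,
				   unsigned long first, unsigned long last)
{
	unsigned long i, visited = 0;

	for (i = 0; i < n; i++) {
		unsigned long start = vmas[i].vm_pgoff;
		unsigned long end = start + vmas[i].nr_pages - 1;

		if (start <= last && end >= first)
			visited++;
	}
	return visited;
}
```

With 20,000 four-page VMAs, count_visited(vmas, 20000, 0, ULONG_MAX) reports
all 20000 entries, while a single-page range such as [12345, 12345] reports
exactly one.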
Solution
========
We cannot rely solely on anon_vma to locate all PTEs mapping this page; we
also need the original page's pgoff. This follows from how
anon_vma_interval_tree_foreach works: it iterates over the VMAs whose
vm_pgoff range contains the provided pgoff:
vm_pgoff <= pgoff (original linear page offset) <= (vm_pgoff + vma_pages(v) - 1)
Fortunately, the previous patch in this series already stores pgoff in
ksm_rmap_item, so we pass it as the lookup range to speed up the search.
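The condition above can be written out as a small user-space sketch. The
names toy_vma, toy_linear_page_index() and toy_vma_covers_pgoff() are
hypothetical, and a 4 KiB page size is assumed; the index computation follows
the arithmetic of the kernel's linear_page_index(), but this is an
illustration rather than the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the relevant vm_area_struct fields. */
struct toy_vma {
	unsigned long vm_start;	/* first virtual address of the VMA */
	unsigned long vm_end;	/* first address past the end of the VMA */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Page offset of @address: offset into the VMA in pages plus vm_pgoff. */
static unsigned long toy_linear_page_index(const struct toy_vma *vma,
					   unsigned long address)
{
	assert(address >= vma->vm_start && address < vma->vm_end);
	return ((address - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Number of pages spanned by the VMA. */
static unsigned long toy_vma_pages(const struct toy_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT;
}

/* vm_pgoff <= pgoff <= vm_pgoff + vma_pages - 1 */
static bool toy_vma_covers_pgoff(const struct toy_vma *vma, unsigned long pgoff)
{
	return vma->vm_pgoff <= pgoff &&
	       pgoff <= vma->vm_pgoff + toy_vma_pages(vma) - 1;
}
```

For a four-page VMA starting at 0x10000000 with vm_pgoff 0x10000, the address
0x10002000 yields pgoff 0x10002, which the VMA covers, while pgoff 0x10004 is
just past its end and is rejected.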
Test results
============
We provide an rmap testbench: tools/testing/rmap/rmap_benchmark.c
The test results in QEMU are as follows:

KSM rmapping   Maximum duration            Average duration
Before:        705.12 ms (705119858 ns)    532.04 ms (532041586 ns)
After:         1.67 ms (1665917 ns)        1.44 ms (1443784 ns)
Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
mm/ksm.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 4761ca3fa984..7fe1a8753309 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3200,6 +3200,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
/* Ignore the stable/unstable/sqnr flags */
const unsigned long addr = rmap_item->address & PAGE_MASK;
+ const unsigned long pgoff = rmap_item->pgoff;
struct anon_vma *anon_vma = rmap_item->anon_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
@@ -3213,8 +3214,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
anon_vma_lock_read(anon_vma);
}
+ /*
+ * Currently KSM folios are order-0 normal pages, so pgoff_end
+ * should be the same as pgoff_start.
+ */
anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
- 0, ULONG_MAX) {
+ pgoff, pgoff) {
cond_resched();
vma = vmac->vma;
--
2.25.1
* [PATCH v6 6/6] ksm: add mremap selftests for ksm_rmap_walk
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
` (4 preceding siblings ...)
2026-05-22 3:01 ` [PATCH v6 5/6] ksm: Optimize rmap_walk_ksm by passing a suitable pgoff xu.xin16
@ 2026-05-22 3:02 ` xu.xin16
2026-05-23 3:28 ` [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm Andrew Morton
6 siblings, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-22 3:02 UTC (permalink / raw)
To: akpm, xu.xin16, david
Cc: chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel, ljs,
corbet
From: xu xin <xu.xin16@zte.com.cn>
The existing tools/testing/selftests/mm/rmap.c already has one test case
for ksm_rmap_walk, TEST_F(migrate, ksm), which migrates a page from one
NUMA node to another. However, it does not cover the scenario of
mremapped VMAs.
Add a test that calls mremap() and then triggers KSM to merge pages before
migrating. It specifically exercises the optimization introduced by the
patch ("ksm: Optimize rmap_walk_ksm by passing a suitable pgoff").
This test can reproduce the issue that Hugh pointed out at
https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/
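One way to see why mremapped VMAs deserve their own test is the simplified
user-space model below. It assumes that moving an already-faulted anonymous
VMA keeps the linear page offsets of its pages while changing their virtual
addresses, as the kernel's mremap path is understood to do; toy_vma,
toy_move() and toy_linear_page_index() are hypothetical names and 4 KiB
pages are assumed. Under that assumption the pgoff of a moved page can no
longer be derived from its address alone.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the relevant vm_area_struct fields. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/*
 * Model of moving [old_addr, old_addr + len) out of @vma to @new_addr.
 * The new VMA inherits the page offset the moved range had in the old one.
 */
static struct toy_vma toy_move(const struct toy_vma *vma, unsigned long old_addr,
			       unsigned long len, unsigned long new_addr)
{
	struct toy_vma moved;

	assert(old_addr >= vma->vm_start && old_addr + len <= vma->vm_end);
	moved.vm_start = new_addr;
	moved.vm_end = new_addr + len;
	moved.vm_pgoff = vma->vm_pgoff +
			 ((old_addr - vma->vm_start) >> TOY_PAGE_SHIFT);
	return moved;
}

/* Same arithmetic as the kernel's linear_page_index(). */
static unsigned long toy_linear_page_index(const struct toy_vma *vma,
					   unsigned long address)
{
	assert(address >= vma->vm_start && address < vma->vm_end);
	return ((address - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}
```

Moving the second half of a 32-page mapping at 0x10000000 (vm_pgoff 0x10000)
onto its first half gives the page now at 0x10000000 a pgoff of 0x10010,
which differs from 0x10000000 >> 12 = 0x10000.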
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
tools/testing/selftests/mm/rmap.c | 76 ++++++++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 38 ++++++++++++++
tools/testing/selftests/mm/vm_util.h | 2 +
3 files changed, 116 insertions(+)
diff --git a/tools/testing/selftests/mm/rmap.c b/tools/testing/selftests/mm/rmap.c
index 53f2058b0ef2..f3eb693872ac 100644
--- a/tools/testing/selftests/mm/rmap.c
+++ b/tools/testing/selftests/mm/rmap.c
@@ -430,4 +430,80 @@ TEST_F(migrate, ksm)
propagate_children(_metadata, data);
}
+static void prepare_pages(struct global_data *data, int nr_pages)
+{
+ /* Allocate exactly nr_pages pages for the test */
+ data->mapsize = nr_pages * getpagesize();
+ data->region = mmap(NULL, data->mapsize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (data->region == MAP_FAILED)
+ ksft_exit_fail_perror("mmap failed");
+
+ /* Fill all pages with identical content to encourage KSM merging */
+ memset(data->region, 0x77, data->mapsize);
+}
+
+static int mremap_merge_and_migrate(struct global_data *data)
+{
+ int ret;
+ void *old_region;
+ int nr_pages = 32;
+
+ prepare_pages(data, nr_pages);
+
+ if (ksm_start() < 0)
+ return FAIL_ON_CHECK;
+
+ old_region = data->region;
+ /*
+ * Mremap the second half of the region onto the first half (FIXED).
+ */
+ data->region = mremap(old_region + data->mapsize / 2, data->mapsize / 2,
+ data->mapsize / 2, MREMAP_MAYMOVE | MREMAP_FIXED, old_region);
+ if (data->region == MAP_FAILED) {
+ ksft_print_msg("mremap failed: %s\n", strerror(errno));
+ return FAIL_ON_CHECK;
+ }
+
+ if (ksm_start() < 0)
+ return FAIL_ON_CHECK;
+
+ /* Attempt to migrate the merged KSM page */
+ ret = try_to_move_page(data->region);
+ if (ret != 0) {
+ ksft_print_msg("migration of KSM page after mremap failed\n");
+ return FAIL_ON_CHECK;
+ }
+
+ /* Ensure ksmd completes at least two more scans to update the KSM counters */
+ if (ksm_start() < 0)
+ return FAIL_ON_CHECK;
+
+ if (ksm_get_pages_shared() != 1 ||
+ ksm_get_pages_sharing() != nr_pages / 2 - 1)
+ return FAIL_ON_CHECK;
+
+ return 0;
+}
+
+TEST_F(migrate, ksm_and_mremap)
+{
+ struct global_data *data = &self->data;
+ int ret;
+
+ /* Skip if KSM is not available */
+ if (ksm_stop() < 0)
+ SKIP(return, "accessing \"/sys/kernel/mm/ksm/run\" failed");
+ if (ksm_get_full_scans() < 0)
+ SKIP(return, "accessing \"/sys/kernel/mm/ksm/full_scans\" failed");
+
+ ret = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
+ if (ret < 0 && errno == EINVAL)
+ SKIP(return, "PR_SET_MEMORY_MERGE not supported");
+ else if (ret)
+ ksft_exit_fail_perror("PR_SET_MEMORY_MERGE=1 failed");
+
+ ASSERT_EQ(mremap_merge_and_migrate(data), 0);
+}
+
TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index db94564f4431..a33a4069de7c 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -648,6 +648,44 @@ long ksm_get_self_merging_pages(void)
return strtol(buf, NULL, 10);
}
+long ksm_get_pages_shared(void)
+{
+ int ksm_pages_shared_fd;
+ char buf[10];
+ ssize_t ret;
+
+ ksm_pages_shared_fd = open("/sys/kernel/mm/ksm/pages_shared", O_RDONLY);
+ if (ksm_pages_shared_fd < 0)
+ return -errno;
+
+ ret = pread(ksm_pages_shared_fd, buf, sizeof(buf) - 1, 0);
+ close(ksm_pages_shared_fd);
+ if (ret <= 0)
+ return -errno;
+ buf[ret] = 0;
+
+ return strtol(buf, NULL, 10);
+}
+
+long ksm_get_pages_sharing(void)
+{
+ int ksm_pages_sharing_fd;
+ char buf[10];
+ ssize_t ret;
+
+ ksm_pages_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY);
+ if (ksm_pages_sharing_fd < 0)
+ return -errno;
+
+ ret = pread(ksm_pages_sharing_fd, buf, sizeof(buf) - 1, 0);
+ close(ksm_pages_sharing_fd);
+ if (ret <= 0)
+ return -errno;
+ buf[ret] = 0;
+
+ return strtol(buf, NULL, 10);
+}
+
long ksm_get_full_scans(void)
{
int ksm_full_scans_fd;
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 1a07305ceff4..3b40727c3f1f 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -151,6 +151,8 @@ void *sys_mremap(void *old_address, unsigned long old_size,
long ksm_get_self_zero_pages(void);
long ksm_get_self_merging_pages(void);
+long ksm_get_pages_shared(void);
+long ksm_get_pages_sharing(void);
long ksm_get_full_scans(void);
int ksm_use_zero_pages(void);
int ksm_start(void);
--
2.25.1
* Re: [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
2026-05-22 2:52 [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm xu.xin16
` (5 preceding siblings ...)
2026-05-22 3:02 ` [PATCH v6 6/6] ksm: add mremap selftests for ksm_rmap_walk xu.xin16
@ 2026-05-23 3:28 ` Andrew Morton
2026-05-23 4:13 ` Re: " xu.xin16
2026-05-29 7:56 ` Lorenzo Stoakes
6 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2026-05-23 3:28 UTC (permalink / raw)
To: xu.xin16
Cc: david, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
ljs, corbet
On Fri, 22 May 2026 10:52:34 +0800 (CST) <xu.xin16@zte.com.cn> wrote:
> This series fixes a severe KSM reverse-mapping performance problem
> that can freeze applications for hundreds of milliseconds under
> memory pressure especially when a lot of unrelated VMAs sharing a
> single anon_vma.
Thanks. I agree that this behaviour is quite obnoxious and getting it
addressed is quite desirable.
So I'd normally merge this in its present unreviewed state in order to
push things along a bit, but the AI review gives me pause:
https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn
Can you please take a look, decide what (if any) changes are needed?
* Re: [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
2026-05-23 3:28 ` [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm Andrew Morton
@ 2026-05-23 4:13 ` xu.xin16
2026-05-29 7:56 ` Lorenzo Stoakes
1 sibling, 0 replies; 10+ messages in thread
From: xu.xin16 @ 2026-05-23 4:13 UTC (permalink / raw)
To: akpm
Cc: david, chengming.zhou, hughd, wang.yaxin, linux-mm, linux-kernel,
ljs, corbet
>> This series fixes a severe KSM reverse-mapping performance problem
>> that can freeze applications for hundreds of milliseconds under
>> memory pressure especially when a lot of unrelated VMAs sharing a
>> single anon_vma.
>
>Thanks. I agree that this behaviour is quite obnoxious and getting it
>addressed is quite desirable.
>
>So I'd normally merge this in its present unreviewed state in order to
>push things along a bit, but the AI review gives me pause:
>
> https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn
>
>Can you please take a look, decide what (if any) changes are needed?
Yes
* Re: [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm
2026-05-23 3:28 ` [PATCH v6 0/6] KSM: performance optimizations for rmap_walk_ksm Andrew Morton
2026-05-23 4:13 ` Re: " xu.xin16
@ 2026-05-29 7:56 ` Lorenzo Stoakes
1 sibling, 0 replies; 10+ messages in thread
From: Lorenzo Stoakes @ 2026-05-29 7:56 UTC (permalink / raw)
To: Andrew Morton
Cc: xu.xin16, david, chengming.zhou, hughd, wang.yaxin, linux-mm,
linux-kernel, corbet
On Fri, May 22, 2026 at 08:28:24PM -0700, Andrew Morton wrote:
> On Fri, 22 May 2026 10:52:34 +0800 (CST) <xu.xin16@zte.com.cn> wrote:
>
> > This series fixes a severe KSM reverse-mapping performance problem
> > that can freeze applications for hundreds of milliseconds under
> > memory pressure especially when a lot of unrelated VMAs sharing a
> > single anon_vma.
>
> Thanks. I agree that this behaviour is quite obnoxious and getting it
> addressed is quite desirable.
>
> So I'd normally merge this in its present unreviewed state in order to
> push things along a bit, but the AI review gives me pause:
>
> https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@zte.com.cn
>
> Can you please take a look, decide what (if any) changes are needed?
I also plan to look through this!
Cheers, Lorenzo