Linux-mm Archive on lore.kernel.org
* [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
@ 2026-05-08 16:22 hawk
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
  0 siblings, 2 replies; 7+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team, hawk

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add tracepoints to the page allocator fast paths that acquire
zone->lock, allowing diagnosis of lock contention in production.

Three tracepoints are introduced:
  kmem:mm_zone_lock_contended - fires when trylock fails (lock is held)
  kmem:mm_zone_locked         - fires on every acquisition
  kmem:mm_zone_lock_unlock    - fires on every release

Each event records the NUMA node, zone name, batch count, and caller.
The mm_zone_locked event additionally records wait_ns: the time spent
spinning when contended, measured via local_clock() with IRQs disabled
to ensure accurate same-CPU timestamps.

The lock/unlock paths are wrapped in __zone_lock()/__zone_unlock()
helpers that use trylock-first to separate the contended and
uncontended cases.  Only the fast paths (free_pcppages_bulk,
rmqueue_bulk, free_one_page) are covered.  Other zone->lock holders
such as compaction, page isolation, and memory hotplug are not
instrumented.

For minimum overhead in production, enable just mm_zone_lock_contended,
which fires only on actual contention.  Enable mm_zone_locked for
wait-time analysis, and add mm_zone_lock_unlock for hold-time
measurement.
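
As a rough sketch of the intended usage (paths and one-liners below are
illustrative, assuming the events show up under the kmem group in
tracefs as in the selftest of patch 2):

  # lowest overhead: only the contended event
  echo 1 > /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/enable
  cat /sys/kernel/tracing/trace_pipe

  # hold-time histogram from locked + unlock; the lock is taken and
  # released on the same CPU with IRQs off, so keying on cpu suffices
  bpftrace -e '
    tracepoint:kmem:mm_zone_locked      { @ts[cpu] = nsecs; }
    tracepoint:kmem:mm_zone_lock_unlock /@ts[cpu]/ {
        @hold_ns = hist(nsecs - @ts[cpu]); delete(@ts[cpu]); }'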

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/trace/events/kmem.h | 101 ++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |  50 +++++++++++++++---
 2 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..870c68c70d57 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -458,6 +458,107 @@ TRACE_EVENT(rss_stat,
 		__print_symbolic(__entry->member, TRACE_MM_PAGES),
 		__entry->size)
 	);
+
+/*
+ * Tracepoints for zone->lock on the page allocator fast paths only.
+ * Other code paths that acquire zone->lock (compaction, isolation,
+ * memory hotplug, vmstat, etc.) are not covered here.
+ *
+ * Three events:
+ *   mm_zone_lock_contended - trylock failed, about to spin
+ *   mm_zone_locked         - lock acquired, includes wait_ns when
+ *                            contended (zero otherwise)
+ *   mm_zone_lock_unlock    - lock released
+ *
+ * For production use with minimum overhead, enable only
+ * mm_zone_lock_contended -- it fires only when trylock detects the
+ * lock is already held.
+ *
+ * For wait-time analysis, enable mm_zone_locked -- its wait_ns
+ * field gives the spin duration directly.  Adding unlock allows
+ * hold-time measurement, at the cost of one event per acquisition.
+ */
+TRACE_EVENT(mm_zone_lock_contended,
+
+	TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+	TP_ARGS(zone, count, caller),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	unsigned long,	caller		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->caller = caller;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, (void *)__entry->caller)
+);
+
+TRACE_EVENT(mm_zone_locked,
+
+	TP_PROTO(struct zone *zone, int count, bool contended,
+		 unsigned long caller, u64 wait_ns),
+
+	TP_ARGS(zone, count, contended, caller, wait_ns),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	bool,		contended	)
+		__field(	unsigned long,	caller		)
+		__field(	u64,		wait_ns		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->contended = contended;
+		__entry->caller = caller;
+		__entry->wait_ns = wait_ns;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d contended=%d caller=%pS wait=%llu ns",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, __entry->contended,
+		  (void *)__entry->caller, __entry->wait_ns)
+);
+
+TRACE_EVENT(mm_zone_lock_unlock,
+
+	TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+	TP_ARGS(zone, count, caller),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	unsigned long,	caller		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->caller = caller;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, (void *)__entry->caller)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..08018e9beab4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -19,6 +19,7 @@
 #include <linux/highmem.h>
 #include <linux/interrupt.h>
 #include <linux/jiffies.h>
+#include <linux/sched/clock.h>
 #include <linux/compiler.h>
 #include <linux/kernel.h>
 #include <linux/kasan.h>
@@ -1447,6 +1448,43 @@ bool free_pages_prepare(struct page *page, unsigned int order)
 	return __free_pages_prepare(page, order, FPI_NONE);
 }
 
+/*
+ * Helper functions for locking zone->lock with tracepoints.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments.  The @count parameter indicates the number
+ * of pages being freed or allocated in the batch operation.
+ *
+ * For minimum overhead, attach to kmem:mm_zone_lock_contended, which
+ * only fires when trylock detects the lock is already held.
+ */
+static inline void
+__zone_lock(struct zone *zone, int count, unsigned long *flags)
+	__acquires(&zone->lock)
+{
+	unsigned long caller = _RET_IP_;
+	u64 wait_start, wait_time = 0;
+	bool contended;
+
+	local_irq_save(*flags);
+	contended = !spin_trylock(&zone->lock);
+	if (contended) {
+		wait_start = local_clock();
+		trace_mm_zone_lock_contended(zone, count, caller);
+		spin_lock(&zone->lock);
+		wait_time = local_clock() - wait_start;
+	}
+	trace_mm_zone_locked(zone, count, contended, caller, wait_time);
+}
+
+static inline void
+__zone_unlock(struct zone *zone, int count, unsigned long *flags)
+	__releases(&zone->lock)
+{
+	trace_mm_zone_lock_unlock(zone, count, _RET_IP_);
+	spin_unlock_irqrestore(&zone->lock, *flags);
+}
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone.
@@ -1469,7 +1507,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	__zone_lock(zone, count, &flags);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -1502,7 +1540,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (count > 0 && !list_empty(list));
 	}
 
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, count, &flags);
 }
 
 /* Split a multi-block free page into its individual pageblocks. */
@@ -1551,7 +1589,7 @@ static void free_one_page(struct zone *zone, struct page *page,
 			return;
 		}
 	} else {
-		spin_lock_irqsave(&zone->lock, flags);
+		__zone_lock(zone, 1 << order, &flags);
 	}
 
 	/* The lock succeeded. Process deferred pages. */
@@ -1569,7 +1607,7 @@ static void free_one_page(struct zone *zone, struct page *page,
 		}
 	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, 1 << order, &flags);
 
 	__count_vm_events(PGFREE, 1 << order);
 }
@@ -2525,7 +2563,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (!spin_trylock_irqsave(&zone->lock, flags))
 			return 0;
 	} else {
-		spin_lock_irqsave(&zone->lock, flags);
+		__zone_lock(zone, count, &flags);
 	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2545,7 +2583,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 */
 		list_add_tail(&page->pcp_list, list);
 	}
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, i, &flags);
 
 	return i;
 }
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
  2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
@ 2026-05-08 16:22 ` hawk
  2026-05-08 20:15   ` David Hildenbrand (Arm)
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
  1 sibling, 1 reply; 7+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team, hawk

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add a selftest to verify the kmem:mm_zone_lock_contended,
kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.

The test has two components:

zone_lock_contention.c - a workload that spawns threads doing rapid
page allocation and freeing to generate zone->lock contention. It
shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
free_pcppages_bulk() and rmqueue_bulk() calls.

test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
exist, have the expected fields, fire under load, and that wait_ns
is populated when contention occurs.
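
To run it standalone (illustrative; requires root and bpftrace; the
script header documents a virtme-ng variant):

  make -C tools/testing/selftests/mm zone_lock_contention
  cd tools/testing/selftests/mm && sudo ./test_zone_lock_tracepoints.sh 5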

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 tools/testing/selftests/mm/Makefile           |   2 +
 .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
 .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
 3 files changed, 380 insertions(+)
 create mode 100755 tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
 create mode 100644 tools/testing/selftests/mm/zone_lock_contention.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..af6cfdf3c8a0 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += zone_lock_contention
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
@@ -173,6 +174,7 @@ TEST_PROGS += ksft_thp.sh
 TEST_PROGS += ksft_userfaultfd.sh
 TEST_PROGS += ksft_vma_merge.sh
 TEST_PROGS += ksft_vmalloc.sh
+TEST_PROGS += test_zone_lock_tracepoints.sh
 
 TEST_FILES := test_vmalloc.sh
 TEST_FILES += test_hmm.sh
diff --git a/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
new file mode 100755
index 000000000000..7fa3dab1f6c5
--- /dev/null
+++ b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
@@ -0,0 +1,212 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# test_zone_lock_tracepoints.sh - Verify mm_zone_lock tracepoints fire
+#
+# Generates zone->lock contention and uses bpftrace to verify that the
+# kmem:mm_zone_lock_contended, kmem:mm_zone_locked, and
+# kmem:mm_zone_lock_unlock tracepoints activate and produce output.
+#
+# Requirements: bpftrace, root privileges, CONFIG_FTRACE=y
+#
+# Usage: ./test_zone_lock_tracepoints.sh [duration_sec]
+#        Default duration: 5 seconds
+#
+# For running in a VM via virtme-ng:
+#   make -C tools/testing/selftests/mm zone_lock_contention
+#   vng --cpus 4 --memory 2G \
+#       --rwdir tools/testing/selftests/mm \
+#       --exec "cd tools/testing/selftests/mm && ./test_zone_lock_tracepoints.sh 5"
+
+set -e
+
+DURATION=${1:-5}
+TESTDIR="$(cd "$(dirname "$0")" && pwd)"
+WORKLOAD="$TESTDIR/zone_lock_contention"
+NR_THREADS=4
+PASS=0
+FAIL=0
+SKIP=0
+
+# --- helpers ---
+
+pass() { echo "PASS: $1"; PASS=$((PASS + 1)); }
+fail() { echo "FAIL: $1"; FAIL=$((FAIL + 1)); }
+skip() { echo "SKIP: $1"; SKIP=$((SKIP + 1)); }
+
+check_root() {
+	if [ "$(id -u)" -ne 0 ]; then
+		echo "ERROR: must run as root"
+		exit 4  # ksft SKIP
+	fi
+}
+
+check_bpftrace() {
+	if ! command -v bpftrace >/dev/null 2>&1; then
+		echo "SKIP: bpftrace not found"
+		exit 4
+	fi
+}
+
+check_workload() {
+	if [ ! -x "$WORKLOAD" ]; then
+		echo "SKIP: $WORKLOAD not found, run 'make -C tools/testing/selftests/mm' first"
+		exit 4
+	fi
+}
+
+check_tracepoint_exists() {
+	local tp="$1"
+	if [ ! -d "/sys/kernel/tracing/events/kmem/$tp" ]; then
+		skip "$tp tracepoint not in kernel"
+		return 1
+	fi
+	return 0
+}
+
+# --- Test 1: verify tracepoints exist in tracefs ---
+
+test_tracepoints_exist() {
+	echo "--- Test 1: tracepoints exist in tracefs ---"
+	for tp in mm_zone_lock_contended mm_zone_locked mm_zone_lock_unlock; do
+		if check_tracepoint_exists "$tp"; then
+			pass "$tp exists"
+		fi
+	done
+}
+
+# --- Test 2: verify format fields ---
+
+test_tracepoint_fields() {
+	echo "--- Test 2: tracepoint format fields ---"
+	local fmt
+
+	if [ -f /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format ]; then
+		fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format)
+		for field in node_id name count caller; do
+			if echo "$fmt" | grep -q "field.*$field"; then
+				pass "mm_zone_lock_contended has field '$field'"
+			else
+				fail "mm_zone_lock_contended missing field '$field'"
+			fi
+		done
+	fi
+
+	if [ -f /sys/kernel/tracing/events/kmem/mm_zone_locked/format ]; then
+		fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_locked/format)
+		for field in node_id name count contended caller wait_ns; do
+			if echo "$fmt" | grep -q "field.*$field"; then
+				pass "mm_zone_locked has field '$field'"
+			else
+				fail "mm_zone_locked missing field '$field'"
+			fi
+		done
+	fi
+}
+
+# --- Test 3: bpftrace counts tracepoint hits under load ---
+
+test_bpftrace_counts() {
+	echo "--- Test 3: bpftrace tracepoint activation under contention ---"
+
+	if ! check_tracepoint_exists mm_zone_locked; then
+		return
+	fi
+
+	local BPFTRACE_OUT
+	BPFTRACE_OUT=$(mktemp /tmp/zone_lock_bt.XXXXXX)
+
+	# bpftrace one-liner: count hits per tracepoint
+	bpftrace -e '
+		tracepoint:kmem:mm_zone_lock_contended { @contended = count(); }
+		tracepoint:kmem:mm_zone_locked          { @locked = count(); }
+		tracepoint:kmem:mm_zone_lock_unlock     { @unlock = count(); }
+	' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+	local BT_PID=$!
+
+	# Wait for bpftrace + workload to finish
+	wait $BT_PID 2>/dev/null || true
+
+	echo "bpftrace output:"
+	cat "$BPFTRACE_OUT"
+
+	# Check that mm_zone_locked fired (it fires on every acquisition)
+	if grep -q '@locked: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_locked tracepoint fired"
+	else
+		fail "mm_zone_locked tracepoint did NOT fire"
+	fi
+
+	# Check that mm_zone_lock_unlock fired
+	if grep -q '@unlock: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_lock_unlock tracepoint fired"
+	else
+		fail "mm_zone_lock_unlock tracepoint did NOT fire"
+	fi
+
+	# contended may or may not fire depending on actual contention
+	if grep -q '@contended: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_lock_contended tracepoint fired (contention detected)"
+	else
+		skip "mm_zone_lock_contended did not fire (no contention observed)"
+	fi
+
+	rm -f "$BPFTRACE_OUT"
+}
+
+# --- Test 4: bpftrace verifies wait_ns > 0 when contended ---
+
+test_wait_ns() {
+	echo "--- Test 4: wait_ns is populated when contended ---"
+
+	if ! check_tracepoint_exists mm_zone_locked; then
+		return
+	fi
+
+	local BPFTRACE_OUT
+	BPFTRACE_OUT=$(mktemp /tmp/zone_lock_wait.XXXXXX)
+
+	bpftrace -e '
+		tracepoint:kmem:mm_zone_locked /args->contended/ {
+			@has_wait_ns = count();
+			@wait_ns = hist(args->wait_ns);
+		}
+	' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+	local BT_PID=$!
+
+	wait $BT_PID 2>/dev/null || true
+
+	echo "bpftrace wait_ns output:"
+	cat "$BPFTRACE_OUT"
+
+	if grep -q '@has_wait_ns: [0-9]' "$BPFTRACE_OUT"; then
+		pass "wait_ns populated on contended acquisitions"
+	else
+		skip "no contended acquisitions observed for wait_ns check"
+	fi
+
+	rm -f "$BPFTRACE_OUT"
+}
+
+# --- Main ---
+
+echo "=== zone->lock tracepoint selftest ==="
+echo "Duration: ${DURATION}s, Threads: ${NR_THREADS}"
+echo
+
+check_root
+check_bpftrace
+check_workload
+
+test_tracepoints_exist
+test_tracepoint_fields
+test_bpftrace_counts
+test_wait_ns
+
+echo
+echo "=== Results: $PASS passed, $FAIL failed, $SKIP skipped ==="
+
+if [ "$FAIL" -gt 0 ]; then
+	exit 1
+fi
+exit 0
diff --git a/tools/testing/selftests/mm/zone_lock_contention.c b/tools/testing/selftests/mm/zone_lock_contention.c
new file mode 100644
index 000000000000..35ddad7670b1
--- /dev/null
+++ b/tools/testing/selftests/mm/zone_lock_contention.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zone_lock_contention.c - Generate zone->lock contention for tracepoint testing
+ *
+ * Spawns multiple threads that rapidly allocate and free pages to force
+ * PCP (per-cpu pageset) drains and refills, which acquire zone->lock via
+ * free_pcppages_bulk() and rmqueue_bulk().
+ *
+ * Raising percpu_pagelist_high_fraction (to 100 here) shrinks the PCP high
+ * limit, causing more frequent zone->lock acquisitions and thus contention.
+ *
+ * Usage: zone_lock_contention [duration_sec] [nr_threads]
+ *        Defaults: 5 seconds, 4 threads
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <time.h>
+
+/* Each thread mmaps/touches/munmaps in a loop to churn pages */
+#define CHUNK_SIZE	(2 * 1024 * 1024)	/* 2 MB per iteration */
+#define PAGE_SZ		4096
+
+static volatile int stop;
+
+struct thread_stats {
+	unsigned long iterations;
+	unsigned long pages_touched;
+};
+
+static void *churn_thread(void *arg)
+{
+	struct thread_stats *stats = arg;
+	unsigned long iter = 0;
+	unsigned long pages = 0;
+
+	while (!stop) {
+		char *p;
+		size_t i;
+
+		p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			break;
+		}
+
+		/* Touch every page to ensure allocation */
+		for (i = 0; i < CHUNK_SIZE; i += PAGE_SZ)
+			p[i] = 1;
+
+		pages += CHUNK_SIZE / PAGE_SZ;
+
+		/* Free pages back - forces PCP drain */
+		munmap(p, CHUNK_SIZE);
+		iter++;
+	}
+
+	stats->iterations = iter;
+	stats->pages_touched = pages;
+	return NULL;
+}
+
+static int write_sysctl(const char *path, const char *val)
+{
+	FILE *f = fopen(path, "w");
+
+	if (!f)
+		return -1;
+	fputs(val, f);
+	fclose(f);
+	return 0;
+}
+
+static int read_sysctl(const char *path, char *buf, size_t len)
+{
+	FILE *f = fopen(path, "r");
+
+	if (!f)
+		return -1;
+	if (!fgets(buf, len, f)) {
+		fclose(f);
+		return -1;
+	}
+	fclose(f);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int duration = 5;
+	int nr_threads = 4;
+	char orig_fraction[32] = "";
+	const char *sysctl_path = "/proc/sys/vm/percpu_pagelist_high_fraction";
+	pthread_t *threads;
+	struct thread_stats *stats;
+	unsigned long total_iter = 0, total_pages = 0;
+	int i;
+
+	if (argc > 1)
+		duration = atoi(argv[1]);
+	if (argc > 2)
+		nr_threads = atoi(argv[2]);
+
+	if (duration <= 0 || nr_threads <= 0) {
+		fprintf(stderr, "Usage: %s [duration_sec] [nr_threads]\n", argv[0]);
+		return 1;
+	}
+
+	printf("zone_lock_contention: %d threads, %d seconds\n",
+	       nr_threads, duration);
+
+	/* Shrink PCP lists to force more zone->lock acquisitions */
+	read_sysctl(sysctl_path, orig_fraction, sizeof(orig_fraction));
+	if (write_sysctl(sysctl_path, "100") < 0)
+		fprintf(stderr, "WARNING: cannot write %s (not root?)\n",
+			sysctl_path);
+	else
+		printf("Set percpu_pagelist_high_fraction=100 (was %s)\n",
+		       orig_fraction);
+
+	threads = calloc(nr_threads, sizeof(*threads));
+	stats = calloc(nr_threads, sizeof(*stats));
+	if (!threads || !stats) {
+		perror("calloc");
+		return 1;
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&threads[i], NULL, churn_thread, &stats[i])) {
+			perror("pthread_create");
+			return 1;
+		}
+	}
+
+	sleep(duration);
+	stop = 1;
+
+	for (i = 0; i < nr_threads; i++) {
+		pthread_join(threads[i], NULL);
+		total_iter += stats[i].iterations;
+		total_pages += stats[i].pages_touched;
+	}
+
+	printf("Total: %lu iterations, %lu pages (%lu MB) churned\n",
+	       total_iter, total_pages,
+	       (total_pages * PAGE_SZ) / (1024 * 1024));
+
+	/* Restore original sysctl */
+	if (orig_fraction[0]) {
+		/* Strip trailing newline */
+		orig_fraction[strcspn(orig_fraction, "\n")] = '\0';
+		write_sysctl(sysctl_path, orig_fraction);
+		printf("Restored percpu_pagelist_high_fraction=%s\n",
+		       orig_fraction);
+	}
+
+	free(threads);
+	free(stats);
+	return 0;
+}
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 17:29 ` Andrew Morton
  2026-05-08 17:38   ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2026-05-08 17:29 UTC (permalink / raw)
  To: hawk
  Cc: linux-mm, Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, David Hildenbrand, Lorenzo Stoakes,
	Shuah Khan, linux-kernel, linux-trace-kernel, kernel-team

On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:

> Add tracepoints to the page allocator fast paths that acquire
> zone->lock, allowing diagnosis of lock contention in production.

Thanks, I'm surprised we haven't done this yet.

Unfortunately "mm: use spinlock guards for zone lock" messed this up
(https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).

So please let's give it a few days for reviewers to comment then redo
against mm.git's mm-unstable branch?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
@ 2026-05-08 17:38   ` Vlastimil Babka (SUSE)
  2026-05-08 17:40     ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:38 UTC (permalink / raw)
  To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
  Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team

On 5/8/26 7:29 PM, Andrew Morton wrote:
> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> 
>> Add tracepoints to the page allocator fast paths that acquire
>> zone->lock, allowing diagnosis of lock contention in production.
> 
> Thanks, I'm surprised we haven't done this yet.

There was a recent attempt [1]. Not being a generic solution wasn't welcome.

[1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/

> Unfortunately "mm: use spinlock guards for zone lock" messed this up
> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
> 
> So please let's give it a few days for reviewers to comment then redo
> against mm.git's mm-unstable branch?



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:38   ` Vlastimil Babka (SUSE)
@ 2026-05-08 17:40     ` Vlastimil Babka (SUSE)
  2026-05-08 18:07       ` Dmitry Ilvokhin
  0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:40 UTC (permalink / raw)
  To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
  Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team

On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:29 PM, Andrew Morton wrote:
>> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>
>>> Add tracepoints to the page allocator fast paths that acquire
>>> zone->lock, allowing diagnosis of lock contention in production.
>>
>> Thanks, I'm surprised we haven't done this yet.
> 
> There was a recent attempt [1]. Not being a generic solution wasn't welcome.
> 
> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/

And this is the generic solution I think?

https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/

>> Unfortunately "mm: use spinlock guards for zone lock" messed this up
>> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
>>
>> So please let's give it a few days for reviewers to comment then redo
>> against mm.git's mm-unstable branch?
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:40     ` Vlastimil Babka (SUSE)
@ 2026-05-08 18:07       ` Dmitry Ilvokhin
  0 siblings, 0 replies; 7+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-08 18:07 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Andrew Morton, hawk, Matthew Wilcox, linux-mm, Steven Rostedt,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
	kernel-team

On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> > On 5/8/26 7:29 PM, Andrew Morton wrote:
> >> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> >>
> >>> Add tracepoints to the page allocator fast paths that acquire
> >>> zone->lock, allowing diagnosis of lock contention in production.
> >>
> >> Thanks, I'm surprised we haven't done this yet.
> > 
> > There was a recent attempt [1]. Not being a generic solution wasn't welcome.
> > 
> > [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
> 
> And this is the generic solution I think?
> 
> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/

Thanks for cc'ing me, Vlastimil.

Yes, this is an attempt at a generic solution for tracing contended
locks, including spinlocks, so it should also cover the use case
proposed in this patchset.

In fact, zone->lock contention was one of the primary motivations for
this work.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 20:15   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 20:15 UTC (permalink / raw)
  To: hawk, Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team

On 5/8/26 18:22, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Add a selftest to verify the kmem:mm_zone_lock_contended,
> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
> 
> The test has two components:
> 
> zone_lock_contention.c - a workload that spawns threads doing rapid
> page allocation and freeing to generate zone->lock contention. It
> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
> free_pcppages_bulk() and rmqueue_bulk() calls.
> 
> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
> exist, have the expected fields, fire under load, and that wait_ns
> is populated when contention occurs.
> 
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  tools/testing/selftests/mm/Makefile           |   2 +
>  .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>  .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++

This really looks excessive and ... not really how we usually treat tracepoints?

I don't know about others, but I don't think this is really what we want as a MM
selftest.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-05-08 20:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
2026-05-08 20:15   ` David Hildenbrand (Arm)
2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
2026-05-08 17:38   ` Vlastimil Babka (SUSE)
2026-05-08 17:40     ` Vlastimil Babka (SUSE)
2026-05-08 18:07       ` Dmitry Ilvokhin
