Linux Trace Kernel
* [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
@ 2026-05-08 16:22 hawk
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
  0 siblings, 2 replies; 9+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team, hawk

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add tracepoints to the page allocator fast paths that acquire
zone->lock, allowing diagnosis of lock contention in production.

Three tracepoints are introduced:
  kmem:mm_zone_lock_contended - fires when trylock fails (lock is held)
  kmem:mm_zone_locked         - fires on every acquisition
  kmem:mm_zone_lock_unlock    - fires on every release

Each event records the NUMA node, zone name, batch count, and caller.
The mm_zone_locked event additionally records wait_ns: the time spent
spinning when contended, measured via local_clock() with IRQs disabled
to ensure accurate same-CPU timestamps.

The lock/unlock paths are wrapped in __zone_lock()/__zone_unlock()
helpers that use trylock-first to separate the contended and
uncontended cases.  Only the fast paths (free_pcppages_bulk,
rmqueue_bulk, free_one_page) are covered.  Other zone->lock holders
such as compaction, page isolation, and memory hotplug are not
instrumented.

For minimum overhead in production, enable only mm_zone_lock_contended,
which fires only on actual contention.  Enable mm_zone_locked for
wait-time analysis, and add mm_zone_lock_unlock for hold-time
measurement.
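
For example, attaching via tracefs (assuming the standard tracefs mount
point; these commands are illustrative, not part of this patch):

  # Lowest overhead: only the contention event
  echo 1 > /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/enable
  # Wait-time analysis: also record every acquisition
  echo 1 > /sys/kernel/tracing/events/kmem/mm_zone_locked/enable
  cat /sys/kernel/tracing/trace_pipe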

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/trace/events/kmem.h | 101 ++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |  50 +++++++++++++++---
 2 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..870c68c70d57 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -458,6 +458,107 @@ TRACE_EVENT(rss_stat,
 		__print_symbolic(__entry->member, TRACE_MM_PAGES),
 		__entry->size)
 	);
+
+/*
+ * Tracepoints for zone->lock on the page allocator fast paths only.
+ * Other code paths that acquire zone->lock (compaction, isolation,
+ * memory hotplug, vmstat, etc.) are not covered here.
+ *
+ * Three events:
+ *   mm_zone_lock_contended - trylock failed, about to spin
+ *   mm_zone_locked         - lock acquired, includes wait_ns when
+ *                            contended (zero otherwise)
+ *   mm_zone_lock_unlock    - lock released
+ *
+ * For production use with minimum overhead, enable only
+ * mm_zone_lock_contended -- it fires only when trylock detects the
+ * lock is already held.
+ *
+ * For wait-time analysis, enable mm_zone_locked -- its wait_ns
+ * field gives the spin duration directly.  Adding unlock allows
+ * hold-time measurement, at the cost of one event per acquisition.
+ */
+TRACE_EVENT(mm_zone_lock_contended,
+
+	TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+	TP_ARGS(zone, count, caller),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	unsigned long,	caller		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->caller = caller;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, (void *)__entry->caller)
+);
+
+TRACE_EVENT(mm_zone_locked,
+
+	TP_PROTO(struct zone *zone, int count, bool contended,
+		 unsigned long caller, u64 wait_ns),
+
+	TP_ARGS(zone, count, contended, caller, wait_ns),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	bool,		contended	)
+		__field(	unsigned long,	caller		)
+		__field(	u64,		wait_ns		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->contended = contended;
+		__entry->caller = caller;
+		__entry->wait_ns = wait_ns;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d contended=%d caller=%pS wait=%llu ns",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, __entry->contended,
+		  (void *)__entry->caller, __entry->wait_ns)
+);
+
+TRACE_EVENT(mm_zone_lock_unlock,
+
+	TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+	TP_ARGS(zone, count, caller),
+
+	TP_STRUCT__entry(
+		__field(	int,		node_id		)
+		__string(	name,		zone->name	)
+		__field(	int,		count		)
+		__field(	unsigned long,	caller		)
+	),
+
+	TP_fast_assign(
+		__entry->node_id = zone_to_nid(zone);
+		__assign_str(name);
+		__entry->count = count;
+		__entry->caller = caller;
+	),
+
+	TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+		  __entry->node_id, __get_str(name),
+		  __entry->count, (void *)__entry->caller)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..08018e9beab4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -19,6 +19,7 @@
 #include <linux/highmem.h>
 #include <linux/interrupt.h>
 #include <linux/jiffies.h>
+#include <linux/sched/clock.h>
 #include <linux/compiler.h>
 #include <linux/kernel.h>
 #include <linux/kasan.h>
@@ -1447,6 +1448,43 @@ bool free_pages_prepare(struct page *page, unsigned int order)
 	return __free_pages_prepare(page, order, FPI_NONE);
 }
 
+/*
+ * Helper functions for locking zone->lock with tracepoints.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments.  The @count parameter indicates the number
+ * of pages being freed or allocated in the batch operation.
+ *
+ * For minimum overhead, attach only to kmem:mm_zone_lock_contended,
+ * which is activated only when trylock detects the lock is contended.
+ */
+static inline void
+__zone_lock(struct zone *zone, int count, unsigned long *flags)
+	__acquires(&zone->lock)
+{
+	unsigned long caller = _RET_IP_;
+	u64 wait_start, wait_time = 0;
+	bool contended;
+
+	local_irq_save(*flags);
+	contended = !spin_trylock(&zone->lock);
+	if (contended) {
+		wait_start = local_clock();
+		trace_mm_zone_lock_contended(zone, count, caller);
+		spin_lock(&zone->lock);
+		wait_time = local_clock() - wait_start;
+	}
+	trace_mm_zone_locked(zone, count, contended, caller, wait_time);
+}
+
+static inline void
+__zone_unlock(struct zone *zone, int count, unsigned long *flags)
+	__releases(&zone->lock)
+{
+	trace_mm_zone_lock_unlock(zone, count, _RET_IP_);
+	spin_unlock_irqrestore(&zone->lock, *flags);
+}
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone.
@@ -1469,7 +1507,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	__zone_lock(zone, count, &flags);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -1502,7 +1540,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (count > 0 && !list_empty(list));
 	}
 
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, count, &flags);
 }
 
 /* Split a multi-block free page into its individual pageblocks. */
@@ -1551,7 +1589,7 @@ static void free_one_page(struct zone *zone, struct page *page,
 			return;
 		}
 	} else {
-		spin_lock_irqsave(&zone->lock, flags);
+		__zone_lock(zone, 1 << order, &flags);
 	}
 
 	/* The lock succeeded. Process deferred pages. */
@@ -1569,7 +1607,7 @@ static void free_one_page(struct zone *zone, struct page *page,
 		}
 	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, 1 << order, &flags);
 
 	__count_vm_events(PGFREE, 1 << order);
 }
@@ -2525,7 +2563,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (!spin_trylock_irqsave(&zone->lock, flags))
 			return 0;
 	} else {
-		spin_lock_irqsave(&zone->lock, flags);
+		__zone_lock(zone, count, &flags);
 	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2545,7 +2583,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 */
 		list_add_tail(&page->pcp_list, list);
 	}
-	spin_unlock_irqrestore(&zone->lock, flags);
+	__zone_unlock(zone, i, &flags);
 
 	return i;
 }
-- 
2.43.0



* [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
  2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
@ 2026-05-08 16:22 ` hawk
  2026-05-08 20:15   ` David Hildenbrand (Arm)
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
  1 sibling, 1 reply; 9+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team, hawk

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add a selftest to verify the kmem:mm_zone_lock_contended,
kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.

The test has two components:

zone_lock_contention.c - a workload that spawns threads doing rapid
page allocation and freeing to generate zone->lock contention. It
shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
free_pcppages_bulk() and rmqueue_bulk() calls.

test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
exist, have the expected fields, fire under load, and that wait_ns
is populated when contention occurs.
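
The test can also be invoked standalone; a sketch, assuming an in-tree
build (see the script header for a virtme-ng variant):

  make -C tools/testing/selftests/mm zone_lock_contention
  sudo ./tools/testing/selftests/mm/test_zone_lock_tracepoints.sh 5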

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 tools/testing/selftests/mm/Makefile           |   2 +
 .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
 .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
 3 files changed, 380 insertions(+)
 create mode 100755 tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
 create mode 100644 tools/testing/selftests/mm/zone_lock_contention.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..af6cfdf3c8a0 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += zone_lock_contention
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
@@ -173,6 +174,7 @@ TEST_PROGS += ksft_thp.sh
 TEST_PROGS += ksft_userfaultfd.sh
 TEST_PROGS += ksft_vma_merge.sh
 TEST_PROGS += ksft_vmalloc.sh
+TEST_PROGS += test_zone_lock_tracepoints.sh
 
 TEST_FILES := test_vmalloc.sh
 TEST_FILES += test_hmm.sh
diff --git a/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
new file mode 100755
index 000000000000..7fa3dab1f6c5
--- /dev/null
+++ b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
@@ -0,0 +1,212 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# test_zone_lock_tracepoints.sh - Verify mm_zone_lock tracepoints fire
+#
+# Generates zone->lock contention and uses bpftrace to verify that the
+# kmem:mm_zone_lock_contended, kmem:mm_zone_locked, and
+# kmem:mm_zone_lock_unlock tracepoints activate and produce output.
+#
+# Requirements: bpftrace, root privileges, CONFIG_FTRACE=y
+#
+# Usage: ./test_zone_lock_tracepoints.sh [duration_sec]
+#        Default duration: 5 seconds
+#
+# For running in a VM via virtme-ng:
+#   make -C tools/testing/selftests/mm zone_lock_contention
+#   vng --cpus 4 --memory 2G \
+#       --rwdir tools/testing/selftests/mm \
+#       --exec "cd tools/testing/selftests/mm && ./test_zone_lock_tracepoints.sh 5"
+
+set -e
+
+DURATION=${1:-5}
+TESTDIR="$(cd "$(dirname "$0")" && pwd)"
+WORKLOAD="$TESTDIR/zone_lock_contention"
+NR_THREADS=4
+PASS=0
+FAIL=0
+SKIP=0
+
+# --- helpers ---
+
+pass() { echo "PASS: $1"; PASS=$((PASS + 1)); }
+fail() { echo "FAIL: $1"; FAIL=$((FAIL + 1)); }
+skip() { echo "SKIP: $1"; SKIP=$((SKIP + 1)); }
+
+check_root() {
+	if [ "$(id -u)" -ne 0 ]; then
+		echo "ERROR: must run as root"
+		exit 4  # ksft SKIP
+	fi
+}
+
+check_bpftrace() {
+	if ! command -v bpftrace >/dev/null 2>&1; then
+		echo "SKIP: bpftrace not found"
+		exit 4
+	fi
+}
+
+check_workload() {
+	if [ ! -x "$WORKLOAD" ]; then
+		echo "SKIP: $WORKLOAD not found, run 'make -C tools/testing/selftests/mm' first"
+		exit 4
+	fi
+}
+
+check_tracepoint_exists() {
+	local tp="$1"
+	if [ ! -d "/sys/kernel/tracing/events/kmem/$tp" ]; then
+		skip "$tp tracepoint not in kernel"
+		return 1
+	fi
+	return 0
+}
+
+# --- Test 1: verify tracepoints exist in tracefs ---
+
+test_tracepoints_exist() {
+	echo "--- Test 1: tracepoints exist in tracefs ---"
+	for tp in mm_zone_lock_contended mm_zone_locked mm_zone_lock_unlock; do
+		if check_tracepoint_exists "$tp"; then
+			pass "$tp exists"
+		fi
+	done
+}
+
+# --- Test 2: verify format fields ---
+
+test_tracepoint_fields() {
+	echo "--- Test 2: tracepoint format fields ---"
+	local fmt
+
+	if [ -f /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format ]; then
+		fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format)
+		for field in node_id name count caller; do
+			if echo "$fmt" | grep -q "field.*$field"; then
+				pass "mm_zone_lock_contended has field '$field'"
+			else
+				fail "mm_zone_lock_contended missing field '$field'"
+			fi
+		done
+	fi
+
+	if [ -f /sys/kernel/tracing/events/kmem/mm_zone_locked/format ]; then
+		fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_locked/format)
+		for field in node_id name count contended caller wait_ns; do
+			if echo "$fmt" | grep -q "field.*$field"; then
+				pass "mm_zone_locked has field '$field'"
+			else
+				fail "mm_zone_locked missing field '$field'"
+			fi
+		done
+	fi
+}
+
+# --- Test 3: bpftrace counts tracepoint hits under load ---
+
+test_bpftrace_counts() {
+	echo "--- Test 3: bpftrace tracepoint activation under contention ---"
+
+	if ! check_tracepoint_exists mm_zone_locked; then
+		return
+	fi
+
+	local BPFTRACE_OUT
+	BPFTRACE_OUT=$(mktemp /tmp/zone_lock_bt.XXXXXX)
+
+	# bpftrace one-liner: count hits per tracepoint
+	bpftrace -e '
+		tracepoint:kmem:mm_zone_lock_contended { @contended = count(); }
+		tracepoint:kmem:mm_zone_locked          { @locked = count(); }
+		tracepoint:kmem:mm_zone_lock_unlock     { @unlock = count(); }
+	' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+	local BT_PID=$!
+
+	# Wait for bpftrace + workload to finish
+	wait $BT_PID 2>/dev/null || true
+
+	echo "bpftrace output:"
+	cat "$BPFTRACE_OUT"
+
+	# Check that mm_zone_locked fired (it fires on every acquisition)
+	if grep -q '@locked: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_locked tracepoint fired"
+	else
+		fail "mm_zone_locked tracepoint did NOT fire"
+	fi
+
+	# Check that mm_zone_lock_unlock fired
+	if grep -q '@unlock: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_lock_unlock tracepoint fired"
+	else
+		fail "mm_zone_lock_unlock tracepoint did NOT fire"
+	fi
+
+	# contended may or may not fire depending on actual contention
+	if grep -q '@contended: [0-9]' "$BPFTRACE_OUT"; then
+		pass "mm_zone_lock_contended tracepoint fired (contention detected)"
+	else
+		skip "mm_zone_lock_contended did not fire (no contention observed)"
+	fi
+
+	rm -f "$BPFTRACE_OUT"
+}
+
+# --- Test 4: bpftrace verifies wait_ns > 0 when contended ---
+
+test_wait_ns() {
+	echo "--- Test 4: wait_ns is populated when contended ---"
+
+	if ! check_tracepoint_exists mm_zone_locked; then
+		return
+	fi
+
+	local BPFTRACE_OUT
+	BPFTRACE_OUT=$(mktemp /tmp/zone_lock_wait.XXXXXX)
+
+	bpftrace -e '
+		tracepoint:kmem:mm_zone_locked /args->contended/ {
+			@has_wait_ns = count();
+			@wait_ns = hist(args->wait_ns);
+		}
+	' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+	local BT_PID=$!
+
+	wait $BT_PID 2>/dev/null || true
+
+	echo "bpftrace wait_ns output:"
+	cat "$BPFTRACE_OUT"
+
+	if grep -q '@has_wait_ns: [0-9]' "$BPFTRACE_OUT"; then
+		pass "wait_ns populated on contended acquisitions"
+	else
+		skip "no contended acquisitions observed for wait_ns check"
+	fi
+
+	rm -f "$BPFTRACE_OUT"
+}
+
+# --- Main ---
+
+echo "=== zone->lock tracepoint selftest ==="
+echo "Duration: ${DURATION}s, Threads: ${NR_THREADS}"
+echo
+
+check_root
+check_bpftrace
+check_workload
+
+test_tracepoints_exist
+test_tracepoint_fields
+test_bpftrace_counts
+test_wait_ns
+
+echo
+echo "=== Results: $PASS passed, $FAIL failed, $SKIP skipped ==="
+
+if [ "$FAIL" -gt 0 ]; then
+	exit 1
+fi
+exit 0
diff --git a/tools/testing/selftests/mm/zone_lock_contention.c b/tools/testing/selftests/mm/zone_lock_contention.c
new file mode 100644
index 000000000000..35ddad7670b1
--- /dev/null
+++ b/tools/testing/selftests/mm/zone_lock_contention.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zone_lock_contention.c - Generate zone->lock contention for tracepoint testing
+ *
+ * Spawns multiple threads that rapidly allocate and free pages to force
+ * PCP (per-cpu pageset) drains and refills, which acquire zone->lock via
+ * free_pcppages_bulk() and rmqueue_bulk().
+ *
+ * Reducing percpu_pagelist_high_fraction makes PCP lists smaller, causing
+ * more frequent zone->lock acquisitions and thus more contention.
+ *
+ * Usage: zone_lock_contention [duration_sec] [nr_threads]
+ *        Defaults: 5 seconds, 4 threads
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <time.h>
+
+/* Each thread mmaps/touches/munmaps in a loop to churn pages */
+#define CHUNK_SIZE	(2 * 1024 * 1024)	/* 2 MB per iteration */
+#define PAGE_SZ		4096
+
+static volatile int stop;
+
+struct thread_stats {
+	unsigned long iterations;
+	unsigned long pages_touched;
+};
+
+static void *churn_thread(void *arg)
+{
+	struct thread_stats *stats = arg;
+	unsigned long iter = 0;
+	unsigned long pages = 0;
+
+	while (!stop) {
+		char *p;
+		size_t i;
+
+		p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			break;
+		}
+
+		/* Touch every page to ensure allocation */
+		for (i = 0; i < CHUNK_SIZE; i += PAGE_SZ)
+			p[i] = 1;
+
+		pages += CHUNK_SIZE / PAGE_SZ;
+
+		/* Free pages back - forces PCP drain */
+		munmap(p, CHUNK_SIZE);
+		iter++;
+	}
+
+	stats->iterations = iter;
+	stats->pages_touched = pages;
+	return NULL;
+}
+
+static int write_sysctl(const char *path, const char *val)
+{
+	FILE *f = fopen(path, "w");
+
+	if (!f)
+		return -1;
+	fputs(val, f);
+	fclose(f);
+	return 0;
+}
+
+static int read_sysctl(const char *path, char *buf, size_t len)
+{
+	FILE *f = fopen(path, "r");
+
+	if (!f)
+		return -1;
+	if (!fgets(buf, len, f)) {
+		fclose(f);
+		return -1;
+	}
+	fclose(f);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int duration = 5;
+	int nr_threads = 4;
+	char orig_fraction[32] = "";
+	const char *sysctl_path = "/proc/sys/vm/percpu_pagelist_high_fraction";
+	pthread_t *threads;
+	struct thread_stats *stats;
+	unsigned long total_iter = 0, total_pages = 0;
+	int i;
+
+	if (argc > 1)
+		duration = atoi(argv[1]);
+	if (argc > 2)
+		nr_threads = atoi(argv[2]);
+
+	if (duration <= 0 || nr_threads <= 0) {
+		fprintf(stderr, "Usage: %s [duration_sec] [nr_threads]\n", argv[0]);
+		return 1;
+	}
+
+	printf("zone_lock_contention: %d threads, %d seconds\n",
+	       nr_threads, duration);
+
+	/* Shrink PCP lists to force more zone->lock acquisitions */
+	read_sysctl(sysctl_path, orig_fraction, sizeof(orig_fraction));
+	if (write_sysctl(sysctl_path, "100") < 0)
+		fprintf(stderr, "WARNING: cannot write %s (not root?)\n",
+			sysctl_path);
+	else
+		printf("Set percpu_pagelist_high_fraction=100 (was %s)\n",
+		       orig_fraction);
+
+	threads = calloc(nr_threads, sizeof(*threads));
+	stats = calloc(nr_threads, sizeof(*stats));
+	if (!threads || !stats) {
+		perror("calloc");
+		return 1;
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&threads[i], NULL, churn_thread, &stats[i])) {
+			perror("pthread_create");
+			return 1;
+		}
+	}
+
+	sleep(duration);
+	stop = 1;
+
+	for (i = 0; i < nr_threads; i++) {
+		pthread_join(threads[i], NULL);
+		total_iter += stats[i].iterations;
+		total_pages += stats[i].pages_touched;
+	}
+
+	printf("Total: %lu iterations, %lu pages (%lu MB) churned\n",
+	       total_iter, total_pages,
+	       (total_pages * PAGE_SZ) / (1024 * 1024));
+
+	/* Restore original sysctl */
+	if (orig_fraction[0]) {
+		/* Strip trailing newline */
+		orig_fraction[strcspn(orig_fraction, "\n")] = '\0';
+		write_sysctl(sysctl_path, orig_fraction);
+		printf("Restored percpu_pagelist_high_fraction=%s\n",
+		       orig_fraction);
+	}
+
+	free(threads);
+	free(stats);
+	return 0;
+}
-- 
2.43.0



* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 17:29 ` Andrew Morton
  2026-05-08 17:38   ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2026-05-08 17:29 UTC (permalink / raw)
  To: hawk
  Cc: linux-mm, Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, David Hildenbrand, Lorenzo Stoakes,
	Shuah Khan, linux-kernel, linux-trace-kernel, kernel-team

On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:

> Add tracepoints to the page allocator fast paths that acquire
> zone->lock, allowing diagnosis of lock contention in production.

Thanks, I'm surprised we haven't done this yet.

Unfortunately "mm: use spinlock guards for zone lock" messed this up
(https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).

So please let's give it a few days for reviewers to comment then redo
against mm.git's mm-unstable branch?


* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
@ 2026-05-08 17:38   ` Vlastimil Babka (SUSE)
  2026-05-08 17:40     ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:38 UTC (permalink / raw)
  To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
  Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team

On 5/8/26 7:29 PM, Andrew Morton wrote:
> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> 
>> Add tracepoints to the page allocator fast paths that acquire
>> zone->lock, allowing diagnosis of lock contention in production.
> 
> Thanks, I'm surprised we haven't done this yet.

There was a recent attempt [1]. Not being a generic solution, it wasn't welcomed.

[1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/

> Unfortunately "mm: use spinlock guards for zone lock" messed this up
> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
> 
> So please let's give it a few days for reviewers to comment then redo
> against mm.git's mm-unstable branch?



* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:38   ` Vlastimil Babka (SUSE)
@ 2026-05-08 17:40     ` Vlastimil Babka (SUSE)
  2026-05-08 18:07       ` Dmitry Ilvokhin
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:40 UTC (permalink / raw)
  To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
  Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
	linux-kernel, linux-trace-kernel, kernel-team

On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:29 PM, Andrew Morton wrote:
>> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>
>>> Add tracepoints to the page allocator fast paths that acquire
>>> zone->lock, allowing diagnosis of lock contention in production.
>>
>> Thanks, I'm surprised we haven't done this yet.
> 
> There was a recent attempt [1]. Not being a generic solution, it wasn't welcomed.
> 
> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/

And this is the generic solution I think?

https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/

>> Unfortunately "mm: use spinlock guards for zone lock" messed this up
>> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
>>
>> So please let's give it a few days for reviewers to comment then redo
>> against mm.git's mm-unstable branch?
> 



* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 17:40     ` Vlastimil Babka (SUSE)
@ 2026-05-08 18:07       ` Dmitry Ilvokhin
  2026-05-13 15:32         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-08 18:07 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Andrew Morton, hawk, Matthew Wilcox, linux-mm, Steven Rostedt,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
	kernel-team

On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> > On 5/8/26 7:29 PM, Andrew Morton wrote:
> >> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> >>
> >>> Add tracepoints to the page allocator fast paths that acquire
> >>> zone->lock, allowing diagnosis of lock contention in production.
> >>
> >> Thanks, I'm surprised we haven't done this yet.
> > 
> > There was a recent attempt [1]. Not being a generic solution, it wasn't welcomed.
> > 
> > [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
> 
> And this is the generic solution I think?
> 
> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/

Thanks for cc'ing me, Vlastimil.

Yes, this is an attempt at a generic solution for tracing contended
locks, including spinlocks, so it should also cover the use case
proposed in this patchset.

In fact, zone->lock contention was one of the primary motivations for
this work.


* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
  2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 20:15   ` David Hildenbrand (Arm)
  2026-05-13 15:00     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 20:15 UTC (permalink / raw)
  To: hawk, Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team

On 5/8/26 18:22, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Add a selftest to verify the kmem:mm_zone_lock_contended,
> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
> 
> The test has two components:
> 
> zone_lock_contention.c - a workload that spawns threads doing rapid
> page allocation and freeing to generate zone->lock contention. It
> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
> free_pcppages_bulk() and rmqueue_bulk() calls.
> 
> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
> exist, have the expected fields, fire under load, and that wait_ns
> is populated when contention occurs.
> 
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  tools/testing/selftests/mm/Makefile           |   2 +
>  .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>  .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++

This really looks excessive and ... not really how we usually treat tracepoints?

I don't know about others, but I don't think this is really what we want as a MM
selftest.

-- 
Cheers,

David


* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
  2026-05-08 20:15   ` David Hildenbrand (Arm)
@ 2026-05-13 15:00     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-13 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, linux-mm
  Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
	linux-trace-kernel, kernel-team



On 08/05/2026 22.15, David Hildenbrand (Arm) wrote:
> On 5/8/26 18:22, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> Add a selftest to verify the kmem:mm_zone_lock_contended,
>> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
>>
>> The test has two components:
>>
>> zone_lock_contention.c - a workload that spawns threads doing rapid
>> page allocation and freeing to generate zone->lock contention. It
>> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
>> free_pcppages_bulk() and rmqueue_bulk() calls.
>>
>> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
>> exist, have the expected fields, fire under load, and that wait_ns
>> is populated when contention occurs.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   tools/testing/selftests/mm/Makefile           |   2 +
>>   .../mm/test_zone_lock_tracepoints.sh          | 212 ++++++++++++++++++
>>   .../selftests/mm/zone_lock_contention.c       | 166 ++++++++++++++
> 
> This really looks excessive and ... not really how we usually treat tracepoints?
> 
> I don't know about others, but I don't think this is really what we want as a MM
> selftest.
> 

I wanted a program that tested the code I changed, so I simply had AI
write a verification test and asked it to structure it as a selftest,
which I then ran to verify my code change was correctly implemented.  As
I needed to trigger lock contention, the test is more advanced than
usual, but luckily AI solved it on only the second attempt.

It makes sense to drop this patch. We shouldn't keep this code in the
kernel tree; it simply verified that my code works. There is little
chance that this test will catch meaningful regressions for these
tracepoints.

--Jesper


* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
  2026-05-08 18:07       ` Dmitry Ilvokhin
@ 2026-05-13 15:32         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-13 15:32 UTC (permalink / raw)
  To: Dmitry Ilvokhin, Vlastimil Babka (SUSE)
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, Steven Rostedt,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
	kernel-team



On 08/05/2026 20.07, Dmitry Ilvokhin wrote:
> On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
>>> On 5/8/26 7:29 PM, Andrew Morton wrote:
>>>> On Fri,  8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>>>
>>>>> Add tracepoints to the page allocator fast paths that acquire
>>>>> zone->lock, allowing diagnosis of lock contention in production.
>>>>
>>>> Thanks, I'm surprised we haven't done this yet.
>>>
>>> There was a recent attempt [1]. Not being a generic solution, it wasn't welcomed.
>>>
>>> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
>>
>> And this is the generic solution I think?
>>
>> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
> 
> Thanks for cc'ing me, Vlastimil.
> 
> Yes, this is an attempt at a generic solution for tracing contended
> locks, including spinlocks, so it should also cover the use case
> proposed in this patchset.
>

I'm aware of the generic solution and often use `perf lock contention`,
as well as the tool libbpf-tools/klockstat.  Unfortunately, my
experience is that enabling these tracepoints is prohibitively expensive
on production servers, and production suffers when I run these tools.
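
(Concretely, an invocation like `sudo perf lock contention -ab sleep 10`
-- flags from memory, so treat that as a sketch.)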

I'm very happy to see a patchset adding a contended case. But I worry
that tracing all contended locks in the system is also too much to have
enabled continuously in production.

This patch is carefully constructed to minimize overhead, such that I
can enable it continuously in production to catch issues.  If I identify
an issue, I will use the generic tracepoints for further debugging.


> In fact, zone->lock contention was one of the primary motivations for
> this work.

In the generic solution I'm losing the "zone" and pages "count".  I
need this information to get the answers I'm looking for.  Specifically,
I'm looking at reducing CONFIG_PCP_BATCH_SCALE_MAX, but I want this to
be a data-driven decision (my first principle: if you cannot measure
it, you cannot improve it).
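
Roughly what I have in mind is an aggregation like this (an untested
bpftrace sketch against the fields this patch adds: per-zone wait-time
histograms keyed by the batch count):

  bpftrace -e '
    tracepoint:kmem:mm_zone_locked /args->contended/ {
      @wait_ns[str(args->name), args->count] = hist(args->wait_ns);
    }'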

I'm likely going to apply this patch to our production systems, so that
I can make that data-driven decision.  I need to deploy it widely enough
to catch enough servers experiencing direct reclaim.  I'll report back
if people are interested in these learnings.

--Jesper

