* [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
@ 2026-05-08 16:22 hawk
2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
0 siblings, 2 replies; 7+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
To: Andrew Morton, linux-mm
Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team, hawk
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add tracepoints to the page allocator fast paths that acquire
zone->lock, allowing diagnosis of lock contention in production.
Three tracepoints are introduced:
kmem:mm_zone_lock_contended - fires when trylock fails (lock is held)
kmem:mm_zone_locked - fires on every acquisition
kmem:mm_zone_lock_unlock - fires on every release
Each event records the NUMA node, zone name, batch count, and caller.
The mm_zone_locked event additionally records wait_ns: the time spent
spinning when contended, measured via local_clock() with IRQs disabled
to ensure accurate same-CPU timestamps.
The lock/unlock paths are wrapped in __zone_lock()/__zone_unlock()
helpers that use trylock-first to separate the contended and
uncontended cases. Only the fast paths (free_pcppages_bulk,
rmqueue_bulk, free_one_page) are covered. Other zone->lock holders
such as compaction, page isolation, and memory hotplug are not
instrumented.
For minimum overhead in production, enable only mm_zone_lock_contended,
which fires solely on actual contention. Enable mm_zone_locked for
wait-time analysis, and add mm_zone_lock_unlock for hold-time
measurement.
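As a usage sketch (not part of the patch), the wait-time analysis described
above could be driven with a bpftrace one-liner; the event and field names
match the tracepoints added below, but bpftrace availability and root
access are assumed:

```shell
# Hypothetical usage sketch: histogram of zone->lock spin time per
# zone, restricted to contended acquisitions. Assumes bpftrace and a
# kernel carrying these tracepoints; must be run as root.
BT_PROG='tracepoint:kmem:mm_zone_locked /args->contended/ {
	@wait_ns[str(args->name)] = hist(args->wait_ns);
}'
# Print the command rather than running it, since it needs root + tracefs:
echo "bpftrace -e '${BT_PROG}'"
```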
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
include/trace/events/kmem.h | 101 ++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 50 +++++++++++++++---
2 files changed, 145 insertions(+), 6 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..870c68c70d57 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -458,6 +458,107 @@ TRACE_EVENT(rss_stat,
__print_symbolic(__entry->member, TRACE_MM_PAGES),
__entry->size)
);
+
+/*
+ * Tracepoints for zone->lock on the page allocator fast paths only.
+ * Other code paths that acquire zone->lock (compaction, isolation,
+ * memory hotplug, vmstat, etc.) are not covered here.
+ *
+ * Three events:
+ * mm_zone_lock_contended - trylock failed, about to spin
+ * mm_zone_locked - lock acquired, includes wait_ns when
+ * contended (zero otherwise)
+ * mm_zone_lock_unlock - lock released
+ *
+ * For production use with minimum overhead, enable only
+ * mm_zone_lock_contended -- it fires only when trylock detects the
+ * lock is already held.
+ *
+ * For wait-time analysis, enable mm_zone_locked -- its wait_ns
+ * field gives the spin duration directly. Adding unlock allows
+ * hold-time measurement, at the cost of one event per acquisition.
+ */
+TRACE_EVENT(mm_zone_lock_contended,
+
+ TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+ TP_ARGS(zone, count, caller),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( unsigned long, caller )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->caller = caller;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+ __entry->node_id, __get_str(name),
+ __entry->count, (void *)__entry->caller)
+);
+
+TRACE_EVENT(mm_zone_locked,
+
+ TP_PROTO(struct zone *zone, int count, bool contended,
+ unsigned long caller, u64 wait_ns),
+
+ TP_ARGS(zone, count, contended, caller, wait_ns),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( bool, contended )
+ __field( unsigned long, caller )
+ __field( u64, wait_ns )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->contended = contended;
+ __entry->caller = caller;
+ __entry->wait_ns = wait_ns;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d contended=%d caller=%pS wait=%llu ns",
+ __entry->node_id, __get_str(name),
+ __entry->count, __entry->contended,
+ (void *)__entry->caller, __entry->wait_ns)
+);
+
+TRACE_EVENT(mm_zone_lock_unlock,
+
+ TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+ TP_ARGS(zone, count, caller),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( unsigned long, caller )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->caller = caller;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+ __entry->node_id, __get_str(name),
+ __entry->count, (void *)__entry->caller)
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..08018e9beab4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -19,6 +19,7 @@
#include <linux/highmem.h>
#include <linux/interrupt.h>
#include <linux/jiffies.h>
+#include <linux/sched/clock.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/kasan.h>
@@ -1447,6 +1448,43 @@ bool free_pages_prepare(struct page *page, unsigned int order)
return __free_pages_prepare(page, order, FPI_NONE);
}
+/*
+ * Helper functions for locking zone->lock with tracepoints.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The @count parameter indicates the number
+ * of pages being freed or allocated in the batch operation.
+ *
+ * For minimum overhead, attach only to kmem:mm_zone_lock_contended,
+ * which fires only when trylock detects the lock is already held.
+ */
+static inline void
+__zone_lock(struct zone *zone, int count, unsigned long *flags)
+ __acquires(&zone->lock)
+{
+ unsigned long caller = _RET_IP_;
+ u64 wait_start, wait_time = 0;
+ bool contended;
+
+ local_irq_save(*flags);
+ contended = !spin_trylock(&zone->lock);
+ if (contended) {
+ wait_start = local_clock();
+ trace_mm_zone_lock_contended(zone, count, caller);
+ spin_lock(&zone->lock);
+ wait_time = local_clock() - wait_start;
+ }
+ trace_mm_zone_locked(zone, count, contended, caller, wait_time);
+}
+
+static inline void
+__zone_unlock(struct zone *zone, int count, unsigned long *flags)
+ __releases(&zone->lock)
+{
+ trace_mm_zone_lock_unlock(zone, count, _RET_IP_);
+ spin_unlock_irqrestore(&zone->lock, *flags);
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone.
@@ -1469,7 +1507,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Ensure requested pindex is drained first. */
pindex = pindex - 1;
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, count, &flags);
while (count > 0) {
struct list_head *list;
@@ -1502,7 +1540,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (count > 0 && !list_empty(list));
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, count, &flags);
}
/* Split a multi-block free page into its individual pageblocks. */
@@ -1551,7 +1589,7 @@ static void free_one_page(struct zone *zone, struct page *page,
return;
}
} else {
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, 1 << order, &flags);
}
/* The lock succeeded. Process deferred pages. */
@@ -1569,7 +1607,7 @@ static void free_one_page(struct zone *zone, struct page *page,
}
}
split_large_buddy(zone, page, pfn, order, fpi_flags);
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, 1 << order, &flags);
__count_vm_events(PGFREE, 1 << order);
}
@@ -2525,7 +2563,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (!spin_trylock_irqsave(&zone->lock, flags))
return 0;
} else {
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, count, &flags);
}
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2545,7 +2583,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
list_add_tail(&page->pcp_list, list);
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, i, &flags);
return i;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
@ 2026-05-08 16:22 ` hawk
2026-05-08 20:15 ` David Hildenbrand (Arm)
2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
1 sibling, 1 reply; 7+ messages in thread
From: hawk @ 2026-05-08 16:22 UTC (permalink / raw)
To: Andrew Morton, linux-mm
Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team, hawk
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add a selftest to verify the kmem:mm_zone_lock_contended,
kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
The test has two components:
zone_lock_contention.c - a workload that spawns threads doing rapid
page allocation and freeing to generate zone->lock contention. It
shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
free_pcppages_bulk() and rmqueue_bulk() calls.
test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
exist, have the expected fields, fire under load, and that wait_ns
is populated when contention occurs.
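Where bpftrace is unavailable, the same events can be checked by hand
through tracefs. A minimal sketch (the tracefs mount point and workload
binary name are assumed to match this patch; the steps are printed rather
than executed since they need root):

```shell
# Hand-verification sketch: enable one event, run the contention
# workload, read the ring buffer, then disable the event again.
TRACEFS=/sys/kernel/tracing
EV="$TRACEFS/events/kmem/mm_zone_locked"
STEPS="echo 1 > $EV/enable
./zone_lock_contention 5 4
grep -m 20 mm_zone_locked $TRACEFS/trace
echo 0 > $EV/enable"
printf '%s\n' "$STEPS"
```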
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
tools/testing/selftests/mm/Makefile | 2 +
.../mm/test_zone_lock_tracepoints.sh | 212 ++++++++++++++++++
.../selftests/mm/zone_lock_contention.c | 166 ++++++++++++++
3 files changed, 380 insertions(+)
create mode 100755 tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
create mode 100644 tools/testing/selftests/mm/zone_lock_contention.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..af6cfdf3c8a0 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
TEST_GEN_FILES += merge
TEST_GEN_FILES += rmap
TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += zone_lock_contention
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
@@ -173,6 +174,7 @@ TEST_PROGS += ksft_thp.sh
TEST_PROGS += ksft_userfaultfd.sh
TEST_PROGS += ksft_vma_merge.sh
TEST_PROGS += ksft_vmalloc.sh
+TEST_PROGS += test_zone_lock_tracepoints.sh
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
diff --git a/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
new file mode 100755
index 000000000000..7fa3dab1f6c5
--- /dev/null
+++ b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
@@ -0,0 +1,212 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# test_zone_lock_tracepoints.sh - Verify mm_zone_lock tracepoints fire
+#
+# Generates zone->lock contention and uses bpftrace to verify that the
+# kmem:mm_zone_lock_contended, kmem:mm_zone_locked, and
+# kmem:mm_zone_lock_unlock tracepoints activate and produce output.
+#
+# Requirements: bpftrace, root privileges, CONFIG_FTRACE=y
+#
+# Usage: ./test_zone_lock_tracepoints.sh [duration_sec]
+# Default duration: 5 seconds
+#
+# For running in a VM via virtme-ng:
+# make -C tools/testing/selftests/mm zone_lock_contention
+# vng --cpus 4 --memory 2G \
+# --rwdir tools/testing/selftests/mm \
+# --exec "cd tools/testing/selftests/mm && ./test_zone_lock_tracepoints.sh 5"
+
+set -e
+
+DURATION=${1:-5}
+TESTDIR="$(cd "$(dirname "$0")" && pwd)"
+WORKLOAD="$TESTDIR/zone_lock_contention"
+NR_THREADS=4
+PASS=0
+FAIL=0
+SKIP=0
+
+# --- helpers ---
+
+pass() { echo "PASS: $1"; PASS=$((PASS + 1)); }
+fail() { echo "FAIL: $1"; FAIL=$((FAIL + 1)); }
+skip() { echo "SKIP: $1"; SKIP=$((SKIP + 1)); }
+
+check_root() {
+ if [ "$(id -u)" -ne 0 ]; then
+ echo "ERROR: must run as root"
+ exit 4 # ksft SKIP
+ fi
+}
+
+check_bpftrace() {
+ if ! command -v bpftrace >/dev/null 2>&1; then
+ echo "SKIP: bpftrace not found"
+ exit 4
+ fi
+}
+
+check_workload() {
+ if [ ! -x "$WORKLOAD" ]; then
+ echo "SKIP: $WORKLOAD not found, run 'make -C tools/testing/selftests/mm' first"
+ exit 4
+ fi
+}
+
+check_tracepoint_exists() {
+ local tp="$1"
+ if [ ! -d "/sys/kernel/tracing/events/kmem/$tp" ]; then
+ skip "$tp tracepoint not in kernel"
+ return 1
+ fi
+ return 0
+}
+
+# --- Test 1: verify tracepoints exist in tracefs ---
+
+test_tracepoints_exist() {
+ echo "--- Test 1: tracepoints exist in tracefs ---"
+ for tp in mm_zone_lock_contended mm_zone_locked mm_zone_lock_unlock; do
+ if check_tracepoint_exists "$tp"; then
+ pass "$tp exists"
+ fi
+ done
+}
+
+# --- Test 2: verify format fields ---
+
+test_tracepoint_fields() {
+ echo "--- Test 2: tracepoint format fields ---"
+ local fmt
+
+ if [ -f /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format ]; then
+ fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format)
+ for field in node_id name count caller; do
+ if echo "$fmt" | grep -q "field.*$field"; then
+ pass "mm_zone_lock_contended has field '$field'"
+ else
+ fail "mm_zone_lock_contended missing field '$field'"
+ fi
+ done
+ fi
+
+ if [ -f /sys/kernel/tracing/events/kmem/mm_zone_locked/format ]; then
+ fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_locked/format)
+ for field in node_id name count contended caller wait_ns; do
+ if echo "$fmt" | grep -q "field.*$field"; then
+ pass "mm_zone_locked has field '$field'"
+ else
+ fail "mm_zone_locked missing field '$field'"
+ fi
+ done
+ fi
+}
+
+# --- Test 3: bpftrace counts tracepoint hits under load ---
+
+test_bpftrace_counts() {
+ echo "--- Test 3: bpftrace tracepoint activation under contention ---"
+
+ if ! check_tracepoint_exists mm_zone_locked; then
+ return
+ fi
+
+ local BPFTRACE_OUT
+ BPFTRACE_OUT=$(mktemp /tmp/zone_lock_bt.XXXXXX)
+
+ # bpftrace one-liner: count hits per tracepoint
+ bpftrace -e '
+ tracepoint:kmem:mm_zone_lock_contended { @contended = count(); }
+ tracepoint:kmem:mm_zone_locked { @locked = count(); }
+ tracepoint:kmem:mm_zone_lock_unlock { @unlock = count(); }
+ ' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+ local BT_PID=$!
+
+ # Wait for bpftrace + workload to finish
+ wait $BT_PID 2>/dev/null || true
+
+ echo "bpftrace output:"
+ cat "$BPFTRACE_OUT"
+
+ # Check that mm_zone_locked fired (it fires on every acquisition)
+ if grep -q '@locked: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_locked tracepoint fired"
+ else
+ fail "mm_zone_locked tracepoint did NOT fire"
+ fi
+
+ # Check that mm_zone_lock_unlock fired
+ if grep -q '@unlock: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_lock_unlock tracepoint fired"
+ else
+ fail "mm_zone_lock_unlock tracepoint did NOT fire"
+ fi
+
+ # contended may or may not fire depending on actual contention
+ if grep -q '@contended: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_lock_contended tracepoint fired (contention detected)"
+ else
+ skip "mm_zone_lock_contended did not fire (no contention observed)"
+ fi
+
+ rm -f "$BPFTRACE_OUT"
+}
+
+# --- Test 4: bpftrace verifies wait_ns > 0 when contended ---
+
+test_wait_ns() {
+ echo "--- Test 4: wait_ns is populated when contended ---"
+
+ if ! check_tracepoint_exists mm_zone_locked; then
+ return
+ fi
+
+ local BPFTRACE_OUT
+ BPFTRACE_OUT=$(mktemp /tmp/zone_lock_wait.XXXXXX)
+
+ bpftrace -e '
+ tracepoint:kmem:mm_zone_locked /args->contended/ {
+ @has_wait_ns = count();
+ @wait_ns = hist(args->wait_ns);
+ }
+ ' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+ local BT_PID=$!
+
+ wait $BT_PID 2>/dev/null || true
+
+ echo "bpftrace wait_ns output:"
+ cat "$BPFTRACE_OUT"
+
+ if grep -q '@has_wait_ns: [0-9]' "$BPFTRACE_OUT"; then
+ pass "wait_ns populated on contended acquisitions"
+ else
+ skip "no contended acquisitions observed for wait_ns check"
+ fi
+
+ rm -f "$BPFTRACE_OUT"
+}
+
+# --- Main ---
+
+echo "=== zone->lock tracepoint selftest ==="
+echo "Duration: ${DURATION}s, Threads: ${NR_THREADS}"
+echo
+
+check_root
+check_bpftrace
+check_workload
+
+test_tracepoints_exist
+test_tracepoint_fields
+test_bpftrace_counts
+test_wait_ns
+
+echo
+echo "=== Results: $PASS passed, $FAIL failed, $SKIP skipped ==="
+
+if [ "$FAIL" -gt 0 ]; then
+ exit 1
+fi
+exit 0
diff --git a/tools/testing/selftests/mm/zone_lock_contention.c b/tools/testing/selftests/mm/zone_lock_contention.c
new file mode 100644
index 000000000000..35ddad7670b1
--- /dev/null
+++ b/tools/testing/selftests/mm/zone_lock_contention.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zone_lock_contention.c - Generate zone->lock contention for tracepoint testing
+ *
+ * Spawns multiple threads that rapidly allocate and free pages to force
+ * PCP (per-cpu pageset) drains and refills, which acquire zone->lock via
+ * free_pcppages_bulk() and rmqueue_bulk().
+ *
+ * Reducing percpu_pagelist_high_fraction makes PCP lists smaller, causing
+ * more frequent zone->lock acquisitions and thus more contention.
+ *
+ * Usage: zone_lock_contention [duration_sec] [nr_threads]
+ * Defaults: 5 seconds, 4 threads
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <time.h>
+
+/* Each thread mmaps/touches/munmaps in a loop to churn pages */
+#define CHUNK_SIZE (2 * 1024 * 1024) /* 2 MB per iteration */
+#define PAGE_SZ 4096
+
+static volatile int stop;
+
+struct thread_stats {
+ unsigned long iterations;
+ unsigned long pages_touched;
+};
+
+static void *churn_thread(void *arg)
+{
+ struct thread_stats *stats = arg;
+ unsigned long iter = 0;
+ unsigned long pages = 0;
+
+ while (!stop) {
+ char *p;
+ size_t i;
+
+ p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (p == MAP_FAILED) {
+ perror("mmap");
+ break;
+ }
+
+ /* Touch every page to ensure allocation */
+ for (i = 0; i < CHUNK_SIZE; i += PAGE_SZ)
+ p[i] = 1;
+
+ pages += CHUNK_SIZE / PAGE_SZ;
+
+ /* Free pages back - forces PCP drain */
+ munmap(p, CHUNK_SIZE);
+ iter++;
+ }
+
+ stats->iterations = iter;
+ stats->pages_touched = pages;
+ return NULL;
+}
+
+static int write_sysctl(const char *path, const char *val)
+{
+ FILE *f = fopen(path, "w");
+
+ if (!f)
+ return -1;
+ fputs(val, f);
+ fclose(f);
+ return 0;
+}
+
+static int read_sysctl(const char *path, char *buf, size_t len)
+{
+ FILE *f = fopen(path, "r");
+
+ if (!f)
+ return -1;
+ if (!fgets(buf, len, f)) {
+ fclose(f);
+ return -1;
+ }
+ fclose(f);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int duration = 5;
+ int nr_threads = 4;
+ char orig_fraction[32] = "";
+ const char *sysctl_path = "/proc/sys/vm/percpu_pagelist_high_fraction";
+ pthread_t *threads;
+ struct thread_stats *stats;
+ unsigned long total_iter = 0, total_pages = 0;
+ int i;
+
+ if (argc > 1)
+ duration = atoi(argv[1]);
+ if (argc > 2)
+ nr_threads = atoi(argv[2]);
+
+ if (duration <= 0 || nr_threads <= 0) {
+ fprintf(stderr, "Usage: %s [duration_sec] [nr_threads]\n", argv[0]);
+ return 1;
+ }
+
+ printf("zone_lock_contention: %d threads, %d seconds\n",
+ nr_threads, duration);
+
+ /* Shrink PCP lists to force more zone->lock acquisitions */
+ read_sysctl(sysctl_path, orig_fraction, sizeof(orig_fraction));
+ if (write_sysctl(sysctl_path, "100") < 0)
+ fprintf(stderr, "WARNING: cannot write %s (not root?)\n",
+ sysctl_path);
+ else
+ printf("Set percpu_pagelist_high_fraction=100 (was %s)\n",
+ orig_fraction);
+
+ threads = calloc(nr_threads, sizeof(*threads));
+ stats = calloc(nr_threads, sizeof(*stats));
+ if (!threads || !stats) {
+ perror("calloc");
+ return 1;
+ }
+
+ for (i = 0; i < nr_threads; i++) {
+ if (pthread_create(&threads[i], NULL, churn_thread, &stats[i])) {
+ perror("pthread_create");
+ return 1;
+ }
+ }
+
+ sleep(duration);
+ stop = 1;
+
+ for (i = 0; i < nr_threads; i++) {
+ pthread_join(threads[i], NULL);
+ total_iter += stats[i].iterations;
+ total_pages += stats[i].pages_touched;
+ }
+
+ printf("Total: %lu iterations, %lu pages (%lu MB) churned\n",
+ total_iter, total_pages,
+ (total_pages * PAGE_SZ) / (1024 * 1024));
+
+ /* Restore original sysctl */
+ if (orig_fraction[0]) {
+ /* Strip trailing newline */
+ orig_fraction[strcspn(orig_fraction, "\n")] = '\0';
+ write_sysctl(sysctl_path, orig_fraction);
+ printf("Restored percpu_pagelist_high_fraction=%s\n",
+ orig_fraction);
+ }
+
+ free(threads);
+ free(stats);
+ return 0;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 17:29 ` Andrew Morton
2026-05-08 17:38 ` Vlastimil Babka (SUSE)
1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2026-05-08 17:29 UTC (permalink / raw)
To: hawk
Cc: linux-mm, Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan,
Michal Hocko, Zi Yan, David Hildenbrand, Lorenzo Stoakes,
Shuah Khan, linux-kernel, linux-trace-kernel, kernel-team
On Fri, 8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> Add tracepoints to the page allocator fast paths that acquire
> zone->lock, allowing diagnosis of lock contention in production.
Thanks, I'm surprised we haven't done this yet.
Unfortunately "mm: use spinlock guards for zone lock" messed this up
(https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
So please let's give it a few days for reviewers to comment then redo
against mm.git's mm-unstable branch?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
@ 2026-05-08 17:38 ` Vlastimil Babka (SUSE)
2026-05-08 17:40 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:38 UTC (permalink / raw)
To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team
On 5/8/26 7:29 PM, Andrew Morton wrote:
> On Fri, 8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>
>> Add tracepoints to the page allocator fast paths that acquire
>> zone->lock, allowing diagnosis of lock contention in production.
>
> Thanks, I'm surprised we haven't done this yet.
There was a recent attempt [1]. Not being a generic solution wasn't welcome.
[1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
> Unfortunately "mm: use spinlock guards for zone lock" messed this up
> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
>
> So please let's give it a few days for reviewers to comment then redo
> against mm.git's mm-unstable branch?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
2026-05-08 17:38 ` Vlastimil Babka (SUSE)
@ 2026-05-08 17:40 ` Vlastimil Babka (SUSE)
2026-05-08 18:07 ` Dmitry Ilvokhin
0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-08 17:40 UTC (permalink / raw)
To: Andrew Morton, hawk, Dmitry Ilvokhin, Matthew Wilcox
Cc: linux-mm, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team
On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:29 PM, Andrew Morton wrote:
>> On Fri, 8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
>>
>>> Add tracepoints to the page allocator fast paths that acquire
>>> zone->lock, allowing diagnosis of lock contention in production.
>>
>> Thanks, I'm surprised we haven't done this yet.
>
> There was a recent attempt [1]. Not being a generic solution wasn't welcome.
>
> [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
And this is the generic solution I think?
https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
>> Unfortunately "mm: use spinlock guards for zone lock" messed this up
>> (https://lore.kernel.org/all/cover.1777462630.git.d@ilvokhin.com/).
>>
>> So please let's give it a few days for reviewers to comment then redo
>> against mm.git's mm-unstable branch?
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
2026-05-08 17:40 ` Vlastimil Babka (SUSE)
@ 2026-05-08 18:07 ` Dmitry Ilvokhin
0 siblings, 0 replies; 7+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-08 18:07 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Andrew Morton, hawk, Matthew Wilcox, linux-mm, Steven Rostedt,
Suren Baghdasaryan, Michal Hocko, Zi Yan, David Hildenbrand,
Lorenzo Stoakes, Shuah Khan, linux-kernel, linux-trace-kernel,
kernel-team
On Fri, May 08, 2026 at 07:40:51PM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/8/26 7:38 PM, Vlastimil Babka (SUSE) wrote:
> > On 5/8/26 7:29 PM, Andrew Morton wrote:
> >> On Fri, 8 May 2026 18:22:06 +0200 hawk@kernel.org wrote:
> >>
> >>> Add tracepoints to the page allocator fast paths that acquire
> >>> zone->lock, allowing diagnosis of lock contention in production.
> >>
> >> Thanks, I'm surprised we haven't done this yet.
> >
> > There was a recent attempt [1]. Not being a generic solution wasn't welcome.
> >
> > [1] https://lore.kernel.org/all/cover.1772206930.git.d@ilvokhin.com/
>
> And this is the generic solution I think?
>
> https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
Thanks for cc'ing me, Vlastimil.
Yes, this is an attempt at a generic solution for tracing contended
locks, including spinlocks, so it should also cover the use case
proposed in this patchset.
In fact, zone->lock contention was one of the primary motivations for
this work.
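For comparison, a related generic view is already possible today via the
pre-existing lock:contention_begin/contention_end tracepoints (the events
behind 'perf lock contention'); a hedged sketch, independent of either
patchset, assuming a kernel recent enough to carry those events:

```shell
# Sketch: attribute generic lock contention to kernel stacks using the
# existing lock:contention_begin event. Assumes bpftrace and a kernel
# with the event available; printed rather than executed (needs root).
BT_GENERIC='tracepoint:lock:contention_begin { @[kstack(5)] = count(); }'
echo "bpftrace -e '${BT_GENERIC}'"
```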
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
@ 2026-05-08 20:15 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 20:15 UTC (permalink / raw)
To: hawk, Andrew Morton, linux-mm
Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Lorenzo Stoakes, Shuah Khan, linux-kernel,
linux-trace-kernel, kernel-team
On 5/8/26 18:22, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Add a selftest to verify the kmem:mm_zone_lock_contended,
> kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.
>
> The test has two components:
>
> zone_lock_contention.c - a workload that spawns threads doing rapid
> page allocation and freeing to generate zone->lock contention. It
> shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
> free_pcppages_bulk() and rmqueue_bulk() calls.
>
> test_zone_lock_tracepoints.sh - uses bpftrace to verify tracepoints
> exist, have the expected fields, fire under load, and that wait_ns
> is populated when contention occurs.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> tools/testing/selftests/mm/Makefile | 2 +
> .../mm/test_zone_lock_tracepoints.sh | 212 ++++++++++++++++++
> .../selftests/mm/zone_lock_contention.c | 166 ++++++++++++++
This really looks excessive and ... not really how we usually treat tracepoints?
I don't know about others, but I don't think this is really what we want as a MM
selftest.
--
Cheers,
David
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-05-08 20:15 UTC | newest]
Thread overview: 7+ messages
-- links below jump to the message on this page --
2026-05-08 16:22 [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions hawk
2026-05-08 16:22 ` [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test hawk
2026-05-08 20:15 ` David Hildenbrand (Arm)
2026-05-08 17:29 ` [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions Andrew Morton
2026-05-08 17:38 ` Vlastimil Babka (SUSE)
2026-05-08 17:40 ` Vlastimil Babka (SUSE)
2026-05-08 18:07 ` Dmitry Ilvokhin