* [PATCH 2/5] docs: fix repeated word 'that' across documentation
From: Adrien Reynard @ 2026-05-08 16:37 UTC
To: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Jonathan Corbet, Shuah Khan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, David Howells,
Paulo Alcantara, Masami Hiramatsu,
open list:READ-COPY UPDATE (RCU), open list:DOCUMENTATION,
open list, open list:DRIVER CORE, KOBJECTS, DEBUGFS AND SYSFS,
open list:FILESYSTEMS [NETFS LIBRARY], open list:TRACING
Cc: Adrien Reynard
Signed-off-by: Adrien Reynard <reynard.adrien.08@gmail.com>
---
Documentation/RCU/rcu.rst | 2 +-
Documentation/driver-api/driver-model/overview.rst | 2 +-
Documentation/filesystems/netfs_library.rst | 2 +-
Documentation/trace/histogram-design.rst | 2 +-
Documentation/trace/histogram.rst | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst
index bf6617b330a7..320ad3292b75 100644
--- a/Documentation/RCU/rcu.rst
+++ b/Documentation/RCU/rcu.rst
@@ -32,7 +32,7 @@ Frequently Asked Questions
Just as with spinlocks, RCU readers are not permitted to
block, switch to user-mode execution, or enter the idle loop.
Therefore, as soon as a CPU is seen passing through any of these
- three states, we know that that CPU has exited any previous RCU
+ three states, we know that CPU has exited any previous RCU
read-side critical sections. So, if we remove an item from a
linked list, and then wait until all CPUs have switched context,
executed in user mode, or executed in the idle loop, we can
diff --git a/Documentation/driver-api/driver-model/overview.rst b/Documentation/driver-api/driver-model/overview.rst
index b3f447bf9f07..c1966d506d55 100644
--- a/Documentation/driver-api/driver-model/overview.rst
+++ b/Documentation/driver-api/driver-model/overview.rst
@@ -55,7 +55,7 @@ struct pci_dev now looks like this::
Note first that the struct device dev within the struct pci_dev is
statically allocated. This means only one allocation on device discovery.
-Note also that that struct device dev is not necessarily defined at the
+Note also that struct device dev is not necessarily defined at the
front of the pci_dev structure. This is to make people think about what
they're doing when switching between the bus driver and the global driver,
and to discourage meaningless and incorrect casts between the two.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index ddd799df6ce3..4033de4535ac 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -626,7 +626,7 @@ A number of members are available for access/use by the filesystem:
These are set by the filesystem or the cache in ->prepare_read() or
->prepare_write() for each subrequest to indicate the maximum number of
- bytes and, optionally, the maximum number of segments (if not 0) that that
+ bytes and, optionally, the maximum number of segments (if not 0) that
subrequest can support.
* ``submit_extendable_to``
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
index e92f56ebd0b5..949bbfdb0f16 100644
--- a/Documentation/trace/histogram-design.rst
+++ b/Documentation/trace/histogram-design.rst
@@ -738,7 +738,7 @@ creates its own variable, wakeup_lat, but nothing yet uses it::
Looking at the sched_waking 'hist_debug' output, in addition to the
normal key and value hist_fields, in the val fields section we see a
-field with the HIST_FIELD_FL_VAR flag, which indicates that that field
+field with the HIST_FIELD_FL_VAR flag, which indicates that field
represents a variable. Note that in addition to the variable name,
contained in the var.name field, it includes the var.idx, which is the
index into the tracing_map_elt.vars[] array of the actual variable
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 340bcb5099e7..5b303fabdf32 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -1700,7 +1700,7 @@ to that rule is that any variable used in an expression is essentially
'read-once' - once it's used by an expression in a subsequent event,
it's reset to its 'unset' state, which means it can't be used again
unless it's set again. This ensures not only that an event doesn't
-use an uninitialized variable in a calculation, but that that variable
+use an uninitialized variable in a calculation, but that variable
is used only once and not for any unrelated subsequent match.
The basic syntax for saving a variable is to simply prefix a unique
--
2.54.0
* [PATCH 2/2] selftests/mm: add zone->lock tracepoint verification test
From: hawk @ 2026-05-08 16:22 UTC
To: Andrew Morton, linux-mm
Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team, hawk
In-Reply-To: <20260508162207.3315781-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add a selftest to verify the kmem:mm_zone_lock_contended,
kmem:mm_zone_locked, and kmem:mm_zone_lock_unlock tracepoints.

The test has two components:

  zone_lock_contention.c - a workload that spawns threads doing rapid
  page allocation and freeing to generate zone->lock contention. It
  shrinks PCP lists via percpu_pagelist_high_fraction to force frequent
  free_pcppages_bulk() and rmqueue_bulk() calls.

  test_zone_lock_tracepoints.sh - checks tracefs to verify that the
  tracepoints exist and have the expected fields, then uses bpftrace
  to verify that they fire under load and that wait_ns is populated
  when contention occurs.
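
The pass/fail decisions in the script rest on grepping bpftrace's map
dump for lines of the form "@locked: N". A minimal sketch of that check
follows; the helper name map_count and the sample numbers are made up
for illustration and are not part of the patch.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: print the counter that
# bpftrace reports for map "@<name>" in <file>, or 0 if the map is absent.
map_count() {
	name="$1"
	file="$2"
	# bpftrace prints scalar maps as "@name: value" when it exits.
	val=$(sed -n "s/^@${name}: \([0-9][0-9]*\)\$/\1/p" "$file" | head -n 1)
	echo "${val:-0}"
}

# Illustrative input only; these numbers are invented, not measured.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Attaching 3 probes...

@contended: 17
@locked: 4096
@unlock: 4096
EOF

for m in contended locked unlock; do
	echo "$m=$(map_count "$m" "$sample")"
done
rm -f "$sample"
```

Extracting the number, rather than only matching the line, would let a
caller compare the locked and unlock counts if that were ever wanted;
the script in this patch only checks that each line is present.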
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
tools/testing/selftests/mm/Makefile | 2 +
.../mm/test_zone_lock_tracepoints.sh | 212 ++++++++++++++++++
.../selftests/mm/zone_lock_contention.c | 166 ++++++++++++++
3 files changed, 380 insertions(+)
create mode 100755 tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
create mode 100644 tools/testing/selftests/mm/zone_lock_contention.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..af6cfdf3c8a0 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
TEST_GEN_FILES += merge
TEST_GEN_FILES += rmap
TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += zone_lock_contention
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
@@ -173,6 +174,7 @@ TEST_PROGS += ksft_thp.sh
TEST_PROGS += ksft_userfaultfd.sh
TEST_PROGS += ksft_vma_merge.sh
TEST_PROGS += ksft_vmalloc.sh
+TEST_PROGS += test_zone_lock_tracepoints.sh
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
diff --git a/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
new file mode 100755
index 000000000000..7fa3dab1f6c5
--- /dev/null
+++ b/tools/testing/selftests/mm/test_zone_lock_tracepoints.sh
@@ -0,0 +1,212 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# test_zone_lock_tracepoints.sh - Verify mm_zone_lock tracepoints fire
+#
+# Generates zone->lock contention and uses bpftrace to verify that the
+# kmem:mm_zone_lock_contended, kmem:mm_zone_locked, and
+# kmem:mm_zone_lock_unlock tracepoints activate and produce output.
+#
+# Requirements: bpftrace, root privileges, CONFIG_FTRACE=y
+#
+# Usage: ./test_zone_lock_tracepoints.sh [duration_sec]
+# Default duration: 5 seconds
+#
+# For running in a VM via virtme-ng:
+# make -C tools/testing/selftests/mm zone_lock_contention
+# vng --cpus 4 --memory 2G \
+# --rwdir tools/testing/selftests/mm \
+# --exec "cd tools/testing/selftests/mm && ./test_zone_lock_tracepoints.sh 5"
+
+set -e
+
+DURATION=${1:-5}
+TESTDIR="$(cd "$(dirname "$0")" && pwd)"
+WORKLOAD="$TESTDIR/zone_lock_contention"
+NR_THREADS=4
+PASS=0
+FAIL=0
+SKIP=0
+
+# --- helpers ---
+
+pass() { echo "PASS: $1"; PASS=$((PASS + 1)); }
+fail() { echo "FAIL: $1"; FAIL=$((FAIL + 1)); }
+skip() { echo "SKIP: $1"; SKIP=$((SKIP + 1)); }
+
+check_root() {
+ if [ "$(id -u)" -ne 0 ]; then
+ echo "ERROR: must run as root"
+ exit 4 # ksft SKIP
+ fi
+}
+
+check_bpftrace() {
+ if ! command -v bpftrace >/dev/null 2>&1; then
+ echo "SKIP: bpftrace not found"
+ exit 4
+ fi
+}
+
+check_workload() {
+ if [ ! -x "$WORKLOAD" ]; then
+ echo "SKIP: $WORKLOAD not found, run 'make -C tools/testing/selftests/mm' first"
+ exit 4
+ fi
+}
+
+check_tracepoint_exists() {
+ local tp="$1"
+ if [ ! -d "/sys/kernel/tracing/events/kmem/$tp" ]; then
+ skip "$tp tracepoint not in kernel"
+ return 1
+ fi
+ return 0
+}
+
+# --- Test 1: verify tracepoints exist in tracefs ---
+
+test_tracepoints_exist() {
+ echo "--- Test 1: tracepoints exist in tracefs ---"
+ for tp in mm_zone_lock_contended mm_zone_locked mm_zone_lock_unlock; do
+ if check_tracepoint_exists "$tp"; then
+ pass "$tp exists"
+ fi
+ done
+}
+
+# --- Test 2: verify format fields ---
+
+test_tracepoint_fields() {
+ echo "--- Test 2: tracepoint format fields ---"
+ local fmt
+
+ if [ -f /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format ]; then
+ fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_lock_contended/format)
+ for field in node_id name count caller; do
+ if echo "$fmt" | grep -q "field.*$field"; then
+ pass "mm_zone_lock_contended has field '$field'"
+ else
+ fail "mm_zone_lock_contended missing field '$field'"
+ fi
+ done
+ fi
+
+ if [ -f /sys/kernel/tracing/events/kmem/mm_zone_locked/format ]; then
+ fmt=$(cat /sys/kernel/tracing/events/kmem/mm_zone_locked/format)
+ for field in node_id name count contended caller wait_ns; do
+ if echo "$fmt" | grep -q "field.*$field"; then
+ pass "mm_zone_locked has field '$field'"
+ else
+ fail "mm_zone_locked missing field '$field'"
+ fi
+ done
+ fi
+}
+
+# --- Test 3: bpftrace counts tracepoint hits under load ---
+
+test_bpftrace_counts() {
+ echo "--- Test 3: bpftrace tracepoint activation under contention ---"
+
+ if ! check_tracepoint_exists mm_zone_locked; then
+ return
+ fi
+
+ local BPFTRACE_OUT
+ BPFTRACE_OUT=$(mktemp /tmp/zone_lock_bt.XXXXXX)
+
+ # bpftrace one-liner: count hits per tracepoint
+ bpftrace -e '
+ tracepoint:kmem:mm_zone_lock_contended { @contended = count(); }
+ tracepoint:kmem:mm_zone_locked { @locked = count(); }
+ tracepoint:kmem:mm_zone_lock_unlock { @unlock = count(); }
+ ' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+ local BT_PID=$!
+
+ # Wait for bpftrace + workload to finish
+ wait $BT_PID 2>/dev/null || true
+
+ echo "bpftrace output:"
+ cat "$BPFTRACE_OUT"
+
+ # Check that mm_zone_locked fired (it fires on every acquisition)
+ if grep -q '@locked: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_locked tracepoint fired"
+ else
+ fail "mm_zone_locked tracepoint did NOT fire"
+ fi
+
+ # Check that mm_zone_lock_unlock fired
+ if grep -q '@unlock: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_lock_unlock tracepoint fired"
+ else
+ fail "mm_zone_lock_unlock tracepoint did NOT fire"
+ fi
+
+ # contended may or may not fire depending on actual contention
+ if grep -q '@contended: [0-9]' "$BPFTRACE_OUT"; then
+ pass "mm_zone_lock_contended tracepoint fired (contention detected)"
+ else
+ skip "mm_zone_lock_contended did not fire (no contention observed)"
+ fi
+
+ rm -f "$BPFTRACE_OUT"
+}
+
+# --- Test 4: bpftrace verifies wait_ns > 0 when contended ---
+
+test_wait_ns() {
+ echo "--- Test 4: wait_ns is populated when contended ---"
+
+ if ! check_tracepoint_exists mm_zone_locked; then
+ return
+ fi
+
+ local BPFTRACE_OUT
+ BPFTRACE_OUT=$(mktemp /tmp/zone_lock_wait.XXXXXX)
+
+ bpftrace -e '
+ tracepoint:kmem:mm_zone_locked /args->contended/ {
+ @has_wait_ns = count();
+ @wait_ns = hist(args->wait_ns);
+ }
+ ' -c "$WORKLOAD $DURATION $NR_THREADS" > "$BPFTRACE_OUT" 2>&1 &
+ local BT_PID=$!
+
+ wait $BT_PID 2>/dev/null || true
+
+ echo "bpftrace wait_ns output:"
+ cat "$BPFTRACE_OUT"
+
+ if grep -q '@has_wait_ns: [0-9]' "$BPFTRACE_OUT"; then
+ pass "wait_ns populated on contended acquisitions"
+ else
+ skip "no contended acquisitions observed for wait_ns check"
+ fi
+
+ rm -f "$BPFTRACE_OUT"
+}
+
+# --- Main ---
+
+echo "=== zone->lock tracepoint selftest ==="
+echo "Duration: ${DURATION}s, Threads: ${NR_THREADS}"
+echo
+
+check_root
+check_bpftrace
+check_workload
+
+test_tracepoints_exist
+test_tracepoint_fields
+test_bpftrace_counts
+test_wait_ns
+
+echo
+echo "=== Results: $PASS passed, $FAIL failed, $SKIP skipped ==="
+
+if [ "$FAIL" -gt 0 ]; then
+ exit 1
+fi
+exit 0
diff --git a/tools/testing/selftests/mm/zone_lock_contention.c b/tools/testing/selftests/mm/zone_lock_contention.c
new file mode 100644
index 000000000000..35ddad7670b1
--- /dev/null
+++ b/tools/testing/selftests/mm/zone_lock_contention.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zone_lock_contention.c - Generate zone->lock contention for tracepoint testing
+ *
+ * Spawns multiple threads that rapidly allocate and free pages to force
+ * PCP (per-cpu pageset) drains and refills, which acquire zone->lock via
+ * free_pcppages_bulk() and rmqueue_bulk().
+ *
+ * Reducing percpu_pagelist_high_fraction makes PCP lists smaller, causing
+ * more frequent zone->lock acquisitions and thus more contention.
+ *
+ * Usage: zone_lock_contention [duration_sec] [nr_threads]
+ * Defaults: 5 seconds, 4 threads
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <time.h>
+
+/* Each thread mmaps/touches/munmaps in a loop to churn pages */
+#define CHUNK_SIZE (2 * 1024 * 1024) /* 2 MB per iteration */
+#define PAGE_SZ 4096
+
+static volatile int stop;
+
+struct thread_stats {
+ unsigned long iterations;
+ unsigned long pages_touched;
+};
+
+static void *churn_thread(void *arg)
+{
+ struct thread_stats *stats = arg;
+ unsigned long iter = 0;
+ unsigned long pages = 0;
+
+ while (!stop) {
+ char *p;
+ size_t i;
+
+ p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (p == MAP_FAILED) {
+ perror("mmap");
+ break;
+ }
+
+ /* Touch every page to ensure allocation */
+ for (i = 0; i < CHUNK_SIZE; i += PAGE_SZ)
+ p[i] = 1;
+
+ pages += CHUNK_SIZE / PAGE_SZ;
+
+ /* Free pages back - forces PCP drain */
+ munmap(p, CHUNK_SIZE);
+ iter++;
+ }
+
+ stats->iterations = iter;
+ stats->pages_touched = pages;
+ return NULL;
+}
+
+static int write_sysctl(const char *path, const char *val)
+{
+ FILE *f = fopen(path, "w");
+
+ if (!f)
+ return -1;
+ fputs(val, f);
+ fclose(f);
+ return 0;
+}
+
+static int read_sysctl(const char *path, char *buf, size_t len)
+{
+ FILE *f = fopen(path, "r");
+
+ if (!f)
+ return -1;
+ if (!fgets(buf, len, f)) {
+ fclose(f);
+ return -1;
+ }
+ fclose(f);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int duration = 5;
+ int nr_threads = 4;
+ char orig_fraction[32] = "";
+ const char *sysctl_path = "/proc/sys/vm/percpu_pagelist_high_fraction";
+ pthread_t *threads;
+ struct thread_stats *stats;
+ unsigned long total_iter = 0, total_pages = 0;
+ int i;
+
+ if (argc > 1)
+ duration = atoi(argv[1]);
+ if (argc > 2)
+ nr_threads = atoi(argv[2]);
+
+ if (duration <= 0 || nr_threads <= 0) {
+ fprintf(stderr, "Usage: %s [duration_sec] [nr_threads]\n", argv[0]);
+ return 1;
+ }
+
+ printf("zone_lock_contention: %d threads, %d seconds\n",
+ nr_threads, duration);
+
+ /* Shrink PCP lists to force more zone->lock acquisitions */
+ read_sysctl(sysctl_path, orig_fraction, sizeof(orig_fraction));
+ if (write_sysctl(sysctl_path, "100") < 0)
+ fprintf(stderr, "WARNING: cannot write %s (not root?)\n",
+ sysctl_path);
+ else
+ printf("Set percpu_pagelist_high_fraction=100 (was %s)\n",
+ orig_fraction);
+
+ threads = calloc(nr_threads, sizeof(*threads));
+ stats = calloc(nr_threads, sizeof(*stats));
+ if (!threads || !stats) {
+ perror("calloc");
+ return 1;
+ }
+
+ for (i = 0; i < nr_threads; i++) {
+ if (pthread_create(&threads[i], NULL, churn_thread, &stats[i])) {
+ perror("pthread_create");
+ return 1;
+ }
+ }
+
+ sleep(duration);
+ stop = 1;
+
+ for (i = 0; i < nr_threads; i++) {
+ pthread_join(threads[i], NULL);
+ total_iter += stats[i].iterations;
+ total_pages += stats[i].pages_touched;
+ }
+
+ printf("Total: %lu iterations, %lu pages (%lu MB) churned\n",
+ total_iter, total_pages,
+ (total_pages * PAGE_SZ) / (1024 * 1024));
+
+ /* Restore original sysctl */
+ if (orig_fraction[0]) {
+ /* Strip trailing newline */
+ orig_fraction[strcspn(orig_fraction, "\n")] = '\0';
+ write_sysctl(sysctl_path, orig_fraction);
+ printf("Restored percpu_pagelist_high_fraction=%s\n",
+ orig_fraction);
+ }
+
+ free(threads);
+ free(stats);
+ return 0;
+}
--
2.43.0
* [PATCH 1/2] mm/page_alloc: add tracepoints for zone->lock acquisitions
From: hawk @ 2026-05-08 16:22 UTC
To: Andrew Morton, linux-mm
Cc: Vlastimil Babka, Steven Rostedt, Suren Baghdasaryan, Michal Hocko,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Shuah Khan,
linux-kernel, linux-trace-kernel, kernel-team, hawk
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add tracepoints to the page allocator fast paths that acquire
zone->lock, allowing diagnosis of lock contention in production.

Three tracepoints are introduced:

  kmem:mm_zone_lock_contended - fires when trylock fails (lock is held)
  kmem:mm_zone_locked         - fires on every acquisition
  kmem:mm_zone_lock_unlock    - fires on every release

Each event records the NUMA node, zone name, batch count, and caller.
The mm_zone_locked event additionally records wait_ns: the time spent
spinning when contended, measured via local_clock() with IRQs disabled
to ensure accurate same-CPU timestamps.

The lock/unlock paths are wrapped in __zone_lock()/__zone_unlock()
helpers that use trylock-first to separate the contended and
uncontended cases. Only the fast paths (free_pcppages_bulk,
rmqueue_bulk, free_one_page) are covered. Other zone->lock holders
such as compaction, page isolation, and memory hotplug are not
instrumented.

For minimum overhead in production, enable only mm_zone_lock_contended
which fires only on actual contention. Enable mm_zone_locked for
wait-time analysis, and add mm_zone_lock_unlock for hold-time
measurement.
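
The tiered advice above maps onto the per-event "enable" files in
tracefs. A minimal sketch, assuming tracefs is mounted at
/sys/kernel/tracing and the events appear under events/kmem; the helper
name zone_lock_trace_level and its level names are hypothetical, and
making the levels cumulative is a choice of this sketch, not something
the patch requires.

```shell
#!/bin/sh
# Hypothetical helper: choose how many zone->lock events to enable.
#   contended - mm_zone_lock_contended only (lowest overhead)
#   wait      - also mm_zone_locked (wait_ns on each acquisition)
#   hold      - also mm_zone_lock_unlock (allows hold-time measurement)
#   off       - disable all three
# $2 is the tracefs mount point (assumed default: /sys/kernel/tracing).
zone_lock_trace_level() {
	level="$1"
	root="${2:-/sys/kernel/tracing}"
	dir="$root/events/kmem"
	c=1 l=0 u=0
	case "$level" in
	contended) ;;
	wait) l=1 ;;
	hold) l=1; u=1 ;;
	off) c=0 ;;
	*) echo "unknown level: $level" >&2; return 1 ;;
	esac
	echo "$c" > "$dir/mm_zone_lock_contended/enable" &&
	echo "$l" > "$dir/mm_zone_locked/enable" &&
	echo "$u" > "$dir/mm_zone_lock_unlock/enable"
}

# Dry run against a scratch directory, so the sketch can be tried
# without root and without a kernel that carries this patch.
scratch=$(mktemp -d)
for ev in mm_zone_lock_contended mm_zone_locked mm_zone_lock_unlock; do
	mkdir -p "$scratch/events/kmem/$ev"
done
zone_lock_trace_level wait "$scratch"
grep . "$scratch"/events/kmem/*/enable
rm -rf "$scratch"
```

On a patched kernel, running "zone_lock_trace_level contended" as root
would correspond to the minimum-overhead setup described above.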
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
include/trace/events/kmem.h | 101 ++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 50 +++++++++++++++---
2 files changed, 145 insertions(+), 6 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..870c68c70d57 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -458,6 +458,107 @@ TRACE_EVENT(rss_stat,
__print_symbolic(__entry->member, TRACE_MM_PAGES),
__entry->size)
);
+
+/*
+ * Tracepoints for zone->lock on the page allocator fast paths only.
+ * Other code paths that acquire zone->lock (compaction, isolation,
+ * memory hotplug, vmstat, etc.) are not covered here.
+ *
+ * Three events:
+ * mm_zone_lock_contended - trylock failed, about to spin
+ * mm_zone_locked - lock acquired, includes wait_ns when
+ * contended (zero otherwise)
+ * mm_zone_lock_unlock - lock released
+ *
+ * For production use with minimum overhead, enable only
+ * mm_zone_lock_contended -- it fires only when trylock detects the
+ * lock is already held.
+ *
+ * For wait-time analysis, enable mm_zone_locked -- its wait_ns
+ * field gives the spin duration directly. Adding unlock allows
+ * hold-time measurement, at the cost of one event per acquisition.
+ */
+TRACE_EVENT(mm_zone_lock_contended,
+
+ TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+ TP_ARGS(zone, count, caller),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( unsigned long, caller )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->caller = caller;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+ __entry->node_id, __get_str(name),
+ __entry->count, (void *)__entry->caller)
+);
+
+TRACE_EVENT(mm_zone_locked,
+
+ TP_PROTO(struct zone *zone, int count, bool contended,
+ unsigned long caller, u64 wait_ns),
+
+ TP_ARGS(zone, count, contended, caller, wait_ns),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( bool, contended )
+ __field( unsigned long, caller )
+ __field( u64, wait_ns )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->contended = contended;
+ __entry->caller = caller;
+ __entry->wait_ns = wait_ns;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d contended=%d caller=%pS wait=%llu ns",
+ __entry->node_id, __get_str(name),
+ __entry->count, __entry->contended,
+ (void *)__entry->caller, __entry->wait_ns)
+);
+
+TRACE_EVENT(mm_zone_lock_unlock,
+
+ TP_PROTO(struct zone *zone, int count, unsigned long caller),
+
+ TP_ARGS(zone, count, caller),
+
+ TP_STRUCT__entry(
+ __field( int, node_id )
+ __string( name, zone->name )
+ __field( int, count )
+ __field( unsigned long, caller )
+ ),
+
+ TP_fast_assign(
+ __entry->node_id = zone_to_nid(zone);
+ __assign_str(name);
+ __entry->count = count;
+ __entry->caller = caller;
+ ),
+
+ TP_printk("node=%d zone=%-8s count=%-5d caller=%pS",
+ __entry->node_id, __get_str(name),
+ __entry->count, (void *)__entry->caller)
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..08018e9beab4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -19,6 +19,7 @@
#include <linux/highmem.h>
#include <linux/interrupt.h>
#include <linux/jiffies.h>
+#include <linux/sched/clock.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/kasan.h>
@@ -1447,6 +1448,43 @@ bool free_pages_prepare(struct page *page, unsigned int order)
return __free_pages_prepare(page, order, FPI_NONE);
}
+/*
+ * Helper functions for locking zone->lock with tracepoints.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The @count parameter indicates the number
+ * of pages being freed or allocated in the batch operation.
+ *
+ * For minimum overhead attach to kmem:mm_zone_lock_contended, which
+ * only gets activated when trylock detects lock is contended.
+ */
+static inline void
+__zone_lock(struct zone *zone, int count, unsigned long *flags)
+ __acquires(&zone->lock)
+{
+ unsigned long caller = _RET_IP_;
+ u64 wait_start, wait_time = 0;
+ bool contended;
+
+ local_irq_save(*flags);
+ contended = !spin_trylock(&zone->lock);
+ if (contended) {
+ wait_start = local_clock();
+ trace_mm_zone_lock_contended(zone, count, caller);
+ spin_lock(&zone->lock);
+ wait_time = local_clock() - wait_start;
+ }
+ trace_mm_zone_locked(zone, count, contended, caller, wait_time);
+}
+
+static inline void
+__zone_unlock(struct zone *zone, int count, unsigned long *flags)
+ __releases(&zone->lock)
+{
+ trace_mm_zone_lock_unlock(zone, count, _RET_IP_);
+ spin_unlock_irqrestore(&zone->lock, *flags);
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone.
@@ -1469,7 +1507,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Ensure requested pindex is drained first. */
pindex = pindex - 1;
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, count, &flags);
while (count > 0) {
struct list_head *list;
@@ -1502,7 +1540,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (count > 0 && !list_empty(list));
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, count, &flags);
}
/* Split a multi-block free page into its individual pageblocks. */
@@ -1551,7 +1589,7 @@ static void free_one_page(struct zone *zone, struct page *page,
return;
}
} else {
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, 1 << order, &flags);
}
/* The lock succeeded. Process deferred pages. */
@@ -1569,7 +1607,7 @@ static void free_one_page(struct zone *zone, struct page *page,
}
}
split_large_buddy(zone, page, pfn, order, fpi_flags);
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, 1 << order, &flags);
__count_vm_events(PGFREE, 1 << order);
}
@@ -2525,7 +2563,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (!spin_trylock_irqsave(&zone->lock, flags))
return 0;
} else {
- spin_lock_irqsave(&zone->lock, flags);
+ __zone_lock(zone, count, &flags);
}
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2545,7 +2583,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
list_add_tail(&page->pcp_list, list);
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ __zone_unlock(zone, i, &flags);
return i;
}
--
2.43.0
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Mark Brown @ 2026-05-08 14:01 UTC
To: Praveen Talari
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
Mukesh Kumar Savaliya, Aniket Randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <59e36f20-891d-4a58-8cc4-6822d03daa23@oss.qualcomm.com>
On Thu, May 07, 2026 at 11:03:39PM +0530, Praveen Talari wrote:
> On 07-05-2026 13:43, Mark Brown wrote:
> > By generic I mean this should not be driver specific at all.
> I hope these changes are fine. Please let me know if you have any concerns
> or feedback.
The data tracepoints look plausible but I would expect them to be
generated by the core, they'll be there for everything so I'd expect
them to work for everything.
* Re: [PATCH 2/3] init: use static buffers for bootconfig extra command line
From: Breno Leitao @ 2026-05-08 13:59 UTC
To: Masami Hiramatsu
Cc: Andrew Morton, oss, paulmck, linux-trace-kernel, linux-kernel,
kernel-team
In-Reply-To: <20260429172721.c89072381aa98d1090ad383f@kernel.org>
Hello Masami,
On Wed, Apr 29, 2026 at 05:27:21PM +0900, Masami Hiramatsu wrote:
> On Fri, 17 Apr 2026 08:38:16 -0700
> Breno Leitao <leitao@debian.org> wrote:
> > On Fri, Apr 17, 2026 at 10:44:36AM +0900, Masami Hiramatsu wrote:
> > > On Wed, 15 Apr 2026 03:51:11 -0700
> > > Breno Leitao <leitao@debian.org> wrote:
> > >
> > > But if we can do it, should we continue using bootconfig? I mean
> > > it is easy to make a tool (or add a feature in tools/bootconfig)
> > > which converts bootconfig file to command line string and embeds
> > > it in the kernel. Hmm.
> >
> > Sure, you are talking about a tool that embeds it in the kernel binary,
> > something like:
> >
> >
> > 0) Get a kernel and define CONFIG_BOOT_CONFIG_EMBED_FILE=".bootconfig"
> >
> > 1) Add an option in tools/bootconfig to convert bootconfig (.bootconfig)
> > to a cmdline string ($ bootconfig -C kernel .bootconfig).
> > Something like:
> > # tools/bootconfig/bootconfig -C kernel .bootconfig
> > mem=2G loglevel=7 debug nokaslr %
> >
> > 2) At kernel build time, run that tool on .bootconfig and embed the
> > resulting string into the kernel image as a .init.rodata symbol
> > (embedded_kernel_cmdline[]).
> >
> > # gdb -batch -ex 'x/s &embedded_kernel_cmdline' vmlinux
> > 0xffffffff87e108f8: "mem=2G loglevel=7 debug nokaslr "
> Yeah, I think this looks good to me.
Thank you for the feedback. I've begun working on the bootconfig patches
following the approach outlined in Step 1 above. Note that I've
simplified the -C option by removing the "kernel" argument mentioned in
the earlier example.
The patch series is available here:
https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-0-1132219aa773@debian.org/
I appreciate your continued support.
--breno
* [PATCH 2/2] tools/bootconfig: render kernel.* subtree as cmdline string with -C
From: Breno Leitao @ 2026-05-08 13:55 UTC
To: Masami Hiramatsu, Andrew Morton
Cc: linux-kernel, linux-trace-kernel, paulmck, oss, Breno Leitao,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-0-1132219aa773@debian.org>
Add a -C option that finds the "kernel" subtree of a bootconfig file
and prints it as a flat, space-separated cmdline string by calling the
shared xbc_snprint_cmdline() renderer. An empty or absent kernel.*
subtree produces empty output and exits successfully.

This lets the kernel build embed a bootconfig file as a plain cmdline
string at build time, so embedded bootconfig values can reach
parse_early_param() during architecture setup without parsing the
bootconfig at runtime.

The renderer is intentionally limited to the kernel.* subtree: that is
the only thing the kernel build needs to embed; init.* and other
subtrees keep going through the runtime parser.

Example of this new mode:

  # cat /tmp/test.bconf
  kernel {
      foo = bar
      baz = "hello world"
      arr = 1, 2
  }
  init.foo = nope

  # ./tools/bootconfig/bootconfig -C /tmp/test.bconf
  foo=bar baz="hello world" arr=1 arr=2 %
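
The flattening visible in the example can be mimicked in a few lines of
shell. This is only an illustration of the output format, inferred from
the example above: the real logic lives in xbc_snprint_cmdline(), the
helper name render_kv is hypothetical, and both the quoting rule (quote
a value that contains a space) and the handling of a key without a
value are assumptions.

```shell
#!/bin/sh
# Hypothetical illustration, not the tool's implementation.
# Usage: render_kv <key> [value...]
render_kv() {
	key="$1"
	shift
	if [ "$#" -eq 0 ]; then
		# Assumption: a key without a value is emitted bare.
		printf '%s ' "$key"
		return
	fi
	# An array repeats the key once per element.
	for val in "$@"; do
		case "$val" in
		# Assumption: a value containing a space is double-quoted.
		*" "*) printf '%s="%s" ' "$key" "$val" ;;
		*) printf '%s=%s ' "$key" "$val" ;;
		esac
	done
}

# The kernel { ... } block from the example; init.foo is left out
# because only the kernel.* subtree is rendered.
render_kv foo bar
render_kv baz "hello world"
render_kv arr 1 2
echo
```

Every pair is followed by a space, so the rendered line ends with one,
matching the output shown in the example.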
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/bootconfig/main.c | 60 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 52 insertions(+), 8 deletions(-)
diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
index 643f707b8f1da..e1bfab044fbcb 100644
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -286,7 +286,41 @@ static int init_xbc_with_error(char *buf, int len)
return ret;
}
-static int show_xbc(const char *path, bool list)
+static int show_xbc_kernel_cmdline(void)
+{
+ struct xbc_node *root;
+ char *buf = NULL;
+ int len, ret;
+
+ root = xbc_find_node("kernel");
+ if (!root)
+ return 0; /* no kernel.* keys: emit empty output */
+
+ len = xbc_snprint_cmdline(NULL, 0, root);
+ if (len < 0) {
+ pr_err("Failed to size cmdline output: %d\n", len);
+ return len;
+ }
+ if (len == 0)
+ return 0;
+
+ buf = malloc(len + 1);
+ if (!buf)
+ return -ENOMEM;
+
+ ret = xbc_snprint_cmdline(buf, len + 1, root);
+ if (ret < 0) {
+ pr_err("Failed to render cmdline output: %d\n", ret);
+ free(buf);
+ return ret;
+ }
+
+ fputs(buf, stdout);
+ free(buf);
+ return 0;
+}
+
+static int show_xbc(const char *path, bool list, bool render_cmdline)
{
int ret, fd;
char *buf = NULL;
@@ -322,11 +356,14 @@ static int show_xbc(const char *path, bool list)
if (init_xbc_with_error(buf, ret) < 0)
goto out;
}
- if (list)
+ if (render_cmdline)
+ ret = show_xbc_kernel_cmdline();
+ else if (list)
xbc_show_list();
else
xbc_show_compact_tree();
- ret = 0;
+ if (ret > 0)
+ ret = 0;
out:
free(buf);
@@ -486,7 +523,10 @@ static int usage(void)
" Options:\n"
" -a <config>: Apply boot config to initrd\n"
" -d : Delete boot config file from initrd\n"
- " -l : list boot config in initrd or file\n\n"
+ " -l : list boot config in initrd or file\n"
+ " -C : render the kernel.* subtree as a flat cmdline\n"
+ " string (suitable for embedding in a kernel image)\n"
+ " and print it to stdout\n\n"
" If no option is given, show the bootconfig in the given file.\n");
return -1;
}
@@ -495,10 +535,11 @@ int main(int argc, char **argv)
{
char *path = NULL;
char *apply = NULL;
+ bool render_cmdline = false;
bool delete = false, list = false;
int opt;
- while ((opt = getopt(argc, argv, "hda:l")) != -1) {
+ while ((opt = getopt(argc, argv, "hda:lC")) != -1) {
switch (opt) {
case 'd':
delete = true;
@@ -509,14 +550,17 @@ int main(int argc, char **argv)
case 'l':
list = true;
break;
+ case 'C':
+ render_cmdline = true;
+ break;
case 'h':
default:
return usage();
}
}
- if ((apply && delete) || (delete && list) || (apply && list)) {
- pr_err("Error: You can give one of -a, -d or -l at once.\n");
+ if ((!!apply + !!delete + !!list + !!render_cmdline) > 1) {
+ pr_err("Error: You can give one of -a, -d, -l or -C at once.\n");
return usage();
}
@@ -532,5 +576,5 @@ int main(int argc, char **argv)
else if (delete)
return delete_xbc(path);
- return show_xbc(path, list);
+ return show_xbc(path, list, render_cmdline);
}
--
2.53.0-Meta
* [PATCH 1/2] bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
From: Breno Leitao @ 2026-05-08 13:55 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton
Cc: linux-kernel, linux-trace-kernel, paulmck, oss, Breno Leitao,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-0-1132219aa773@debian.org>
Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
function (and its xbc_namebuf scratch buffer) becomes part of the shared
parser library. tools/bootconfig already compiles lib/bootconfig.c
directly, which lets a follow-up patch reuse the same renderer in the
userspace tool to convert a bootconfig file into a flat cmdline string
at build time.
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/bootconfig.h | 3 +++
init/main.c | 45 -------------------------------------
lib/bootconfig.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 59 insertions(+), 45 deletions(-)
diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 692a5acc2ffc4..1c7f3b74ffcf3 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -265,6 +265,9 @@ static inline struct xbc_node * __init xbc_node_get_subkey(struct xbc_node *node
int __init xbc_node_compose_key_after(struct xbc_node *root,
struct xbc_node *node, char *buf, size_t size);
+/* Render key/value pairs under @root as a flat cmdline string */
+int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root);
+
/**
* xbc_node_compose_key() - Compose full key string of the XBC node
* @node: An XBC node.
diff --git a/init/main.c b/init/main.c
index 96f93bb06c490..e363232b428b4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -324,51 +324,6 @@ static void * __init get_boot_config_from_initrd(size_t *_size)
#ifdef CONFIG_BOOT_CONFIG
-static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
-
-#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
-
-static int __init xbc_snprint_cmdline(char *buf, size_t size,
- struct xbc_node *root)
-{
- struct xbc_node *knode, *vnode;
- char *end = buf + size;
- const char *val, *q;
- int ret;
-
- xbc_node_for_each_key_value(root, knode, val) {
- ret = xbc_node_compose_key_after(root, knode,
- xbc_namebuf, XBC_KEYLEN_MAX);
- if (ret < 0)
- return ret;
-
- vnode = xbc_node_get_child(knode);
- if (!vnode) {
- ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
- if (ret < 0)
- return ret;
- buf += ret;
- continue;
- }
- xbc_array_for_each_value(vnode, val) {
- /*
- * For prettier and more readable /proc/cmdline, only
- * quote the value when necessary, i.e. when it contains
- * whitespace.
- */
- q = strpbrk(val, " \t\r\n") ? "\"" : "";
- ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
- xbc_namebuf, q, val, q);
- if (ret < 0)
- return ret;
- buf += ret;
- }
- }
-
- return buf - (end - size);
-}
-#undef rest
-
/* Make an extra command line under given key word */
static char * __init xbc_make_cmdline(const char *key)
{
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index c470b93d5dbc2..f445b7703fdd9 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -408,6 +408,62 @@ const char * __init xbc_node_find_next_key_value(struct xbc_node *root,
return ""; /* No value key */
}
+static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
+
+#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
+
+/**
+ * xbc_snprint_cmdline() - Render bootconfig keys under @root as a cmdline string
+ * @buf: Destination buffer (may be NULL when @size is 0 to query the length)
+ * @size: Size of @buf in bytes
+ * @root: Subtree root whose key=value pairs should be rendered
+ *
+ * Walk all key/value pairs under @root and emit them as a space-separated
+ * cmdline string into @buf. Values containing whitespace are quoted with
+ * double quotes. Returns the number of bytes that would be written if @buf
+ * were large enough (matching snprintf semantics), or a negative errno on
+ * failure.
+ */
+int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
+{
+ struct xbc_node *knode, *vnode;
+ char *end = buf + size;
+ const char *val, *q;
+ int ret;
+
+ xbc_node_for_each_key_value(root, knode, val) {
+ ret = xbc_node_compose_key_after(root, knode,
+ xbc_namebuf, XBC_KEYLEN_MAX);
+ if (ret < 0)
+ return ret;
+
+ vnode = xbc_node_get_child(knode);
+ if (!vnode) {
+ ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
+ if (ret < 0)
+ return ret;
+ buf += ret;
+ continue;
+ }
+ xbc_array_for_each_value(vnode, val) {
+ /*
+ * For prettier and more readable /proc/cmdline, only
+ * quote the value when necessary, i.e. when it contains
+ * whitespace.
+ */
+ q = strpbrk(val, " \t\r\n") ? "\"" : "";
+ ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
+ xbc_namebuf, q, val, q);
+ if (ret < 0)
+ return ret;
+ buf += ret;
+ }
+ }
+
+ return buf - (end - size);
+}
+#undef rest
+
/* XBC parse and tree build */
static int __init xbc_init_node(struct xbc_node *node, char *data, uint16_t flag)
--
2.53.0-Meta
* [PATCH 0/2] tools/bootconfig: render kernel.* subtree as a cmdline string
From: Breno Leitao @ 2026-05-08 13:55 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton
Cc: linux-kernel, linux-trace-kernel, paulmck, oss, Breno Leitao,
kernel-team
Add a bootconfig -> kernel cmdline rendering capability shared between
the kernel parser library and the userspace tools/bootconfig binary.
The new userspace mode "tools/bootconfig -C <file>" walks a bootconfig
file's "kernel" subtree and prints it as a flat, space-separated
cmdline string suitable for direct use as (or appending to) a kernel
command line.
This series prepares tools/bootconfig and lib/bootconfig.c for an
upcoming feature that lets the kernel build render an embedded
bootconfig file's "kernel" subtree to a flat cmdline string and embed
it in the kernel image.
The follow-up series (sent separately) wires this into setup_arch() so
early_param() handlers see values supplied via CONFIG_BOOT_CONFIG_EMBED_FILE,
following Masami's suggestion in [2].
These two patches are pure groundwork. They add no kernel feature,
change no runtime behavior, and are useful on their own (the new
"tools/bootconfig -C" mode lets anyone render a .bootconfig file to
a cmdline string from the shell).
Landing them independently lets the follow-up series focus on the
kernel-side plumbing without dragging the refactor and tool addition
through the same review cycle.
Patch 1 lifts xbc_snprint_cmdline() from init/main.c into
lib/bootconfig.c so both the kernel runtime path
(xbc_make_cmdline -> extra_command_line) and the userspace tool can
share a single renderer.
- tools/bootconfig already compiles lib/bootconfig.c directly, so no
new shared-code mechanism is introduced.
Patch 2 adds a -C option to tools/bootconfig that walks the "kernel"
subtree of a bootconfig file and prints it as a flat, space-separated
cmdline string. Missing or empty kernel.* produces empty output and
exits 0.
- This is the renderer the kernel build will invoke.
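
The size-then-render calling convention that both patches rely on can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct kv, render_cmdline() and render_cmdline_alloc() are hypothetical names, and a flat array stands in for the xbc_node tree walk. It also tracks an offset instead of the kernel's pointer-based rest() macro, so that the NULL/0 sizing pass involves no pointer arithmetic on NULL.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flat key/value pair; the real code walks struct xbc_node. */
struct kv {
	const char *key;
	const char *val;	/* NULL means "key without a value" */
};

/*
 * Render all pairs into buf (at most size bytes, NUL included) and return
 * the length the full output would have, like snprintf() does.
 * buf may be NULL when size is 0 to query the length only.
 */
static int render_cmdline(char *buf, size_t size, const struct kv *kv, size_t n)
{
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		/* Once the buffer is full, keep counting but stop writing. */
		char *dst = off < size ? buf + off : NULL;
		size_t room = off < size ? size - off : 0;
		int ret;

		if (!kv[i].val) {
			ret = snprintf(dst, room, "%s ", kv[i].key);
		} else {
			/* Quote only values that contain whitespace. */
			const char *q = strpbrk(kv[i].val, " \t\r\n") ? "\"" : "";

			ret = snprintf(dst, room, "%s=%s%s%s ",
				       kv[i].key, q, kv[i].val, q);
		}
		if (ret < 0)
			return ret;
		off += (size_t)ret;
	}
	return (int)off;
}

/* Two passes: query the length, allocate, then render for real. */
static char *render_cmdline_alloc(const struct kv *kv, size_t n)
{
	int len = render_cmdline(NULL, 0, kv, n);
	char *buf;

	if (len < 0)
		return NULL;
	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;
	buf[0] = '\0';		/* stays valid even when nothing is rendered */
	if (render_cmdline(buf, (size_t)len + 1, kv, n) != len) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

A caller would pass an array such as { "console", "ttyS0" }, { "quiet", NULL } to render_cmdline_alloc() and free() the result. As in the renderer itself, every entry is followed by a single space, so the output ends with a trailing space.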
Once this lands, the follow-up patches will use it in the following way:
1) Render at build time.
The kernel build invokes the userspace bootconfig tool, using the -C mode added
by this prep series, to convert the embedded bootconfig file into a flat kernel
cmdline string.
2) Bake the string into the kernel image.
A small assembly stub embeds the rendered file into the kernel's discardable
read-only init data, bracketed by two markers so the runtime can locate it.
3) Add a runtime helper to consume it.
A new helper in the shared bootconfig source, sitting next to the renderer that
this prep series moved there, prepends the embedded blob to a cmdline buffer,
panicking rather than truncating if it overflows. The public header declares it
with a no-op stub when the feature is off.
4) Plumb it at architecture early setup.
The arch's early setup calls the helper after the existing builtin-cmdline
merge but before early-param parsing, so values from the embedded bootconfig
influence early-param handlers (console, log level, memory overrides) right
when architecture setup runs — not later in
Background: the v1 attempt at this feature moved bootconfig parsing
before setup_arch() with ~96KB of static __initdata buffers [1].
Masami suggested doing the conversion at build time instead [2], which
avoids the early-boot allocator dance and the start_kernel() reordering.
This series, plus the follow-up, aims to implement that approach.
[1] https://lore.kernel.org/all/20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org/
[2] https://lore.kernel.org/all/20260417104436.ece29fd5e2cb7a59c8cf8ac1@kernel.org/
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
tools/bootconfig: render kernel.* subtree as cmdline string with -C
include/linux/bootconfig.h | 3 +++
init/main.c | 45 ----------------------------------
lib/bootconfig.c | 56 +++++++++++++++++++++++++++++++++++++++++++
tools/bootconfig/main.c | 60 +++++++++++++++++++++++++++++++++++++++-------
4 files changed, 111 insertions(+), 53 deletions(-)
---
base-commit: 17c7841d09ee7d33557fd075562d9289b6018c90
change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
Best regards,
--
Breno Leitao <leitao@debian.org>
* [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Chen Jun @ 2026-05-08 12:26 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
Cc: chenjun102
Low-level functions have many call paths, and sometimes
we only care about the calls on a specific call path.
Add a new filter to filter based on the call stack.
Usage:
1. echo 'caller=="$function_name"' > events/../filter
Only OP_EQ and OP_NE are supported.
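
The matching rule behind this filter can be sketched in userspace C as follows. It is a rough illustration, not the kernel code: caller_matches() is a hypothetical name, the entries array stands in for what stack_trace_save() would return, and [start, end) stands in for the function bounds looked up through kallsyms.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the predicate: report whether any saved
 * return address falls inside [start, end), then flip the result when
 * the filter was written with "!=" (negate == true).
 */
static bool caller_matches(const unsigned long *entries, size_t nr_entries,
			   unsigned long start, unsigned long end, bool negate)
{
	for (size_t i = 0; i < nr_entries; i++)
		if (start <= entries[i] && entries[i] < end)
			return !negate;

	return negate;
}
```

With this shape, "==" passes an event when the named function appears anywhere in the saved stack, and "!=" passes it only when the function appears nowhere in the saved stack.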
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
include/linux/trace_events.h | 1 +
kernel/trace/trace.h | 3 ++-
kernel/trace/trace_events.c | 1 +
kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
4 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 40a43a4c7caf..1f109669a391 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -851,6 +851,7 @@ enum {
FILTER_COMM,
FILTER_CPU,
FILTER_STACKTRACE,
+ FILTER_CALLER,
};
extern int trace_event_raw_init(struct trace_event_call *call);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..4e4b92ce264f 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
field->filter_type == FILTER_RDYN_STRING ||
field->filter_type == FILTER_STATIC_STRING ||
field->filter_type == FILTER_PTR_STRING ||
- field->filter_type == FILTER_COMM;
+ field->filter_type == FILTER_COMM ||
+ field->filter_type == FILTER_CALLER;
}
static inline bool is_function_field(struct ftrace_event_field *field)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..6d220d7eec73 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
__generic_field(char *, comm, FILTER_COMM);
__generic_field(char *, stacktrace, FILTER_STACKTRACE);
__generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
+ __generic_field(char *, caller, FILTER_CALLER);
return ret;
}
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 609325f57942..1cf040065abe 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -72,6 +72,7 @@ enum filter_pred_fn {
FILTER_PRED_FN_CPUMASK,
FILTER_PRED_FN_CPUMASK_CPU,
FILTER_PRED_FN_FUNCTION,
+ FILTER_PRED_FN_CALLER,
FILTER_PRED_FN_,
FILTER_PRED_TEST_VISITED,
};
@@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
return pred->op == OP_EQ ? ret : !ret;
}
+/* Filter predicate for caller. */
+static int filter_pred_caller(struct filter_pred *pred, void *event)
+{
+ unsigned long entries[32];
+ unsigned int nr_entries;
+ int i;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+ for (i = 0; i < nr_entries ; i++)
+ if (pred->val <= entries[i] && entries[i] < pred->val2)
+ return !pred->not;
+
+ return pred->not;
+}
+
/*
* regex_match_foo - Basic regex callbacks
*
@@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
return filter_pred_cpumask_cpu(pred, event);
case FILTER_PRED_FN_FUNCTION:
return filter_pred_function(pred, event);
+ case FILTER_PRED_FN_CALLER:
+ return filter_pred_caller(pred, event);
case FILTER_PRED_TEST_VISITED:
return test_pred_visited_fn(pred, event);
default:
@@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
} else if (field->filter_type == FILTER_DYN_STRING) {
pred->fn_num = FILTER_PRED_FN_STRLOC;
- } else if (field->filter_type == FILTER_RDYN_STRING)
+ } else if (field->filter_type == FILTER_RDYN_STRING) {
pred->fn_num = FILTER_PRED_FN_STRRELLOC;
- else {
+ } else if (field->filter_type == FILTER_CALLER) {
+ unsigned long caller;
+
+ if (op == OP_GLOB)
+ goto err_free;
+ pred->fn_num = FILTER_PRED_FN_CALLER;
+ caller = kallsyms_lookup_name(pred->regex->pattern);
+ if (!caller) {
+ parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
+ goto err_free;
+ }
+ /* Now find the function start and end address */
+ if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
+ parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
+ goto err_free;
+ }
+ pred->val = caller - offset;
+ pred->val2 = pred->val + size;
+ } else {
if (!ustring_per_cpu) {
/* Once allocated, keep it around for good */
ustring_per_cpu = alloc_percpu(struct ustring_buffer);
--
2.22.0
* Re: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Lance Yang @ 2026-05-08 11:11 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-4-npache@redhat.com>
On Sun, Apr 19, 2026 at 12:57:40PM -0600, Nico Pache wrote:
>The following cleanup reworks all the max_ptes_* handling into helper
>functions. This increases the code readability and will later be used to
>implement the mTHP handling of these variables.
>
>With these changes we abstract all the madvise_collapse() special casing
>(dont respect the sysctls) away from the functions that utilize them. And
>will later in this series to cleanly restrict mTHP collapses behaviors.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
Nice. It should all be an equivalence-preserving refactor.
With David's suggested changes:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v14 12/19] unwind_user/sframe: Add .sframe validation option
From: Jens Remus @ 2026-05-08 10:51 UTC (permalink / raw)
To: Steven Rostedt, Josh Poimboeuf, Indu Bhagat, Dylan Hatch
Cc: bpf, linux-kernel, linux-mm, linux-trace-kernel, x86,
Namhyung Kim, Andrii Nakryiko, Jose E. Marchesi, Beau Belgrave,
Florian Weimer, Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google), Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
In-Reply-To: <20260505121718.3572346-13-jremus@linux.ibm.com>
On 5/5/2026 2:17 PM, Jens Remus wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Add a debug feature to validate all .sframe sections when first loading
> the file rather than on demand.
>
> [ Jens Remus: Add support for SFrame V3. Add support for PC-relative
> FDE function start offset. Adjust to rename of struct sframe_fre to
> sframe_fre_internal. Use %#x/%#lx format specifiers. ]
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> +static int safe_read_fde(struct sframe_section *sec,
> + unsigned int fde_num, struct sframe_fde_internal *fde)
> +{
> + int ret;
> +
> + if (!user_read_access_begin((void __user *)sec->sframe_start,
> + sec->sframe_end - sec->sframe_start))
> + return -EFAULT;
> + ret = __read_fde(sec, fde_num, fde);
> + user_read_access_end();
> + return ret;
> +}
> +static int sframe_validate_section(struct sframe_section *sec)
> +{
> + unsigned long prev_ip = 0;
> + unsigned int i;
> +
> + for (i = 0; i < sec->num_fdes; i++) {
> + struct sframe_fre_internal *fre, *prev_fre = NULL;
> + unsigned long ip, fre_addr;
> + struct sframe_fde_internal fde;
> + struct sframe_fre_internal fres[2];
> + bool which = false;
> + unsigned int j;
> + int ret;
> +
> + ret = safe_read_fde(sec, i, &fde);
Iterating over all FDEs may cause __read_fde() and thus safe_read_fde()
to fail if one sframe section covers multiple text sections (regardless
of whether it is also registered for multiple text sections), as
__read_fde() checks whether the read FDE function start address is
within [sec->text_start, sec->text_end[.
See my related comments in my reply to [PATCH v14 05/19] unwind_user/
sframe: Add support for reading .sframe contents.
> + if (ret) {
> + dbg_sec("safe_read_fde(%u) failed\n", i);
> + return ret;
> + }
> +
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v14 05/19] unwind_user/sframe: Add support for reading .sframe contents
From: Jens Remus @ 2026-05-08 10:50 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Josh Poimboeuf, Indu Bhagat,
Dylan Hatch
Cc: bpf, linux-mm, linux-trace-kernel, x86, Namhyung Kim,
Andrii Nakryiko, Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google), Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
In-Reply-To: <20260505121718.3572346-6-jremus@linux.ibm.com>
On 5/5/2026 2:17 PM, Jens Remus wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> In preparation for using sframe to unwind user space stacks, add an
> sframe_find() interface for finding the sframe information associated
> with a given text address.
>
> For performance, use user_read_access_begin() and the corresponding
> unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
> regions would break noinstr validation, so there aren't any debug
> messages yet. That will be added in a subsequent commit.
>
> Link: https://lore.kernel.org/all/77c0d1ec143bf2a53d66c4ecb190e7e0a576fbfd.1737511963.git.jpoimboe@kernel.org/
> Link: https://lore.kernel.org/all/b35ca3a3-8de5-4d32-8d30-d4e562f6b0de@linux.ibm.com/
>
> [ Jens Remus: Add initial support for SFrame V3 (limited to regular
> FDEs). Add support for PC-relative FDE function start offset. Simplify
> logic by using an internal FDE representation. Rename struct sframe_fre
> to sframe_fre_internal to align with struct sframe_fde_internal.
> Cleanup includes. Fix checkpatch errors "spaces required around that
> ':'". ]
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> +static __always_inline int __read_fde(struct sframe_section *sec,
> + unsigned int fde_num,
> + struct sframe_fde_internal *fde)
> +{
> + unsigned long fde_addr, fda_addr, func_addr;
unsigned long fde_addr, fda_addr, func_start, func_end;
> + struct sframe_fde_v3 _fde;
> + struct sframe_fda_v3 _fda;
> +
> + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde_v3));
> + unsafe_copy_from_user(&_fde, (void __user *)fde_addr,
> + sizeof(struct sframe_fde_v3), Efault);
> +
> + func_addr = fde_addr + _fde.func_start_off;
> + if (func_addr < sec->text_start || func_addr >= sec->text_end)
> + return -EINVAL;
func_start = fde_addr + _fde.func_start_off;
func_end = func_start + _fde.func_size;
if (func_start < sec->text_start || func_end > sec->text_end)
return -EINVAL;
This would validate that the whole function described by the FDE is
within the text section and not only the function start.
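
As a standalone illustration of this containment check, here is a minimal userspace sketch. range_within() is a hypothetical helper, not part of the patch. It compares the size against the remaining room instead of computing start + size, on the assumption that the values come from untrusted user data and the sum could wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is [start, start + size) fully inside [lo, hi)? */
static bool range_within(uint64_t start, uint64_t size,
			 uint64_t lo, uint64_t hi)
{
	if (start < lo || start > hi)
		return false;

	/* hi - start cannot underflow here, and start + size is never formed. */
	return size <= hi - start;
}
```

Mapped onto the suggestion above, start would be func_start, size would be _fde.func_size, and lo/hi would be sec->text_start and sec->text_end.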
Note that, unrelated to the above change, this check in general causes
sframe_validate_section() to fail, if one sframe section covers more
than one text section (unrelated to whether it is actually registered
for multiple text sections), for instance in case of Dylan's sframe
kernel stacktracer on arm64. Should the check therefore be made
conditional on whether __read_fde() is called from __find_fde() or
sframe_validate_section()? Or shall we drop this check as it does not
provide that much benefit during normal stacktracing use:
- sframe_find() obtains the struct sframe_section *sec from the
mm->sframe_mt based on IP. So IP must be within sec->text_start and
sec->text_end.
- __find_fde() only returns a FDE, if the IP is within
[fde->func_addr, fde->func_addr + fde->func_size[.
Dropping the check would allow the function start/end to be outside the
text section.
> +
> + fda_addr = sec->fres_start + _fde.fres_off;
> + if (fda_addr + sizeof(struct sframe_fda_v3) > sec->fres_end)
> + return -EINVAL;
> + unsafe_copy_from_user(&_fda, (void __user *)fda_addr,
> + sizeof(struct sframe_fda_v3), Efault);
Can unsafe_copy_from_user() be used for an unaligned fda_addr, at least
on x86-64, s390 64-bit, and arm64?
Do the FDE type, FDE PC type, and FRE type values need to be validated
here as well?
unsigned char fde_type = SFRAME_V3_FDE_TYPE(_fda.info2);
unsigned char fde_pctype = SFRAME_V3_FDE_PCTYPE(_fda.info);
unsigned char fre_type = SFRAME_V3_FDE_FRE_TYPE(_fda.info);
The FDE type would get validated by __read_fre_datawords(), which is
called after __read_fde(), if the read FDE is the one of interest.
So that does not necessarily need to be checked here. Do you agree?
The FDE PC type is currently not checked for supported values anywhere.
That one would make sense to be checked here:
if (fde_pctype != SFRAME_FDE_PCTYPE_INC &&
fde_pctype != SFRAME_FDE_PCTYPE_MASK)
return -EINVAL;
The FRE type would get validated by __read_fre(), which is called
somewhere down the line after __read_fde(). So that does not
necessarily need to be checked here. Do you agree?
> +
> + fde->func_addr = func_addr;
fde->func_addr = func_start;
> + fde->func_size = _fde.func_size;
> + fde->fda_off = _fde.fres_off;
> + fde->fres_off = _fde.fres_off + sizeof(struct sframe_fda_v3);
> + fde->fres_num = _fda.fres_num;
> + fde->info = _fda.info;
> + fde->info2 = _fda.info2;
> + fde->rep_size = _fda.rep_size;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +static __always_inline int __read_fre(struct sframe_section *sec,
> + struct sframe_fde_internal *fde,
> + unsigned long fre_addr,
> + struct sframe_fre_internal *fre)
> +{
> + unsigned char fde_type = SFRAME_V3_FDE_TYPE(fde->info2);
> + unsigned char fde_pctype = SFRAME_V3_FDE_PCTYPE(fde->info);
> + unsigned char fre_type = SFRAME_V3_FDE_FRE_TYPE(fde->info);
> + unsigned char dataword_count, dataword_size;
> + s32 cfa_off, ra_off, fp_off;
> + unsigned long cur = fre_addr;
> + unsigned char addr_size;
> + u32 ip_off;
> + u8 info;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EFAULT;
> +
> + if (fre_addr + addr_size + 1 > sec->fres_end)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> + if (fde_pctype == SFRAME_FDE_PCTYPE_INC && ip_off > fde->func_size)
if ((fde_pctype == SFRAME_FDE_PCTYPE_INC && ip_off >= fde->func_size) ||
(fde_pctype == SFRAME_FDE_PCTYPE_MASK && ip_off >= fde->rep_size))
For PCTYPE_INC the FRE IP offset must be less than the FDE function size.
For PCTYPE_MASK the FRE IP offset must be less than the FDE repetition size.
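
The two bounds can be written as one small predicate. The sketch below is illustrative only: the enum and fre_ip_off_valid() are hypothetical stand-ins, and the enum values are not taken from the SFrame headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for SFRAME_FDE_PCTYPE_INC / SFRAME_FDE_PCTYPE_MASK. */
enum fde_pctype {
	FDE_PCTYPE_INC,
	FDE_PCTYPE_MASK,
};

/*
 * An FRE IP offset is valid when it is strictly below the function size
 * (PCTYPE_INC) or strictly below the repetition size (PCTYPE_MASK).
 */
static bool fre_ip_off_valid(enum fde_pctype pctype, uint32_t ip_off,
			     uint32_t func_size, uint32_t rep_size)
{
	switch (pctype) {
	case FDE_PCTYPE_INC:
		return ip_off < func_size;
	case FDE_PCTYPE_MASK:
		return ip_off < rep_size;
	}
	return false;	/* unknown PC type: reject */
}
```

Rejecting unknown PC types in the default path would also cover the earlier question of whether unsupported FDE PC type values need an explicit check.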
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> + dataword_count = SFRAME_V3_FRE_DATAWORD_COUNT(info);
> + dataword_size = dataword_size_enum_to_size(SFRAME_V3_FRE_DATAWORD_SIZE(info));
> + if (!dataword_count || !dataword_size)
> + return -EFAULT;
> +
> + if (cur + (dataword_count * dataword_size) > sec->fres_end)
> + return -EFAULT;
> +
> + /* TODO: Support for flexible FDEs not implemented yet. */
> + if (fde_type != SFRAME_FDE_TYPE_DEFAULT)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(cfa_off, cur, dataword_size, Efault);
> + dataword_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!dataword_count--)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && dataword_count) {
> + dataword_count--;
> + UNSAFE_GET_USER_INC(fp_off, cur, dataword_size, Efault);
> + }
> +
> + if (dataword_count)
> + return -EFAULT;
> +
> + fre->size = addr_size + 1 + (dataword_count * dataword_size);
> + fre->ip_off = ip_off;
> + fre->cfa_off = cfa_off;
> + fre->ra_off = ra_off;
> + fre->fp_off = fp_off;
> + fre->info = info;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Jens Remus @ 2026-05-08 7:46 UTC (permalink / raw)
To: Steven Rostedt, LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>
On 4/29/2026 5:43 PM, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> [
> This is an RFC that adds a system call for dynamic linkers to use to
> tell the kernel where the sframe sections are when it loads dynamic
> libraries.
>
> It is built on top of Jens's sframe implementation for v3:
>
> https://lore.kernel.org/linux-trace-kernel/20260127150554.2760964-1-jremus@linux.ibm.com/
>
> I have a repo with that code that this applies on top of here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git sframe/core
>
>
> The name of the system call is "stacktrace_setup", but I'm not attached
> to this name. If anyone can think of a better name I'm happy to take
> suggestions.
>
> This patch is just to get the conversation going and the final result
> may be much different. I tested this with the attached program which is a
> major hack. I built glibc with sframe v3 support and I used readelf to
> find the sframe size and location of glibc.
>
> readelf -e /work/usr/lib/libc.so.6 | grep sframe
> [19] .sframe GNU_SFRAME 00000000001d3fc0 001d3fc0
>
> Then I wrote a program that takes the above location and size of the
> .sframe section in libc as parameters, scans /proc/self/maps to find
> where it loaded libc and then calls this new system call with a pointer
> to the location of the sframe along with its size, as well as where the
> libc text is located.
>
> It then spins for 2 seconds, calls the system call again to remove the
> sframe section it loaded, and spins for another 2 seconds.
>
> I ran perf record --call-graph fp,defer on the program and looked for
> the do_spin() function.
>
> With sframe loaded:
>
> sframe-test 1350 1396.333593: 202366 cpu/cycles/P:
> 7fdf0ec38a44 [unknown] ([vdso])
> 5621a6b97243 get_time+0x19 (/work/c/sframe-test)
> 5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
> 5621a6b975cd main+0xd4 (/work/c/sframe-test)
> 7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)
> 7fdf0ea26d05 __libc_start_main@@GLIBC_2.34+0x85 (/work/usr/lib/libc.so.6)
> 5621a6b97131 _start+0x21 (/work/c/sframe-test)
>
> After it unloads the sframe:
>
> sframe-test 1350 1400.332902: 657582 cpu/cycles/P:
> 7fdf0ec38a5e [unknown] ([vdso])
> 5621a6b97243 get_time+0x19 (/work/c/sframe-test)
> 5621a6b9727f do_spin+0x1f (/work/c/sframe-test)
> 5621a6b97602 main+0x109 (/work/c/sframe-test)
> 7fdf0ea26bda __libc_start_call_main+0x6a (/work/usr/lib/libc.so.6)
>
> As you can see, with the sframe loaded, it was able to walk further up
> the libc library.
>
> Again, this is just an RFC, but I would like to get agreement on the
> system call so that we can then update the dynamic linker to do this
> instead of using my hack ;-)
> ]
>
> Add a system call that can be used by dynamic linkers to tell the kernel
> where the sframe section is in memory for libraries it loads.
>
> The system call stacktrace_setup takes 5 parameters:
>
> op - the type of operation to perform
> addr_start - The virtual address of the sframe section
> addr_length - The length of the sframe section
> text_start - the text section the sframe represents
> text_length - the length of the text section
>
> The current op values are:
>
> STACKTRACE_REGISTER_SFRAME - This registers the sframe
> STACKTRACE_UNREGISTER_SFRAME - This removes the sframe
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
LGTM. Some comments/questions below.
> diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_STACKTRACE_H
> +#define _UAPI_LINUX_STACKTRACE_H
> +
> +enum stacktrace_setup_types {
> + STACKTRACE_REGISTER_SFRAME = 1,
> + STACKTRACE_UNREGISTER_SFRAME = 2,
> +};
> +
> +#endif /* _UAPI_LINUX_STACKTRACE_H */
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
Having the syscall live in kernel/unwind/sframe.c means it is only
available if config option HAVE_UNWIND_USER_SFRAME is selected (which
triggers sframe.o to be built and linked into the kernel), which makes
sense as long as it only implements sframe-specific functionality.
I suppose it could be moved elsewhere if non-sframe use cases arise
in the future?
Would Dylan need to guard it when introducing HAVE_UNWIND_KERNEL_SFRAME?
Provided the syscall fails with -ENOSYS if not implemented (e.g. when
HAVE_UNWIND_USER_SFRAME is not enabled) the dummy implementations of
sframe_add_section() and sframe_remove_section() in linux/sframe.h also
return -ENOSYS, so the user observable behavior would be the same and
it would not matter. Do you agree?
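To make the question concrete, here is a minimal user-space model of the two configurations. It is a sketch, not kernel code: it assumes the dummy helpers return -ENOSYS as described above, and the model_* names are hypothetical. Under those assumptions a valid op gives -ENOSYS in both cases; the one visible difference is an unknown op, which yields -EINVAL when the syscall body is built on top of the dummy helpers and -ENOSYS when the syscall is not built at all.

```c
#include <assert.h>
#include <errno.h>

/* Op values as proposed in include/uapi/linux/stacktrace.h. */
enum stacktrace_setup_types {
	STACKTRACE_REGISTER_SFRAME = 1,
	STACKTRACE_UNREGISTER_SFRAME = 2,
};

/* Hypothetical stand-ins for the dummy helpers in linux/sframe.h. */
int model_sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
			     unsigned long text_start, unsigned long text_end)
{
	(void)sframe_start; (void)sframe_end; (void)text_start; (void)text_end;
	return -ENOSYS;
}

int model_sframe_remove_section(unsigned long sframe_start)
{
	(void)sframe_start;
	return -ENOSYS;
}

/* Case 1: the syscall body is built, but only the dummy helpers exist. */
int model_syscall_with_stubs(int op, unsigned long addr_start,
			     unsigned long addr_length, unsigned long text_start,
			     unsigned long text_length)
{
	switch (op) {
	case STACKTRACE_REGISTER_SFRAME:
		return model_sframe_add_section(addr_start, addr_start + addr_length,
						text_start, text_start + text_length);
	case STACKTRACE_UNREGISTER_SFRAME:
		return model_sframe_remove_section(addr_start);
	}
	return -EINVAL;
}

/* Case 2: the syscall is not built at all, so every call fails alike. */
int model_syscall_not_built(int op)
{
	(void)op;
	return -ENOSYS;
}

/* Compare what user space would observe in the two cases. */
void model_compare(void)
{
	/* Valid op: identical result in both cases. */
	assert(model_syscall_with_stubs(STACKTRACE_REGISTER_SFRAME,
					0x1000, 0x100, 0x2000, 0x400) ==
	       model_syscall_not_built(STACKTRACE_REGISTER_SFRAME));
	/* Unknown op: -EINVAL with stubs, -ENOSYS when not built. */
	assert(model_syscall_with_stubs(99, 0, 0, 0, 0) == -EINVAL);
	assert(model_syscall_not_built(99) == -ENOSYS);
}
```

Whether that difference for invalid ops matters to any caller is a separate question; the model only shows where the two cases diverge.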
> @@ -12,8 +12,10 @@
> #include <linux/mm.h>
> #include <linux/string_helpers.h>
> #include <linux/sframe.h>
> +#include <linux/syscalls.h>
> #include <asm/unwind_user_sframe.h>
> #include <linux/unwind_user_types.h>
> +#include <uapi/linux/stacktrace.h>
>
> #include "sframe.h"
> #include "sframe_debug.h"
> @@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
>
> mtree_destroy(&mm->sframe_mt);
> }
> +
> +/**
> + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> + * @op: Type of operation to perform
> + * @addr_start: The virtual address of the stacktrace information
> + * @addr_length: The length of the stacktrace information
> + * @text_start: The virtual address of the text that @addr_start represents
> + * @text_length: The length of the text
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Currently only sframes are supported, but in the future, this may be used
> + * to tell the kernel about JIT code which will most likely have a different
> + * format.
> + *
> + * The type command may be extended and parameters may be used for other
> + * purposes.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> + unsigned long, addr_length, unsigned long, text_start,
> + unsigned long, text_length)
Would it make sense to keep the parameters generic from the start, similar
to how it is done in prctl()? Or can this be changed later, if the need
arises?
SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, arg2,
unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
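One way to read this suggestion, sketched as a user-space model and not as kernel code: the entry point takes opaque arg2..arg5 and each op gives them a meaning. Rejecting non-zero unused arguments, as several prctl() options do, is an assumption added here and is not part of the posted patch. model_add_section() and model_remove_section() are hypothetical stand-ins for sframe_add_section() and sframe_remove_section().

```c
#include <assert.h>
#include <errno.h>

enum stacktrace_setup_types {
	STACKTRACE_REGISTER_SFRAME = 1,
	STACKTRACE_UNREGISTER_SFRAME = 2,
};

/* Hypothetical helpers; they accept everything so the dispatch is visible. */
int model_add_section(unsigned long sframe_start, unsigned long sframe_end,
		      unsigned long text_start, unsigned long text_end)
{
	(void)sframe_start; (void)sframe_end; (void)text_start; (void)text_end;
	return 0;
}

int model_remove_section(unsigned long sframe_start)
{
	(void)sframe_start;
	return 0;
}

/* Generic entry point: the meaning of arg2..arg5 depends on op. */
int model_stacktrace_setup(int op, unsigned long arg2, unsigned long arg3,
			   unsigned long arg4, unsigned long arg5)
{
	switch (op) {
	case STACKTRACE_REGISTER_SFRAME:
		/* arg2/arg3: sframe start and length; arg4/arg5: text start and length */
		return model_add_section(arg2, arg2 + arg3, arg4, arg4 + arg5);
	case STACKTRACE_UNREGISTER_SFRAME:
		/* Only arg2 is used; insist that the rest is zero (assumption). */
		if (arg3 || arg4 || arg5)
			return -EINVAL;
		return model_remove_section(arg2);
	}
	return -EINVAL;
}

/* Small usage check of the dispatch above. */
void model_generic_demo(void)
{
	assert(model_stacktrace_setup(STACKTRACE_REGISTER_SFRAME,
				      0x1000, 0x100, 0x2000, 0x400) == 0);
	assert(model_stacktrace_setup(STACKTRACE_UNREGISTER_SFRAME,
				      0x1000, 0, 0, 0) == 0);
}
```

The zero check is what would let a later op, or a later revision of an existing op, assign meaning to a currently unused argument without changing behavior for existing callers.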
> +{
> + switch (op) {
> + case STACKTRACE_REGISTER_SFRAME:
> + return sframe_add_section(addr_start, addr_start + addr_length,
> + text_start, text_start+text_length);
Nit:
text_start, text_start + text_length);
> + case STACKTRACE_UNREGISTER_SFRAME:
> + return sframe_remove_section(addr_start);
> + }
> + return -EINVAL;
> +}
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Jose E. Marchesi @ 2026-05-08 7:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, mhiramat, mathieu.desnoyers,
jremus, jpoimboe, peterz, mingo, jolsa, acme, namhyung, tglx,
andrii, indu.bhagat, beaub, torvalds, akpm, fweimer, kees,
codonell, sam, dylanbhatch, bp, dave.hansen, david, hpa,
Liam.Howlett, lorenzo.stoakes, mhocko, rppt, surenb, vbabka, hca,
gor
In-Reply-To: <20260507215751.0c286695@fedora>
> On Thu, 07 May 2026 14:37:36 +0200
> "Jose E. Marchesi" <jemarch@gnu.org> wrote:
>
>> FWIW passing start and end of both the tracing data and the text segment
>> it covers seems reasonable to me. This covers the case in which the
>> same tracing data describes several text segments, which can happen with
>> SFrame and other similar formats.
>
> Just so I understand you. You are suggesting to pass in the end address
> instead of the length?
No no, I was talking conceptually. Denoting the end of the region by its
length is better. I see no reason to do otherwise in this case.
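For illustration, turning a (start, length) pair into an exclusive end address can be sketched as below. The overflow check is an assumption added here; the posted patch adds the two values directly. range_end() is a hypothetical helper, and __builtin_add_overflow() is a GCC/Clang builtin.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Compute the exclusive end of [start, start + length).
 * Returns 0 and stores the end in *end, or -EINVAL if the sum wraps.
 * On failure *end holds the wrapped value and must not be used.
 */
int range_end(unsigned long start, unsigned long length, unsigned long *end)
{
	if (__builtin_add_overflow(start, length, end))
		return -EINVAL;
	return 0;
}

/* Small usage check for the helper above. */
void range_end_demo(void)
{
	unsigned long end = 0;

	assert(range_end(0x1000, 0x200, &end) == 0);
	assert(end == 0x1200);
	assert(range_end(ULONG_MAX, 1, &end) == -EINVAL);
}
```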
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-05-08 1:57 UTC (permalink / raw)
To: Jose E. Marchesi
Cc: linux-kernel, linux-trace-kernel, mhiramat, mathieu.desnoyers,
jremus, jpoimboe, peterz, mingo, jolsa, acme, namhyung, tglx,
andrii, indu.bhagat, beaub, torvalds, akpm, fweimer, kees,
codonell, sam, dylanbhatch, bp, dave.hansen, david, hpa,
Liam.Howlett, lorenzo.stoakes, mhocko, rppt, surenb, vbabka, hca,
gor
In-Reply-To: <87zf2bl7jj.fsf@gnu.org>
On Thu, 07 May 2026 14:37:36 +0200
"Jose E. Marchesi" <jemarch@gnu.org> wrote:
> FWIW passing start and end of both the tracing data and the text segment
> it covers seems reasonable to me. This covers the case in which the
> same tracing data describes several text segments, which can happen with
> SFrame and other similar formats.
Just so I understand you. You are suggesting to pass in the end address
instead of the length?
-- Steve
* Re: [PATCH] test_kprobes: clear kprobes between test runs
From: Masami Hiramatsu @ 2026-05-08 0:33 UTC (permalink / raw)
To: Martin Kaiser
Cc: Naveen N Rao, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260507134615.1010905-1-martin@kaiser.cx>
On Thu, 7 May 2026 15:44:33 +0200
Martin Kaiser <martin@kaiser.cx> wrote:
> Running the kprobes sanity tests twice makes all tests fail and
> eventually crashes the kernel.
>
> [root@martin-riscv-1 ~]# echo 1 > /sys/kernel/debug/kunit/kprobes_test/run
> ...
> # Totals: pass:5 fail:0 skip:0 total:5
> ok 1 kprobes_test
> [root@martin-riscv-1 ~]# echo 1 > /sys/kernel/debug/kunit/kprobes_test/run
> ...
> # test_kprobe: EXPECTATION FAILED at lib/tests/test_kprobes.c:64
> Expected 0 == register_kprobe(&kp), but
> register_kprobe(&kp) == -22 (0xffffffffffffffea)
> ...
> Unable to handle kernel paging request ...
Oops, good catch! Thanks for fixing!
>
> The testsuite defines several kprobes and kretprobes as static variables
> that are preserved across test runs.
>
> After register_kprobe and unregister_kprobe, a kprobe contains some
> leftover data that must be cleared before the kprobe can be registered
> again. The tests are setting symbol_name to define the probe location.
> Address and flags must be cleared.
>
> The existing code clears some of the probes between subsequent tests, but
> not between two test runs. The leftover data from a previous test run
> makes the registrations fail in the next run.
>
> Move the cleanups for all kprobes into kprobes_test_init, this function
> is called before each single test (including the first test of a test
> run).
Yeah, this looks good to me. Let me pick it.
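The failure mode can be reproduced in a small user-space model. This is a sketch, not the kernel implementation: struct model_kprobe, model_register(), model_unregister(), and model_test_run() are hypothetical, and the model assumes only that registration resolves symbol_name into addr, that unregistering leaves addr and flags behind, and that a registration with both symbol_name and addr set fails with -EINVAL (-22, as in the log above).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_kprobe {
	const char *symbol_name;	/* set once by the test, never cleared */
	void *addr;			/* filled in by registration */
	unsigned int flags;		/* internal state left behind */
};

#define MODEL_FLAG_USED 0x1u

/* Mirrors the KP_CLEAR() macro from the patch. */
#define MODEL_KP_CLEAR(_kp)		\
do {					\
	(_kp).addr = NULL;		\
	(_kp).flags = 0;		\
} while (0)

static int model_target;	/* stands in for the probed function */

int model_register(struct model_kprobe *kp)
{
	/* Exactly one of symbol_name and addr may be given. */
	if ((kp->symbol_name && kp->addr) || (!kp->symbol_name && !kp->addr))
		return -EINVAL;
	if (kp->symbol_name)
		kp->addr = &model_target;	/* "resolve" the symbol */
	kp->flags |= MODEL_FLAG_USED;
	return 0;
}

void model_unregister(struct model_kprobe *kp)
{
	/* addr and flags are deliberately left as they are. */
	(void)kp;
}

/* One test run; clear_first models the cleanup in kprobes_test_init(). */
int model_test_run(struct model_kprobe *kp, int clear_first)
{
	int ret;

	if (clear_first)
		MODEL_KP_CLEAR(*kp);
	ret = model_register(kp);
	if (ret == 0)
		model_unregister(kp);
	return ret;
}

/* A static probe reused across runs, as in the test suite. */
void model_demo(void)
{
	static struct model_kprobe kp = { .symbol_name = "kprobe_target" };

	assert(model_test_run(&kp, 0) == 0);		/* first run passes */
	assert(model_test_run(&kp, 0) == -EINVAL);	/* second run fails */
	assert(model_test_run(&kp, 1) == 0);		/* clearing fixes it */
}
```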
>
> Signed-off-by: Martin Kaiser <martin@kaiser.cx>
> ---
> lib/tests/test_kprobes.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/lib/tests/test_kprobes.c b/lib/tests/test_kprobes.c
> index b7582010125c..06e729e4de05 100644
> --- a/lib/tests/test_kprobes.c
> +++ b/lib/tests/test_kprobes.c
> @@ -12,6 +12,12 @@
>
> #define div_factor 3
>
> +#define KP_CLEAR(_kp) \
> +do { \
> + (_kp).addr = NULL; \
> + (_kp).flags = 0; \
> +} while (0)
> +
> static u32 rand1, preh_val, posth_val;
> static u32 (*target)(u32 value);
> static u32 (*recursed_target)(u32 value);
> @@ -125,10 +131,6 @@ static void test_kprobes(struct kunit *test)
>
> current_test = test;
>
> - /* addr and flags should be cleard for reusing kprobe. */
> - kp.addr = NULL;
> - kp.flags = 0;
> -
> KUNIT_EXPECT_EQ(test, 0, register_kprobes(kps, 2));
> preh_val = 0;
> posth_val = 0;
> @@ -226,9 +228,6 @@ static void test_kretprobes(struct kunit *test)
> struct kretprobe *rps[2] = {&rp, &rp2};
>
> current_test = test;
> - /* addr and flags should be cleard for reusing kprobe. */
> - rp.kp.addr = NULL;
> - rp.kp.flags = 0;
> KUNIT_EXPECT_EQ(test, 0, register_kretprobes(rps, 2));
>
> krph_val = 0;
> @@ -290,8 +289,6 @@ static void test_stacktrace_on_kretprobe(struct kunit *test)
> unsigned long myretaddr = (unsigned long)__builtin_return_address(0);
>
> current_test = test;
> - rp3.kp.addr = NULL;
> - rp3.kp.flags = 0;
>
> /*
> * Run the stacktrace_driver() to record correct return address in
> @@ -352,8 +349,6 @@ static void test_stacktrace_on_nested_kretprobe(struct kunit *test)
> struct kretprobe *rps[2] = {&rp3, &rp4};
>
> current_test = test;
> - rp3.kp.addr = NULL;
> - rp3.kp.flags = 0;
>
> //KUNIT_ASSERT_NE(test, myretaddr, stacktrace_driver());
>
> @@ -367,6 +362,18 @@ static void test_stacktrace_on_nested_kretprobe(struct kunit *test)
>
> static int kprobes_test_init(struct kunit *test)
> {
> + KP_CLEAR(kp);
> + KP_CLEAR(kp2);
> + KP_CLEAR(kp_missed);
> +#ifdef CONFIG_KRETPROBES
> + KP_CLEAR(rp.kp);
> + KP_CLEAR(rp2.kp);
> +#ifdef CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
> + KP_CLEAR(rp3.kp);
> + KP_CLEAR(rp4.kp);
> +#endif
> +#endif
> +
> target = kprobe_target;
> target2 = kprobe_target2;
> recursed_target = kprobe_recursed_target;
> --
> 2.43.7
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Jose E. Marchesi @ 2026-05-07 12:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, mhiramat, mathieu.desnoyers,
jremus, jpoimboe, peterz, mingo, jolsa, acme, namhyung, tglx,
andrii, indu.bhagat, beaub, torvalds, akpm, fweimer, kees,
codonell, sam, dylanbhatch, bp, dave.hansen, david, hpa,
Liam.Howlett, lorenzo.stoakes, mhocko, rppt, surenb, vbabka, hca,
gor
In-Reply-To: <20260429114355.6c712e6a@gandalf.local.home>
> +/**
> + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> + * @op: Type of operation to perform
> + * @addr_start: The virtual address of the stacktrace information
> + * @addr_length: The length of the stacktrace information
> + * @text_start: The virtual address of the text that @addr_start represents
> + * @text_length: The length of the text
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Currently only sframes are supported, but in the future, this may be used
> + * to tell the kernel about JIT code which will most likely have a different
> + * format.
> + *
> + * The type command may be extended and parameters may be used for other
> + * purposes.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> + unsigned long, addr_length, unsigned long, text_start,
> + unsigned long, text_length)
> +{
> + switch (op) {
> + case STACKTRACE_REGISTER_SFRAME:
> + return sframe_add_section(addr_start, addr_start + addr_length,
> + text_start, text_start+text_length);
> + case STACKTRACE_UNREGISTER_SFRAME:
> + return sframe_remove_section(addr_start);
> + }
> + return -EINVAL;
> +}
FWIW passing start and end of both the tracing data and the text segment
it covers seems reasonable to me. This covers the case in which the
same tracing data describes several text segments, which can happen with
SFrame and other similar formats.
> diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
> index 7a42b32b6577..54a99cffeec4 100644
> --- a/scripts/syscall.tbl
> +++ b/scripts/syscall.tbl
> @@ -412,3 +412,4 @@
> 469 common file_setattr sys_file_setattr
> 470 common listns sys_listns
> 471 common rseq_slice_yield sys_rseq_slice_yield
> +472 common stacktrace_setup sys_stacktrace_setup
* [POC PATCH 5/5] KVM: selftests: Test conversions for SNP
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 198 +++++++++++++++++-
1 file changed, 193 insertions(+), 5 deletions(-)
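As a worked example of the MSR-protocol encoding used by the macros in the diff below, this sketch packs and unpacks a Page State Change request in plain user-space C. The bit layout (operation in GHCBData[55:52], GFN in GHCBData[51:12], request code 0x014 in GHCBData[11:0]) is taken from the comments in this patch; the psc_* helper names and PSC_* constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PSC_REQ_CODE	0x014ULL	/* GHCBData[11:0] */
#define PSC_OP_PRIVATE	1ULL		/* page assignment, private */
#define PSC_OP_SHARED	2ULL		/* page assignment, shared */
#define PSC_GFN_MASK	((1ULL << 40) - 1)

/* Pack op, GFN and request code into one 64-bit GHCB MSR value. */
uint64_t psc_req(uint64_t gfn, uint64_t op)
{
	return ((op & 0xfULL) << 52) |
	       ((gfn & PSC_GFN_MASK) << 12) |
	       PSC_REQ_CODE;
}

/* Field extractors, the inverse of psc_req(). */
uint64_t psc_req_code(uint64_t val) { return val & 0xfffULL; }
uint64_t psc_req_gfn(uint64_t val)  { return (val >> 12) & PSC_GFN_MASK; }
uint64_t psc_req_op(uint64_t val)   { return (val >> 52) & 0xfULL; }

/* GPA 0x12345000 with 4K pages is GFN 0x12345. */
void psc_demo(void)
{
	uint64_t val = psc_req(0x12345000ULL >> 12, PSC_OP_PRIVATE);

	assert(val == 0x0010000012345014ULL);
	assert(psc_req_code(val) == PSC_REQ_CODE);
	assert(psc_req_gfn(val) == 0x12345ULL);
	assert(psc_req_op(val) == PSC_OP_PRIVATE);
}
```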
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 8b859adf4cf6f..8869cca748879 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -253,17 +253,205 @@ static void test_sev_smoke(void *guest, u32 type, u64 policy)
}
}
+#define GHCB_MSR_REG_GPA_REQ 0x012
+#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)((v) & GENMASK_ULL(51, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_REG_GPA_REQ)
+
+#define GHCB_MSR_REG_GPA_RESP 0x013
+#define GHCB_MSR_REG_GPA_RESP_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
+
+#define GHCB_DATA_LOW 12
+#define GHCB_MSR_INFO_MASK (BIT_ULL(GHCB_DATA_LOW) - 1)
+#define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
+
+/*
+ * SNP Page State Change Operation
+ *
+ * GHCBData[55:52] - Page operation:
+ * 0x0001 Page assignment, Private
+ * 0x0002 Page assignment, Shared
+ */
+enum psc_op {
+ SNP_PAGE_STATE_PRIVATE = 1,
+ SNP_PAGE_STATE_SHARED,
+};
+
+#define GHCB_MSR_PSC_REQ 0x014
+#define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
+ /* GHCBData[55:52] */ \
+ (((u64)((op) & 0xf) << 52) | \
+ /* GHCBData[51:12] */ \
+ ((u64)((gfn) & GENMASK_ULL(39, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_PSC_REQ)
+
+#define GHCB_MSR_PSC_RESP 0x015
+#define GHCB_MSR_PSC_RESP_VAL(val) \
+ /* GHCBData[63:32] */ \
+ (((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
+
+static u64 ghcb_gpa;
+static void snp_register_ghcb(void)
+{
+ u64 ghcb_pfn = ghcb_gpa >> PAGE_SHIFT;
+ u64 val;
+
+ GUEST_ASSERT(ghcb_gpa);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_REG_GPA_REQ_VAL(ghcb_gpa >> PAGE_SHIFT));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_REG_GPA_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_REG_GPA_RESP_VAL(val), ghcb_pfn);
+}
+
+static void snp_page_state_change(u64 gpa, enum psc_op op)
+{
+ u64 val;
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_PSC_REQ_GFN(gpa >> PAGE_SHIFT, op));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_PSC_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_PSC_RESP_VAL(val), 0);
+}
+
+#define RMP_PG_SIZE_4K 0
+static inline void pvalidate(void *vaddr, bool validate)
+{
+ bool no_rmpupdate;
+ int rc;
+
+ /* "pvalidate" mnemonic support in binutils 2.36 and newer */
+ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
+ : "=@ccc"(no_rmpupdate), "=a"(rc)
+ : "a"(vaddr), "c"(RMP_PG_SIZE_4K), "d"(validate)
+ : "memory", "cc");
+
+ GUEST_ASSERT(!no_rmpupdate);
+ GUEST_ASSERT_EQ(rc, 0);
+}
+
+#define CONVERSION_TEST_VALUE_SHARED_1 0xab
+#define CONVERSION_TEST_VALUE_SHARED_2 0xcd
+#define CONVERSION_TEST_VALUE_PRIVATE 0xef
+#define CONVERSION_TEST_VALUE_SHARED_3 0xbc
+#define CONVERSION_TEST_VALUE_SHARED_4 0xde
+static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64 test_gpa)
+{
+ snp_register_ghcb();
+
+ GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_1);
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_2);
+
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_PRIVATE);
+ pvalidate(test_private_gva, true);
+
+ WRITE_ONCE(*test_private_gva, CONVERSION_TEST_VALUE_PRIVATE);
+ GUEST_ASSERT_EQ(READ_ONCE(*test_private_gva), CONVERSION_TEST_VALUE_PRIVATE);
+
+ pvalidate(test_private_gva, false);
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_SHARED);
+
+ GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_3);
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_4);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_TERM_REQ);
+ vmgexit();
+}
+
+static void test_conversion(u64 policy)
+{
+ gva_t test_private_gva;
+ gva_t test_shared_gva;
+ struct kvm_vcpu *vcpu;
+ gva_t ghcb_gva;
+ gpa_t test_gpa;
+ struct kvm_vm *vm;
+ void *ghcb_hva;
+ void *test_hva;
+
+ vm = vm_sev_create_with_one_vcpu(KVM_X86_SNP_VM, guest_code_conversion, &vcpu);
+
+ ghcb_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ ghcb_hva = addr_gva2hva(vm, ghcb_gva);
+ ghcb_gpa = addr_gva2gpa(vm, ghcb_gva);
+ sync_global_to_guest(vm, ghcb_gpa);
+
+ test_shared_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ test_hva = addr_gva2hva(vm, test_shared_gva);
+ test_gpa = addr_gva2gpa(vm, test_shared_gva);
+
+ test_private_gva = vm_unused_gva_gap(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR);
+ ___virt_pg_map(vm, &vm->mmu, test_private_gva, test_gpa, PG_SIZE_4K, true);
+
+ vcpu_args_set(vcpu, 3, test_shared_gva, test_private_gva, test_gpa);
+
+ vm_sev_launch(vm, policy, NULL);
+
+ WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_1);
+
+ fprintf(stderr, "ghcb_hva=%p ghcb_gpa=%lx ghcb_gva=%lx\n", ghcb_hva, ghcb_gpa, ghcb_gva);
+ fprintf(stderr, "test_hva=%p test_gpa=%lx test_private_gva=%lx test_shared_gva=%lx\n", test_hva, test_gpa, test_private_gva, test_shared_gva);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_private(vm, test_gpa, PAGE_SIZE);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_shared(vm, test_gpa, PAGE_SIZE);
+
+ fprintf(stderr, "test_hva contents = %x\n", READ_ONCE(*(u8 *)test_hva));
+
+ WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+ TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SYSTEM_EVENT);
+ TEST_ASSERT_EQ(vcpu->run->system_event.type, KVM_SYSTEM_EVENT_SEV_TERM);
+ TEST_ASSERT_EQ(vcpu->run->system_event.ndata, 1);
+ TEST_ASSERT_EQ(vcpu->run->system_event.data[0], GHCB_MSR_TERM_REQ);
+
+ TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_4);
+}
+
int main(int argc, char *argv[])
{
TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
- test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+ // test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+
+ // if (kvm_cpu_has(X86_FEATURE_SEV_ES))
+ // test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
- if (kvm_cpu_has(X86_FEATURE_SEV_ES))
- test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+ if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
+ test_conversion(snp_default_policy());
- if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
- test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ // test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ }
return 0;
}
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 4/5] KVM: selftests: Allow specifying CoCo-privateness while mapping a page
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/include/x86/processor.h | 2 ++
tools/testing/selftests/kvm/lib/x86/processor.c | 13 ++++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789d..683f21452db58 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -1507,6 +1507,8 @@ enum pg_level {
void tdp_mmu_init(struct kvm_vm *vm, int pgtable_levels,
struct pte_masks *pte_masks);
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private);
void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
gpa_t gpa, int level);
void virt_map_level(struct kvm_vm *vm, gva_t gva, gpa_t gpa,
diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
index b51467d70f6e7..02781194f51a2 100644
--- a/tools/testing/selftests/kvm/lib/x86/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86/processor.c
@@ -256,8 +256,8 @@ static u64 *virt_create_upper_pte(struct kvm_vm *vm,
return pte;
}
-void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
- gpa_t gpa, int level)
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private)
{
const u64 pg_size = PG_LEVEL_SIZE(level);
u64 *pte = &mmu->pgd;
@@ -309,12 +309,19 @@ void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
* Neither SEV nor TDX supports shared page tables, so only the final
* leaf PTE needs manually set the C/S-bit.
*/
- if (vm_is_gpa_protected(vm, gpa))
+ if (private)
*pte |= PTE_C_BIT_MASK(mmu);
else
*pte |= PTE_S_BIT_MASK(mmu);
}
+void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level)
+{
+ ___virt_pg_map(vm, mmu, gva, gpa, level,
+ vm_is_gpa_protected(vm, gpa));
+}
+
void virt_arch_pg_map(struct kvm_vm *vm, gva_t gva, gpa_t gpa)
{
__virt_pg_map(vm, &vm->mmu, gva, gpa, PG_LEVEL_4K);
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 3/5] KVM: selftests: Make guest_code_xsave more friendly
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
The original implementation of guest_code_xsave makes a jmp to
guest_sev_es_code in inline assembly. When code that uses guest_sev_es_code
is removed, guest_sev_es_code will be optimized out, leading to a linking
error since guest_code_xsave still tries to jmp to guest_sev_es_code.
Rewrite guest_code_xsave() to instead make a call, in C, to
guest_sev_es_code(), so that usage of guest_sev_es_code() is made known to
the compiler.
This rewriting also gives a name to the xsave inline assembly, improving
readability.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 24 +++++++++++++------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee3915864..8b859adf4cf6f 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -80,13 +80,23 @@ static void guest_sev_code(void)
GUEST_DONE();
}
-/* Stash state passed via VMSA before any compiled code runs. */
-extern void guest_code_xsave(void);
-asm("guest_code_xsave:\n"
- "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
- "xor %edx, %edx\n"
- "xsave (%rdi)\n"
- "jmp guest_sev_es_code");
+static void xsave_all_registers(void *addr)
+{
+ __asm__ __volatile__(
+ "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
+ "xor %edx, %edx\n"
+ "xsave (%0)"
+ :
+ : "r"(addr)
+ : "eax", "edx", "memory"
+ );
+}
+
+static void guest_code_xsave(void *vmsa_gva)
+{
+ xsave_all_registers(vmsa_gva);
+ guest_sev_es_code();
+}
static void compare_xsave(u8 *from_host, u8 *from_guest)
{
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 2/5] KVM: selftests: Use guest_memfd memory contents in-place for SNP launch update
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Update the SEV-SNP launch update flow to utilize guest_memfd in-place
conversion.
Include the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE flag when setting memory
attributes to private. This is permitted before the SNP VM is finalized.
In snp_launch_update_data, pass 0 as the host virtual address. This
instructs the kernel to perform the launch update using the guest_memfd
backing the guest physical address rather than a userspace-provided
buffer.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/lib/x86/sev.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index 93f9169034617..074ab0eff1e27 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -37,8 +37,7 @@ static void encrypt_region(struct kvm_vm *vm, struct userspace_mem_region *regio
if (is_sev_snp_vm(vm))
snp_launch_update_data(vm, gpa_base + offset,
- (u64)addr_gpa2hva(vm, gpa_base + offset),
- size, page_type);
+ 0, size, page_type);
else
sev_launch_update_data(vm, gpa_base + offset, size);
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 1/5] KVM: selftests: Initialize guest_memfd with INIT_SHARED
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu, Sagi Shahar
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Initialize guest_memfd with INIT_SHARED for VM types that require
guest_memfd.
Memory in the first memslot is used by the selftest framework to load
code, page tables, interrupt descriptor tables, and basically everything
the selftest needs to run. The selftest framework sets all of these up
assuming that the memory in the memslot can be written to from the
host. Align with that behavior by initializing guest_memfd as shared so
that all the writes from the host are permitted.
guest_memfd memory can later be marked private if necessary by CoCo
platform-specific initialization functions.
Suggested-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d1befa3f4b305..a377e5f333116 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -483,8 +483,10 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
{
u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
nr_extra_pages);
+ enum vm_mem_backing_src_type src_type;
struct userspace_mem_region *slot0;
struct kvm_vm *vm;
+ u64 gmem_flags;
int i, flags;
kvm_set_files_rlimit(nr_runnable_vcpus);
@@ -502,7 +504,15 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
if (is_guest_memfd_required(shape))
flags |= KVM_MEM_GUEST_MEMFD;
- vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
+ gmem_flags = 0;
+ src_type = VM_MEM_SRC_ANONYMOUS;
+ if (is_guest_memfd_required(shape) && kvm_has_gmem_attributes) {
+ src_type = VM_MEM_SRC_SHMEM;
+ gmem_flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;
+ }
+
+ vm_mem_add(vm, src_type, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
+
for (i = 0; i < NR_MEM_REGIONS; i++)
vm->memslots[i] = 0;
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 0/5] guest_memfd in-place conversion selftests for SNP
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
With these POC patches, I was able to test the set memory
attributes/conversion ioctls with SNP.
After allowing src_addr to be NULL, SNP_LAUNCH_UPDATE can accept NULL
for the source address and the SNP VM runs fine. :)
Ackerley Tng (5):
KVM: selftests: Initialize guest_memfd with INIT_SHARED
KVM: selftests: Use guest_memfd memory contents in-place for SNP
launch update
KVM: selftests: Make guest_code_xsave more friendly
KVM: selftests: Allow specifying CoCo-privateness while mapping a page
KVM: selftests: Test conversions for SNP
.../selftests/kvm/include/x86/processor.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../testing/selftests/kvm/lib/x86/processor.c | 13 +-
tools/testing/selftests/kvm/lib/x86/sev.c | 3 +-
.../selftests/kvm/x86/sev_smoke_test.c | 222 +++++++++++++++++-
5 files changed, 234 insertions(+), 18 deletions(-)
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v6 42/43] KVM: selftests: Add script to exercise private_mem_conversions_test
From: Ackerley Tng via B4 Relay @ 2026-05-07 20:23 UTC (permalink / raw)
To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Add a wrapper script to simplify running the private_mem_conversions_test
with a variety of configurations. Manually invoking the test for all
supported memory backing source types is tedious.
The script automatically detects the availability of 2MB and 1GB hugepages
and builds a list of source types to test. It then iterates through the
list, running the test for each type with both a single memslot and
multiple memslots.
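To make the thresholds concrete, here is a minimal sketch of the availability check, assuming the same per-vCPU requirements the script uses (1024 2MB pages or 2 1GB pages per vCPU, i.e. 8 GiB either way for 4 vCPUs). The function name `hugepages_sufficient` is hypothetical, and it takes the page count as an argument instead of reading sysfs so that it can be tried on any machine.

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): prints 1 if the available
# page count meets the requirement, 0 otherwise.
hugepages_sufficient() {
    local available=$1
    local required=$2
    echo $((available >= required))
}

num_vcpus=4
required_2m=$((1024 * num_vcpus))  # 4096 pages of 2MB
required_1g=$((2 * num_vcpus))     # 8 pages of 1GB

echo "2M pages required: $required_2m"
echo "1G pages required: $required_1g"
echo "2M enabled with 4096 pages: $(hugepages_sufficient 4096 "$required_2m")"
echo "1G enabled with 7 pages: $(hugepages_sufficient 7 "$required_1g")"
```

In the real script the counts come from the per-size nr_hugepages files under /sys/kernel/mm/hugepages/, and a size that falls short is skipped rather than failing the run.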
This makes it easier to get comprehensive test coverage across different
memory configurations.
Add a small helper program in C that reads
KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES as defined in the header files and
issues the ioctl to query the capability; the script uses its output.
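As a rough illustration of how the wrapper consumes the helper: the helper prints 1 or 0, and the script takes the last line of that output to decide which backing source types to test. The sketch below replaces the helper binary with a hypothetical stub, `fake_has_gmem_attributes`, so it runs without KVM; `select_src_types` is likewise a name invented for this example. The source type names are the ones used in the script.

```shell
#!/bin/bash
# Hypothetical stand-in for the kvm_has_gmem_attributes binary:
# echoes its argument (1 = capability present, 0 = absent).
fake_has_gmem_attributes() {
    echo "$1"
}

# Choose backing source types from the last line of the helper's output.
select_src_types() {
    local has_attrs
    has_attrs=$(fake_has_gmem_attributes "$1" | tail -n1)
    if [ "$has_attrs" -eq 1 ]; then
        echo "shmem"
    else
        # The hugetlb types would be added here when enough pages exist.
        echo "anonymous anonymous_thp shmem"
    fi
}

select_src_types 1
select_src_types 0
```

Keeping the capability lookup in a compiled helper means the shell script never has to hard-code the capability number.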
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/Makefile.kvm | 4 +
.../selftests/kvm/kvm_has_gmem_attributes.c | 17 +++
.../kvm/x86/private_mem_conversions_test.sh | 128 +++++++++++++++++++++
3 files changed, 149 insertions(+)
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 6232881be500a..e5769268936a7 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -54,6 +54,7 @@ LIBKVM_loongarch += lib/loongarch/exception.S
# Non-compiled test targets
TEST_PROGS_x86 += x86/nx_huge_pages_test.sh
+TEST_PROGS_x86 += x86/private_mem_conversions_test.sh
# Compiled test targets valid on all architectures with libkvm support
TEST_GEN_PROGS_COMMON = demand_paging_test
@@ -67,6 +68,8 @@ TEST_GEN_PROGS_COMMON += set_memory_region_test
TEST_GEN_PROGS_COMMON += memslot_modification_stress_test
TEST_GEN_PROGS_COMMON += memslot_perf_test
+TEST_GEN_PROGS_EXTENDED_COMMON += kvm_has_gmem_attributes
+
# Compiled test targets
TEST_GEN_PROGS_x86 = $(TEST_GEN_PROGS_COMMON)
TEST_GEN_PROGS_x86 += x86/cpuid_test
@@ -245,6 +248,7 @@ SPLIT_TESTS += get-reg-list
TEST_PROGS += $(TEST_PROGS_$(ARCH))
TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(ARCH))
+TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_COMMON)
TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_$(ARCH))
LIBKVM += $(LIBKVM_$(ARCH))
diff --git a/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c b/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c
new file mode 100644
index 0000000000000..4f361349412fb
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Utility to check if KVM supports guest_memfd attributes.
+ *
+ * Copyright (C) 2025, Google LLC.
+ */
+
+#include <stdio.h>
+
+#include "kvm_util.h"
+
+int main(void)
+{
+ printf("%u\n", kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
new file mode 100755
index 0000000000000..7179a4fcdd498
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
@@ -0,0 +1,128 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper script which runs different test setups of
+# private_mem_conversions_test.
+#
+# Copyright (C) 2025, Google LLC.
+
+NUM_VCPUS_TO_TEST=4
+NUM_MEMSLOTS_TO_TEST=$NUM_VCPUS_TO_TEST
+
+# Required pages are based on the test setup in the C code.
+REQUIRED_NUM_2M_HUGEPAGES=$((1024 * NUM_VCPUS_TO_TEST))
+REQUIRED_NUM_1G_HUGEPAGES=$((2 * NUM_VCPUS_TO_TEST))
+
+get_hugepage_count() {
+ local page_size_kb=$1
+ local path="/sys/kernel/mm/hugepages/hugepages-${page_size_kb}kB/nr_hugepages"
+ if [ -f "$path" ]; then
+ cat "$path"
+ else
+ echo 0
+ fi
+}
+
+get_default_hugepage_size_in_kb() {
+ local size=$(grep "Hugepagesize:" /proc/meminfo | awk '{print $2}')
+ echo "$size"
+}
+
+run_tests() {
+ local executable_path=$1
+ local src_type=$2
+ local num_memslots=$3
+ local num_vcpus=$4
+
+ echo "$executable_path -s $src_type -m $num_memslots -n $num_vcpus"
+ "$executable_path" -s "$src_type" -m "$num_memslots" -n "$num_vcpus"
+}
+
+script_dir=$(dirname "$(realpath "$0")")
+test_executable="${script_dir}/private_mem_conversions_test"
+kvm_has_gmem_attributes_tool="${script_dir}/../kvm_has_gmem_attributes"
+
+if [ ! -f "$test_executable" ]; then
+ echo "Error: Test executable not found at '$test_executable'" >&2
+ exit 1
+fi
+
+if [ ! -f "$kvm_has_gmem_attributes_tool" ]; then
+ echo "Error: kvm_has_gmem_attributes utility not found at '$kvm_has_gmem_attributes_tool'" >&2
+ exit 1
+fi
+
+kvm_has_gmem_attributes=$("$kvm_has_gmem_attributes_tool" | tail -n1)
+
+if [ "$kvm_has_gmem_attributes" -eq 1 ]; then
+ backing_src_types=("shmem")
+else
+ hugepage_2mb_count=$(get_hugepage_count 2048)
+ hugepage_2mb_enabled=$((hugepage_2mb_count >= REQUIRED_NUM_2M_HUGEPAGES))
+ hugepage_1gb_count=$(get_hugepage_count 1048576)
+ hugepage_1gb_enabled=$((hugepage_1gb_count >= REQUIRED_NUM_1G_HUGEPAGES))
+
+ default_hugepage_size_kb=$(get_default_hugepage_size_in_kb)
+ hugepage_default_enabled=0
+ if [ "$default_hugepage_size_kb" -eq 2048 ]; then
+ hugepage_default_enabled=$hugepage_2mb_enabled
+ elif [ "$default_hugepage_size_kb" -eq 1048576 ]; then
+ hugepage_default_enabled=$hugepage_1gb_enabled
+ fi
+
+ backing_src_types=("anonymous" "anonymous_thp")
+
+ if [ "$hugepage_default_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb")
+ else
+ echo "skipping anonymous_hugetlb backing source type"
+ fi
+
+ if [ "$hugepage_2mb_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb_2mb")
+ else
+ echo "skipping anonymous_hugetlb_2mb backing source type"
+ fi
+
+ if [ "$hugepage_1gb_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb_1gb")
+ else
+ echo "skipping anonymous_hugetlb_1gb backing source type"
+ fi
+
+ backing_src_types+=("shmem")
+
+ if [ "$hugepage_default_enabled" -eq 1 ]; then
+ backing_src_types+=("shared_hugetlb")
+ else
+ echo "skipping shared_hugetlb backing source type"
+ fi
+fi
+
+return_code=0
+for i in "${!backing_src_types[@]}"; do
+ src_type=${backing_src_types[$i]}
+ if [ "$i" -gt 0 ]; then
+ echo
+ fi
+
+ run_tests "$test_executable" "$src_type" 1 1 || {
+ return_code=$?
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m 1 -n 1" >&2
+ break
+ }
+
+ run_tests "$test_executable" "$src_type" 1 "$NUM_VCPUS_TO_TEST" || {
+ return_code=$?
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m 1 -n $NUM_VCPUS_TO_TEST" >&2
+ break
+ }
+
+ run_tests "$test_executable" "$src_type" "$NUM_MEMSLOTS_TO_TEST" "$NUM_VCPUS_TO_TEST" || {
+ return_code=$?
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m $NUM_MEMSLOTS_TO_TEST -n $NUM_VCPUS_TO_TEST" >&2
+ break
+ }
+done
+
+exit "$return_code"
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v6 43/43] KVM: selftests: Update private memory exits test to work with per-gmem attributes
From: Ackerley Tng via B4 Relay @ 2026-05-07 20:23 UTC (permalink / raw)
To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
From: Sean Christopherson <seanjc@google.com>
Skip setting memory to private in the private memory exits test when using
per-gmem memory attributes, as memory is initialized to private by default
for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
doable, but would need to be conditional and is ultimately unnecessary).
Expect an emulated MMIO instead of a memory fault exit when attributes are
per-gmem, as deleting the memslot effectively drops the private status,
i.e. the GPA becomes shared and thus supports emulated MMIO.
Skip the "memslot not private" test entirely, as private vs. shared state
for x86 software-protected VMs comes from the memory attributes themselves,
and so when doing in-place conversions there can never be a disconnect
between the expected and actual states.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/private_mem_kvm_exits_test.c | 36 ++++++++++++++++++----
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
index 10db9fe6d9063..70ed16066c63e 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
@@ -62,8 +62,9 @@ static void test_private_access_memslot_deleted(void)
virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
- /* Request to access page privately */
- vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+ /* Request to access page privately. */
+ if (!kvm_has_gmem_attributes)
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
pthread_create(&vm_thread, NULL,
(void *(*)(void *))run_vcpu_get_exit_reason,
@@ -74,10 +75,26 @@ static void test_private_access_memslot_deleted(void)
pthread_join(vm_thread, &thread_return);
exit_reason = (u32)(u64)thread_return;
- TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+ /*
+ * If attributes are tracked per-gmem, deleting the memslot that points
+ * at the gmem instance effectively makes the memory shared, and so the
+ * read should trigger emulated MMIO.
+ *
+ * If attributes are tracked per-VM, deleting the memslot shouldn't
+ * affect the private attribute, and so KVM should generate a memory
+ * fault exit (emulated MMIO on private GPAs is disallowed).
+ */
+ if (kvm_has_gmem_attributes) {
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MMIO);
+ TEST_ASSERT_EQ(vcpu->run->mmio.phys_addr, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->mmio.len, sizeof(u64));
+ TEST_ASSERT_EQ(vcpu->run->mmio.is_write, false);
+ } else {
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+ }
kvm_vm_free(vm);
}
@@ -88,6 +105,13 @@ static void test_private_access_memslot_not_private(void)
struct kvm_vcpu *vcpu;
u32 exit_reason;
+ /*
+ * Accessing non-private memory as private with a software-protected VM
+ * isn't possible when doing in-place conversions.
+ */
+ if (kvm_has_gmem_attributes)
+ return;
+
vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
guest_repeatedly_read);
--
2.54.0.563.g4f69b47b94-goog