* [PATCH v2 07/14] verification/rvgen: Fix ltl2k writing True as a literal
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco,
Nam Cao
Cc: Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
The rvgen parser for LTL stores literal true values in the python
representation (capitalised True), this doesn't build in C.
The Literal class should already handle this case but ASTNode skips its
strigification method and converts the value (true/false) directly.
Fix by delegating ASTNode stringification to the Literal and Variable
classes instead of bypassing them.
Fixes: 97ffa4ce6ab32 ("verification/rvgen: Add support for linear temporal logic")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rvgen/rvgen/ltl2ba.py | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/verification/rvgen/rvgen/ltl2ba.py b/tools/verification/rvgen/rvgen/ltl2ba.py
index 7f538598a868..016e7cf93bbb 100644
--- a/tools/verification/rvgen/rvgen/ltl2ba.py
+++ b/tools/verification/rvgen/rvgen/ltl2ba.py
@@ -122,10 +122,8 @@ class ASTNode:
return self.op.expand(self, node, node_set)
def __str__(self):
- if isinstance(self.op, Literal):
- return str(self.op.value)
- if isinstance(self.op, Variable):
- return self.op.name.lower()
+ if isinstance(self.op, (Literal, Variable)):
+ return str(self.op)
return "val" + str(self.id)
def normalize(self):
@@ -382,6 +380,9 @@ class Variable:
def __iter__(self):
yield from ()
+ def __str__(self):
+ return self.name.lower()
+
def negate(self):
new = ASTNode(self)
return NotOp(new)
--
2.54.0
^ permalink raw reply related
* [PATCH v2 06/14] verification/rvgen: Fix options shared among commands
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
After rvgen was refactored to use subparsers, the common options (-a and
-D) were left in the main parser. This meant that they needed to be
called /before/ the subcommand and using them without subcommand was
allowed. This is not the original intent.
rvgen -D "some description" container -n name
Define the options as parent in the subparsers to allow them to be used
from both subcommands together with other options.
rvgen container -n name -D "some description"
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rvgen/__main__.py | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/tools/verification/rvgen/__main__.py b/tools/verification/rvgen/__main__.py
index 3be7f85fe37b..5c923dc10d0f 100644
--- a/tools/verification/rvgen/__main__.py
+++ b/tools/verification/rvgen/__main__.py
@@ -18,14 +18,16 @@ if __name__ == '__main__':
import sys
parser = argparse.ArgumentParser(description='Generate kernel rv monitor')
- parser.add_argument("-D", "--description", dest="description", required=False)
- parser.add_argument("-a", "--auto_patch", dest="auto_patch",
+
+ parent_parser = argparse.ArgumentParser(add_help=False)
+ parent_parser.add_argument("-D", "--description", dest="description", required=False)
+ parent_parser.add_argument("-a", "--auto_patch", dest="auto_patch",
action="store_true", required=False,
help="Patch the kernel in place")
subparsers = parser.add_subparsers(dest="subcmd", required=True)
- monitor_parser = subparsers.add_parser("monitor")
+ monitor_parser = subparsers.add_parser("monitor", parents=[parent_parser])
monitor_parser.add_argument('-n', "--model_name", dest="model_name")
monitor_parser.add_argument("-p", "--parent", dest="parent",
required=False, help="Create a monitor nested to parent")
@@ -36,7 +38,7 @@ if __name__ == '__main__':
monitor_parser.add_argument('-t', "--monitor_type", dest="monitor_type", required=True,
help=f"Available options: {', '.join(Monitor.monitor_types.keys())}")
- container_parser = subparsers.add_parser("container")
+ container_parser = subparsers.add_parser("container", parents=[parent_parser])
container_parser.add_argument('-n', "--model_name", dest="model_name", required=True)
params = parser.parse_args()
--
2.54.0
^ permalink raw reply related
* [PATCH v2 05/14] tools/rv: Add selftests
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
The rv tool needs automated testing to catch regressions and verify
correct functionality across different usage scenarios.
Add selftests that validate monitor listing (including containers and
nested monitors), monitor execution with different configurations
(reactors, verbose output, tracing), and trace output format for both
per-task and per-cpu monitors. Error handling paths are also tested.
Tests use a shared engine for common patterns.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rv/Makefile | 5 +-
tools/verification/rv/tests/rv_list.t | 48 ++++++++++
tools/verification/rv/tests/rv_mon.t | 95 ++++++++++++++++++
tools/verification/tests/engine.sh | 132 ++++++++++++++++++++++++++
4 files changed, 279 insertions(+), 1 deletion(-)
create mode 100644 tools/verification/rv/tests/rv_list.t
create mode 100644 tools/verification/rv/tests/rv_mon.t
create mode 100644 tools/verification/tests/engine.sh
diff --git a/tools/verification/rv/Makefile b/tools/verification/rv/Makefile
index 5b898360ba48..8ae5fc0d1d17 100644
--- a/tools/verification/rv/Makefile
+++ b/tools/verification/rv/Makefile
@@ -78,4 +78,7 @@ clean: doc_clean fixdep-clean
$(Q)rm -f rv rv-static fixdep FEATURE-DUMP rv-*
$(Q)rm -rf feature
-.PHONY: FORCE clean
+check: $(RV)
+ RV=$(RV) prove -o --directives -f tests/
+
+.PHONY: FORCE clean check
diff --git a/tools/verification/rv/tests/rv_list.t b/tools/verification/rv/tests/rv_list.t
new file mode 100644
index 000000000000..201af33a52cc
--- /dev/null
+++ b/tools/verification/rv/tests/rv_list.t
@@ -0,0 +1,48 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+source ../tests/engine.sh
+test_begin
+
+set_timeout 30s
+
+RVDIR=/sys/kernel/tracing/rv/
+
+# Help and basic tests
+check "verify help page" \
+ "$RV --help" 0 "usage: rv command"
+
+check "verify list subcommand help" \
+ "$RV list --help" 0 "list all available monitors"
+
+all_nested=$(grep : $RVDIR/available_monitors | cut -d: -f2 | paste -s | sed 's/\t/\\|/g')
+all_non_nested=$(grep -v : $RVDIR/available_monitors | cut -d: -f2 | paste -s | sed 's/\t/\\|/g')
+sched_monitors=$(grep sched: $RVDIR/available_monitors | cut -d: -f2 | paste -s | sed 's/\t/\\|/g')
+description_state="[[:space:]]\+[[:print:]]\+\[\(OFF\|ON\)\]"
+line_nested=" - \($all_nested\)${description_state}"
+line_non_nested="\($all_non_nested\)${description_state}"
+
+# List monitors and containers
+check "list all monitors" \
+ "$RV list" 0 "" "" "^\($line_nested\|$line_non_nested\)$"
+
+check_if_exists "list container" \
+ "$RV list sched" "$RVDIR/monitors/sched" \
+ "" "-- No monitor found in container sched --" \
+ "^\($sched_monitors\)${description_state}$"
+
+check_if_exists "list non-container" \
+ "$RV list wwnr" "$RVDIR/monitors/wwnr" \
+ "-- No monitor found in container wwnr --" \
+ "^\( - \)\?[[:alnum:]]\+${description_state}$"
+
+check "list incomplete container name" \
+ "$RV list s" 0 "-- No monitor found in container s --"
+
+# Error handling tests
+check "no command" \
+ "$RV" 1 "rv requires a command"
+
+check "invalid command" \
+ "$RV invalid" 1 "rv does not know the invalid command"
+
+test_end
diff --git a/tools/verification/rv/tests/rv_mon.t b/tools/verification/rv/tests/rv_mon.t
new file mode 100644
index 000000000000..cbc346c74c71
--- /dev/null
+++ b/tools/verification/rv/tests/rv_mon.t
@@ -0,0 +1,95 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+source ../tests/engine.sh
+test_begin
+
+set_timeout 30s
+
+RVDIR=/sys/kernel/tracing/rv/
+
+# Help and basic tests
+check "verify mon subcommand help" \
+ "$RV mon --help" 0 "run a monitor"
+
+# Error handling tests
+check "mon without monitor name" \
+ "$RV mon" 1 "usage: rv mon"
+
+check "invalid monitor name" \
+ "$RV mon invalid" 1 "monitor invalid does not exist"
+
+if [ -d $RVDIR/monitors/wwnr ]; then
+
+check "invalid reactor name" \
+ "$RV mon wwnr -r invalid" 1 "failed to set invalid reactor, is it available?"
+
+check "monitor name is substring of another monitor" \
+ "$RV mon nr" 1 "monitor nr does not exist"
+
+check "already enabled monitor returns error" \
+ "echo 1 > $RVDIR/monitors/wwnr/enable; $RV mon wwnr" 1 \
+ "monitor wwnr (in-kernel) is already enabled"
+echo 0 > $RVDIR/monitors/wwnr/enable
+
+fi
+
+# rv mon runs until terminated
+set_expected_timeout 2s
+
+# Run monitors with different configurations
+check_if_exists "run the monitor without parameters" \
+ "$RV mon wwnr" "$RVDIR/monitors/wwnr" "" "."
+
+check_if_exists "run the monitor as verbose" \
+ "$RV mon wwnr -v" "$RVDIR/monitors/wwnr" \
+ "my pid is \$pid" "\(event\|error\)"
+
+check_if_exists "run the monitor with a reactor" \
+ "$RV mon wwnr -r printk & sleep .5 && cat $RVDIR/monitors/wwnr/reactors && wait" \
+ "$RVDIR/monitors/wwnr/reactors" "\[printk\]"
+
+check_if_exists "reactor is restored after exit" \
+ "cat $RVDIR/monitors/wwnr/reactors" \
+ "$RVDIR/monitors/wwnr/reactors" "\[nop\]"
+
+check_if_exists "run a nested monitor with a reactor" \
+ "$RV mon snroc -r printk & sleep .5 && cat $RVDIR/monitors/sched/snroc/reactors && wait" \
+ "$RVDIR/monitors/sched/snroc/reactors" "\[printk\]"
+
+check_if_exists "run an explicitly nested monitor with a reactor" \
+ "$RV mon sched:sssw -r printk & sleep .5 && cat $RVDIR/monitors/sched/sssw/reactors && wait" \
+ "$RVDIR/monitors/sched/sssw/reactors" "\[printk\]"
+
+check_if_exists "run container monitor" \
+ "$RV mon sched & sleep .5 && cat $RVDIR/monitors/sched/{sssw,sco}/enable && wait" \
+ "$RVDIR/monitors/sched" "1" "0" "^1$"
+
+# Regexes for the trace
+header="^[[:space:]]\+\(\([][A-Z_x<>-]\+\||\)[[:space:]]*\)\+$"
+type="\(event\|error\)[[:space:]]\+"
+genpid="[0-9]\+[[:space:]]\+"
+selfpid="\$pid[[:space:]]\+"
+cpu="\[[0-9]\{3\}\][[:space:]]\+"
+state="[a-z_]\+ "
+trace_task="${genpid}${cpu}${type}${genpid}${state}"
+trace_task_self="${genpid}${cpu}${type}${selfpid}${state}"
+trace_cpu="${genpid}${cpu}${type}${state}"
+trace_cpu_self="${selfpid}${cpu}${type}${state}"
+
+check_if_exists "run per-task monitor with tracing" \
+ "$RV mon sssw -t" "$RVDIR/monitors/sched/sssw" \
+ "$header" "$trace_task_self" "\($header\|$trace_task\)"
+
+check_if_exists "run per-task monitor tracing also self" \
+ "$RV mon sched:sssw -t -s" "$RVDIR/monitors/sched/sssw" \
+ "$trace_task_self" "" "\($header\|$trace_task\)"
+
+check_if_exists "run per-cpu monitor with tracing" \
+ "$RV mon sched:sco -t" "$RVDIR/monitors/sched/sco" \
+ "$header" "$trace_cpu_self" "\($header\|$trace_cpu\)"
+
+check_if_exists "run per-cpu monitor tracing also self" \
+ "$RV mon sco -t -s" "$RVDIR/monitors/sched/sco" \
+ "$trace_cpu_self" "" "\($header\|$trace_cpu\)"
+
+test_end
diff --git a/tools/verification/tests/engine.sh b/tools/verification/tests/engine.sh
new file mode 100644
index 000000000000..76cc254ff94c
--- /dev/null
+++ b/tools/verification/tests/engine.sh
@@ -0,0 +1,132 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+test_begin() {
+ # Count tests to allow the test harness to double-check if all were
+ # included correctly.
+ ctr=0
+ [ -z "$RV" ] && RV="../rv/rv"
+ [ -n "$TEST_COUNT" ] && echo "1..$TEST_COUNT"
+}
+
+failure() {
+ fail=1
+ if [ $# -gt 0 ]; then
+ failbuf+="$1"
+ failbuf+=$'\n'
+ fi
+}
+
+report() {
+ local desc="$1"
+
+ if [ "$fail" -eq 0 ]; then
+ echo "ok $ctr - $desc"
+ else
+ # Add output and exit code as comments in case of failure
+ echo "not ok $ctr - $desc"
+ echo -n "$failbuf"
+ echo "$result" | col -b | while read -r line; do echo "# $line"; done
+ printf "#\n# exit code %s\n" "$exitcode"
+ fi
+}
+
+_check() {
+ local command=$2
+ local expected_exitcode=${3:-0}
+ local expected_output=$4
+ local unexpected_output=$5
+ local all_lines_pattern=$6
+ local patterns="$expected_output $unexpected_output $all_lines_pattern"
+
+ eval "$TIMEOUT" "$command" &> check_output.$$ &
+ bgpid=$!
+ pid=$(pgrep -f "${command%%[|;&>]*}" | tail -n1)
+ wait $bgpid
+ exitcode=$?
+ result=$(tr -d '\0' < check_output.$$)
+ rm -f check_output.$$
+
+ failbuf=''
+ fail=0
+
+ # Suppress any other error if a needed pid is empty
+ if [ -z "$pid" ] && grep -q "\$pid" <<< "$patterns"; then
+ result=''
+ failure "# Empty pid for $command"
+ return 1
+ fi
+
+ expected_output="${expected_output//\$pid/$pid}"
+ unexpected_output="${unexpected_output//\$pid/$pid}"
+ all_lines_pattern="${all_lines_pattern//\$pid/$pid}"
+
+ # Test if the results matches if requested
+ if [ -n "$expected_output" ] && ! grep -qe "$expected_output" <<< "$result"; then
+ failure "# Output match failed: \"$expected_output\""
+ fi
+
+ if [ -n "$unexpected_output" ] && grep -qe "$unexpected_output" <<< "$result"; then
+ failure "# Output non-match failed: \"$unexpected_output\""
+ fi
+
+ if [ -n "$all_lines_pattern" ] && grep -vqe "$all_lines_pattern" <<< "$result"; then
+ failure "# All-lines pattern failed: \"$all_lines_pattern\""
+ fi
+
+ if [ $exitcode -ne "$expected_exitcode" ]; then
+ failure "# Expected exit code $expected_exitcode"
+ fi
+}
+
+check() {
+ # Simple check: run the command with given arguments and test exit code.
+ # If TEST_COUNT is set, run the test. Otherwise, just count.
+ ctr=$((ctr + 1))
+ if [ -n "$TEST_COUNT" ]; then
+ _check "$@"
+ report "$1"
+ fi
+}
+
+check_if_exists() {
+ # Conditional check that skips if a file or folder doesn't exist
+ local desc=$1
+ local command=$2
+ local file=$3
+ local expected_output=$4
+ local unexpected_output=$5
+ local all_lines_pattern=$6
+
+ ctr=$((ctr + 1))
+ if [ -n "$TEST_COUNT" ]; then
+ if [ ! -e "$file" ]; then
+ echo "ok $ctr - $desc # SKIP file not found: $file"
+ else
+ _check "$desc" "$command" 0 "$expected_output" \
+ "$unexpected_output" "$all_lines_pattern"
+ report "$desc"
+ fi
+ fi
+}
+
+set_timeout() {
+ TIMEOUT="timeout -v -k 15s $1"
+}
+
+set_expected_timeout() {
+ TIMEOUT="timeout --preserve-status -k 15s $1"
+}
+
+unset_timeout() {
+ unset TIMEOUT
+}
+
+test_end() {
+ # If running without TEST_COUNT, tests are not actually run, just
+ # counted. In that case, re-run the test with the correct count.
+ [ -z "$TEST_COUNT" ] && TEST_COUNT=$ctr exec bash "$0" || true
+}
+
+# Avoid any environmental discrepancies
+export LC_ALL=C
+unset_timeout
--
2.54.0
^ permalink raw reply related
* [PATCH v2 04/14] tools/rv: Fix cleanup after failed trace setup
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
Currently if ikm_setup_trace_instance() fails, the tool returns without
any cleanup, if rv was called with both -t and -r, this means the
reactor is not going to be cleared.
Jump to the cleanup label to restore the reactor if necessary.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rv/src/in_kernel.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/verification/rv/src/in_kernel.c b/tools/verification/rv/src/in_kernel.c
index 95adae321c11..131a6787d639 100644
--- a/tools/verification/rv/src/in_kernel.c
+++ b/tools/verification/rv/src/in_kernel.c
@@ -809,7 +809,7 @@ int ikm_run_monitor(char *monitor_name, int argc, char **argv)
if (config_trace) {
inst = ikm_setup_trace_instance(nested_name);
if (!inst)
- return -1;
+ goto out_free_instance;
}
retval = ikm_enable(full_name);
--
2.54.0
^ permalink raw reply related
* [PATCH v2 03/14] tools/rv: Fix exit status when monitor execution fails
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
When running "rv mon" on a monitor that is already enabled, the tool
fails to start but incorrectly exits with a success status (0).
Fix the exit condition to ensure it returns a failure code on any
execution error.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rv/src/rv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/verification/rv/src/rv.c b/tools/verification/rv/src/rv.c
index b8fe24a87d97..6d65f4037581 100644
--- a/tools/verification/rv/src/rv.c
+++ b/tools/verification/rv/src/rv.c
@@ -127,7 +127,7 @@ static void rv_mon(int argc, char **argv)
if (!run)
err_msg("rv: monitor %s does not exist\n", monitor_name);
- exit(!run);
+ exit(run <= 0);
}
static void usage(int exit_val, const char *fmt, ...)
--
2.54.0
^ permalink raw reply related
* [PATCH v2 02/14] tools/rv: Fix substring match when listing container monitors
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
When listing monitors within a specific container (rv list <container>),
the tool incorrectly matched monitors if the requested container name
was only a prefix of the actual container (e.g., 'rv list sche' would
incorrectly list monitors from 'sched:').
Fix this by ensuring the container name is an exact match and is
immediately followed by the ':' separator.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rv/src/in_kernel.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/verification/rv/src/in_kernel.c b/tools/verification/rv/src/in_kernel.c
index a200b63f4509..95adae321c11 100644
--- a/tools/verification/rv/src/in_kernel.c
+++ b/tools/verification/rv/src/in_kernel.c
@@ -193,8 +193,12 @@ static int ikm_fill_monitor_definition(char *name, struct monitor *ikm, char *co
nested_name = strstr(name, ":");
if (nested_name) {
/* it belongs in container if it starts with "container:" */
- if (container && strstr(name, container) != name)
- return 1;
+ if (container) {
+ int len = strlen(container);
+
+ if (strncmp(name, container, len) || name[len] != ':')
+ return 1;
+ }
*nested_name = '/';
++nested_name;
ikm->nested = 1;
--
2.54.0
^ permalink raw reply related
* [PATCH v2 01/14] tools/rv: Fix substring match bug in monitor name search
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, Steven Rostedt, Gabriele Monaco
Cc: Nam Cao, Thomas Weissschuh, Tomas Glozar, John Kacur, Wen Yang
In-Reply-To: <20260514152055.229162-1-gmonaco@redhat.com>
__ikm_find_monitor_name() relies on strstr() to find a monitor by name,
which fails if the target monitor is a substring of a previously listed
monitor.
Fix it by tokenizing the available_monitors file and matching full
tokens instead.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/verification/rv/src/in_kernel.c | 48 ++++++++++++++-------------
1 file changed, 25 insertions(+), 23 deletions(-)
diff --git a/tools/verification/rv/src/in_kernel.c b/tools/verification/rv/src/in_kernel.c
index 4bb746ea6e17..a200b63f4509 100644
--- a/tools/verification/rv/src/in_kernel.c
+++ b/tools/verification/rv/src/in_kernel.c
@@ -58,38 +58,40 @@ static int __ikm_read_enable(char *monitor_name)
*/
static int __ikm_find_monitor_name(char *monitor_name, char *out_name)
{
- char *available_monitors, container[MAX_DA_NAME_LEN+1], *cursor, *end;
- int retval = 1;
+ char *available_monitors, *cursor, *line;
+ int len = strlen(monitor_name);
+ int found = 0;
available_monitors = tracefs_instance_file_read(NULL, "rv/available_monitors", NULL);
if (!available_monitors)
return -1;
- cursor = strstr(available_monitors, monitor_name);
- if (!cursor) {
- retval = 0;
- goto out_free;
- }
+ config_is_container = 0;
+ cursor = available_monitors;
+ while ((line = strsep(&cursor, "\n"))) {
+ char *colon = strchr(line, ':');
- for (; cursor > available_monitors; cursor--)
- if (*(cursor-1) == '\n')
- break;
- end = strstr(cursor, "\n");
- memcpy(out_name, cursor, end-cursor);
- out_name[end-cursor] = '\0';
-
- cursor = strstr(out_name, ":");
- if (cursor)
- *cursor = '/';
- else {
- sprintf(container, "%s:", monitor_name);
- if (strstr(available_monitors, container))
- config_is_container = 1;
+ if (strcmp(line, monitor_name) && (!colon || strcmp(colon + 1, monitor_name)))
+ continue;
+
+ strncpy(out_name, line, 2 * MAX_DA_NAME_LEN);
+ out_name[2 * MAX_DA_NAME_LEN - 1] = '\0';
+
+ if (colon) {
+ out_name[colon - line] = '/';
+ } else {
+ /* If there are children, they are on the next line. */
+ line = strsep(&cursor, "\n");
+ if (line && !strncmp(line, monitor_name, len) && line[len] == ':')
+ config_is_container = 1;
+ }
+
+ found = 1;
+ break;
}
-out_free:
free(available_monitors);
- return retval;
+ return found;
}
/*
--
2.54.0
^ permalink raw reply related
* [PATCH v2 00/14] rv: Add selftests to tools and KUnit tests
From: Gabriele Monaco @ 2026-05-14 15:20 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Gabriele Monaco, Steven Rostedt, Nam Cao, Thomas Weissschuh,
Tomas Glozar, John Kacur, Wen Yang
This series adds support to the make check target in the rv userspace
tool and the rvgen script, this allows to quickly validate its
functionality. The selftest framework is inspired by the one used in
RTLA.
A few bugs in both tools were also discovered and are fixed as part of
this series.
Additionally it adds unit tests for models. This is achieved by running
the handlers functions directly within KUnit, emulating all modules
paths as if real kernel events fired.
Unit tests emulate a series of events that are expected to trigger
violations and checks that a reaction occurred, stub structs and
functions are used so the kernel is not affected by the test.
Differences since RFC [1]:
* Fix issue with LTL generator printing literals as uppercase
* Add missing state label in selftest dot spec
* Fail selftest if pid was required but not found (harness error)
* Remove useless static keywords in KUnit tests
* Assert after kunit_kzalloc()
* Use RV_KUNIT_EXPECT_REACTION_HERE to avoid false positives
* Prevent running RV monitors and events together with KUnit
* Rearrange KUnit testing headers
* Expect no reaction at the end of KUnit test cases
* Fix broken nomiss test and allocation
[1] - https://lore.kernel.org/lkml/20260427151134.192971-1-gmonaco@redhat.com
To: linux-trace-kernel@vger.kernel.org
To: linux-kernel@vger.kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Thomas Weissschuh <thomas.weissschuh@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Wen Yang <wen.yang@linux.dev>
Gabriele Monaco (14):
tools/rv: Fix substring match bug in monitor name search
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix exit status when monitor execution fails
tools/rv: Fix cleanup after failed trace setup
tools/rv: Add selftests
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Add golden and spec folders for tests
verification/rvgen: Add selftests
rv: Add KUnit stub to rv_react() and rv_*_task_monitor_slot()
rv: Add KUnit tests for some DA/HA monitors
rv: Add KUnit stubs for current and smp_processor_id()
rv: Prevent unintentional tracepoints during KUnit tests
rv: Add KUnit tests for some LTL monitors
include/rv/da_monitor.h | 38 +++
include/rv/instrumentation.h | 5 +
include/rv/kunit.h | 59 +++++
include/rv/ltl_monitor.h | 38 +++
kernel/trace/rv/Kconfig | 14 +
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/nomiss/nomiss.c | 44 ++++
kernel/trace/rv/monitors/opid/opid.c | 26 ++
.../trace/rv/monitors/pagefault/pagefault.c | 27 +-
kernel/trace/rv/monitors/sco/sco.c | 24 ++
kernel/trace/rv/monitors/sleep/sleep.c | 76 +++++-
kernel/trace/rv/monitors/sssw/sssw.c | 29 ++
kernel/trace/rv/monitors/sts/sts.c | 35 +++
kernel/trace/rv/rv.c | 5 +
kernel/trace/rv/rv_monitors_test.c | 166 ++++++++++++
kernel/trace/rv/rv_reactors.c | 7 +
tools/verification/rv/Makefile | 5 +-
tools/verification/rv/src/in_kernel.c | 58 ++--
tools/verification/rv/src/rv.c | 2 +-
tools/verification/rv/tests/rv_list.t | 48 ++++
tools/verification/rv/tests/rv_mon.t | 95 +++++++
tools/verification/rvgen/Makefile | 4 +
tools/verification/rvgen/__main__.py | 10 +-
tools/verification/rvgen/rvgen/ltl2ba.py | 9 +-
.../rvgen/tests/golden/da_global/Kconfig | 9 +
.../rvgen/tests/golden/da_global/da_global.c | 95 +++++++
.../rvgen/tests/golden/da_global/da_global.h | 47 ++++
.../tests/golden/da_global/da_global_trace.h | 15 ++
.../tests/golden/da_perobj_parent/Kconfig | 11 +
.../da_perobj_parent/da_perobj_parent.c | 110 ++++++++
.../da_perobj_parent/da_perobj_parent.h | 64 +++++
.../da_perobj_parent/da_perobj_parent_trace.h | 15 ++
.../tests/golden/da_pertask_desc/Kconfig | 9 +
.../golden/da_pertask_desc/da_pertask_desc.c | 105 ++++++++
.../golden/da_pertask_desc/da_pertask_desc.h | 64 +++++
.../da_pertask_desc/da_pertask_desc_trace.h | 15 ++
.../rvgen/tests/golden/ha_percpu/Kconfig | 9 +
.../rvgen/tests/golden/ha_percpu/ha_percpu.c | 244 +++++++++++++++++
.../rvgen/tests/golden/ha_percpu/ha_percpu.h | 72 +++++
.../tests/golden/ha_percpu/ha_percpu_trace.h | 19 ++
.../rvgen/tests/golden/ltl_pertask/Kconfig | 9 +
.../tests/golden/ltl_pertask/ltl_pertask.c | 107 ++++++++
.../tests/golden/ltl_pertask/ltl_pertask.h | 108 ++++++++
.../golden/ltl_pertask/ltl_pertask_trace.h | 14 +
.../rvgen/tests/golden/test_container/Kconfig | 5 +
.../golden/test_container/test_container.c | 35 +++
.../golden/test_container/test_container.h | 3 +
.../rvgen/tests/golden/test_da/Kconfig | 9 +
.../rvgen/tests/golden/test_da/test_da.c | 95 +++++++
.../rvgen/tests/golden/test_da/test_da.h | 47 ++++
.../tests/golden/test_da/test_da_trace.h | 15 ++
.../rvgen/tests/golden/test_ha/Kconfig | 9 +
.../rvgen/tests/golden/test_ha/test_ha.c | 247 ++++++++++++++++++
.../rvgen/tests/golden/test_ha/test_ha.h | 72 +++++
.../tests/golden/test_ha/test_ha_trace.h | 19 ++
.../rvgen/tests/golden/test_ltl/Kconfig | 11 +
.../rvgen/tests/golden/test_ltl/test_ltl.c | 108 ++++++++
.../rvgen/tests/golden/test_ltl/test_ltl.h | 108 ++++++++
.../tests/golden/test_ltl/test_ltl_trace.h | 14 +
.../rvgen/tests/rvgen_container.t | 20 ++
.../verification/rvgen/tests/rvgen_monitor.t | 87 ++++++
.../rvgen/tests/specs/test_da.dot | 16 ++
.../rvgen/tests/specs/test_da2.dot | 19 ++
.../rvgen/tests/specs/test_ha.dot | 27 ++
.../rvgen/tests/specs/test_invalid.dot | 8 +
.../rvgen/tests/specs/test_invalid.ltl | 1 +
.../rvgen/tests/specs/test_invalid_ha.dot | 16 ++
.../rvgen/tests/specs/test_ltl.ltl | 1 +
tools/verification/tests/engine.sh | 166 ++++++++++++
69 files changed, 3075 insertions(+), 49 deletions(-)
create mode 100644 include/rv/kunit.h
create mode 100644 kernel/trace/rv/rv_monitors_test.c
create mode 100644 tools/verification/rv/tests/rv_list.t
create mode 100644 tools/verification/rv/tests/rv_mon.t
create mode 100644 tools/verification/rvgen/tests/golden/da_global/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global.c
create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global.h
create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent.c
create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent.h
create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc.c
create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc.h
create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu.c
create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu.h
create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask.c
create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask.h
create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/test_container/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/test_container/test_container.c
create mode 100644 tools/verification/rvgen/tests/golden/test_container/test_container.h
create mode 100644 tools/verification/rvgen/tests/golden/test_da/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da.c
create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da.h
create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/test_ha/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha.c
create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha.h
create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha_trace.h
create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/Kconfig
create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl.c
create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl.h
create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl_trace.h
create mode 100644 tools/verification/rvgen/tests/rvgen_container.t
create mode 100644 tools/verification/rvgen/tests/rvgen_monitor.t
create mode 100644 tools/verification/rvgen/tests/specs/test_da.dot
create mode 100644 tools/verification/rvgen/tests/specs/test_da2.dot
create mode 100644 tools/verification/rvgen/tests/specs/test_ha.dot
create mode 100644 tools/verification/rvgen/tests/specs/test_invalid.dot
create mode 100644 tools/verification/rvgen/tests/specs/test_invalid.ltl
create mode 100644 tools/verification/rvgen/tests/specs/test_invalid_ha.dot
create mode 100644 tools/verification/rvgen/tests/specs/test_ltl.ltl
create mode 100644 tools/verification/tests/engine.sh
base-commit: e1914add2799225a87502051415fc5c32aeb02ae
--
2.54.0
^ permalink raw reply
* Re: [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: bot+bpf-ci @ 2026-05-14 14:55 UTC (permalink / raw)
To: jolsa, oleg, peterz, mingo, mhiramat, andrii
Cc: bpf, linux-trace-kernel, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260514135342.22130-8-jolsa@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 82b3c0ce9253..d553485e7db5 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
[ ... ]
> diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> index be34c4087ff5..606601ccdc42 100644
> --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
> index f00cb52874e0..0ee78fb050a1 100644
> --- a/tools/testing/selftests/bpf/progs/test_usdt.c
> +++ b/tools/testing/selftests/bpf/progs/test_usdt.c
[ ... ]
> diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> index b359b389f6c0..5e38f8605b02 100644
> --- a/tools/testing/selftests/bpf/usdt_2.c
> +++ b/tools/testing/selftests/bpf/usdt_2.c
> @@ -13,4 +13,17 @@ void usdt_2(void)
> USDT(optimized_attach, usdt_2);
> }
>
> +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> +
> +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> +{
> + unsigned long a1 = usdt_red_zone_arg1;
> + unsigned long a2 = usdt_red_zone_arg2;
> + unsigned long a3 = usdt_red_zone_arg3;
> +
> + USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
> +}
> +
> #endif
Is this test reliable across different compiler versions and optimizations?
The USDT macro uses the "nor" inline assembly constraint, which allows the
compiler to allocate these operands to CPU registers. Because there is no
register pressure in usdt_red_zone_trigger(), modern compilers might allocate
a1, a2, and a3 directly to registers rather than the stack's red zone.
If the operands are placed in registers, the uprobe CALL optimization
clobbering [rsp-8] will not corrupt them. This could cause the test to pass
even on buggy kernels, creating a false positive.
Would it be safer to explicitly force these operands into the red zone using
inline assembly constraints, rather than depending on the compiler's register
allocator?
Reference: https://lore.kernel.org/bpf/20260509021238.AE98FC2BCB0@smtp.kernel.org/
Raised by: sashiko-bot@kernel.org on Sat, 09 May 2026
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25864669093
^ permalink raw reply
* Re: [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: bot+bpf-ci @ 2026-05-14 14:55 UTC (permalink / raw)
To: jolsa, oleg, peterz, mingo, mhiramat, andrii
Cc: bpf, linux-trace-kernel, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260514135342.22130-3-jolsa@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 630 bytes --]
> libbpf: Change has_nop_combo to work on top of nop10
>
> We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> fixing has_nop_combo to reflect that.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
This looks like a bug fix for kernel/libbpf integration. Should this
include:
Fixes: 9734c3ac2f23 ("uprobes/x86: Move optimized uprobe from nop5 to nop10")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25864669093
^ permalink raw reply
* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-14 14:37 UTC (permalink / raw)
To: Lance Yang
Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260514132830.25622-1-lance.yang@linux.dev>
On Thu, May 14, 2026 at 09:28:30PM +0800, Lance Yang wrote:
>
> On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
> >get_any_page() collapses three different failure modes into a single
> >-EIO return:
> >
> > * the put_page race in the !count_increased path;
> > * the HWPoisonHandlable() rejection that bounces out of
> > __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
> > * the HWPoisonHandlable() rejection that goes through the
> > count_increased / put_page / shake_page retry loop.
> >
> >The first is transient (the page is racing with the allocator). The
> >second can be either transient (a userspace folio briefly off LRU
> >during migration/compaction) or stable (slab/vmalloc/page-table/
> >kernel-stack pages). The third describes a stable kernel-owned page
> >that the count_increased=true caller already held a reference on.
> >
> >Distinguish them on the return path: keep -EIO for both the put_page
> >race and the -EBUSY-after-retries branch (shake_page() cannot drag a
> >folio back from active migration, so we cannot prove the page is
> >permanently kernel-owned from there), keep -EBUSY for the allocation
> >race (unchanged), and return -ENOTRECOVERABLE only from the
> >count_increased-true HWPoisonHandlable() rejection that exhausts its
> >retries -- the caller's reference is structural evidence that the
> >page is owned by the kernel.
> >
> >Extend the unhandlable-page pr_err() to fire for either errno and
> >update the get_hwpoison_page() kerneldoc.
> >
> >memory_failure() still folds every negative return into
> >MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> >this patch is a no-op for users of memory_failure() and only changes
> >the errno that soft_offline_page() can propagate to its callers. A
> >follow-up wires the new return code through memory_failure() and
> >reports MF_MSG_KERNEL for the unrecoverable cases.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Signed-off-by: Breno Leitao <leitao@debian.org>
> >---
> > mm/memory-failure.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >index 49bcfbd04d213..bae883df3ccb2 100644
> >--- a/mm/memory-failure.c
> >+++ b/mm/memory-failure.c
> >@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> > shake_page(p);
> > goto try_again;
> > }
> >+ /*
> >+ * Return -EIO rather than -ENOTRECOVERABLE: this
> >+ * branch is also reached for pages that are merely
> >+ * off-LRU transiently (e.g. a folio in the middle
> >+ * of migration or compaction), which shake_page()
> >+ * cannot drag back. The caller cannot prove the
> >+ * page is permanently kernel-owned from here, so
> >+ * keep it on the recoverable errno.
> >+ */
> > ret = -EIO;
> > goto out;
> > }
> >@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> > goto try_again;
> > }
> > put_page(p);
> >- ret = -EIO;
> >+ ret = -ENOTRECOVERABLE;
> > }
> > out:
> >- if (ret == -EIO)
> >+ if (ret == -EIO || ret == -ENOTRECOVERABLE)
> > pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> >
> > return ret;
> >@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
> > * -EIO for pages on which we can not handle memory errors,
> > * -EBUSY when get_hwpoison_page() has raced with page lifecycle
> > * operations like allocation and free,
> >- * -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> >+ * -EHWPOISON when the page is hwpoisoned and taken off from buddy,
> >+ * -ENOTRECOVERABLE for stable kernel-owned pages the handler
> >+ * cannot recover (PG_reserved, slab, vmalloc, page tables,
> >+ * kernel stacks, and similar non-LRU/non-buddy pages).
>
> Did you test this patch series? I don't see how we ever get to
> -ENOTRECOVERABLE there ...
Yes, I did. I am using the following test case:
https://github.com/leitao/linux/commit/cfebe84ddeab5ac34ed456331db980d57e7025dc
# RUN_DESTRUCTIVE=1 tools/testing/selftests/mm/hwpoison-panic.sh
# enabling /proc/sys/vm/panic_on_unrecoverable_memory_failure
# injecting hwpoison at phys 0x2a00000 (Kernel rodata)
# expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'
[ 501.113256] Memory failure: 0x2a00: recovery action for reserved kernel page: Ignored
[ 501.113956] Kernel panic - not syncing: Memory failure: 0x2a00: unrecoverable page
> Even with MF_COUNT_INCREASED, the first pass does:
>
> if (flags & MF_COUNT_INCREASED)
> count_increased = true;
>
> [...]
>
> if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
> ret = 1;
> } else {
> if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
> put_page(p);
> shake_page(p);
> count_increased = false;
> goto try_again; <-
> }
> put_page(p);
> ret = -ENOTRECOVERABLE;
> }
>
> Then we come back with count_increased=false:
>
> try_again:
> if (!count_increased) {
> ret = __get_hwpoison_page(p, flags); <-
> if (!ret) {
> [...]
> } else if (ret == -EBUSY) { <-
> [...]
> ret = -EIO;
> goto out; <-
> }
> }
>
> For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
>
> if (!HWPoisonHandlable(&folio->page, flags))
> return -EBUSY;
>
> so they still seem to end up as -EIO ... Am I missing something?
You are not, and thanks for catching this. I traced it again and the
-ENOTRECOVERABLE branch is unreachable for slab/vmalloc/page-table pages
exactly as you described. The __get_hwpoison_page() → -EBUSY → shake → retry
loop catches them first and they exit as -EIO.
The selftest I am using (link above) only validated the PageReserved
short-circuit added in patch 3, which lives in memory_failure() and never
reaches get_any_page().
I even thought about this code path, and I was not convinced we should return
-ENOTRECOVERABLE, thus I documented the following (as in this current patch)
@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
shake_page(p);
goto try_again;
}
+ /*
+ * Return -EIO rather than -ENOTRECOVERABLE: this
+ * branch is also reached for pages that are merely
+ * off-LRU transiently (e.g. a folio in the middle
+ * of migration or compaction), which shake_page()
+ * cannot drag back. The caller cannot prove the
+ * page is permanently kernel-owned from here, so
+ * keep it on the recoverable errno.
+ */
ret = -EIO;
^ permalink raw reply
* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-05-14 14:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
linux-arch, linux-mm, linux-trace-kernel, kernel-team,
Paul E. McKenney
In-Reply-To: <20260513114102.50f4ca68@gandalf.local.home>
On Wed, May 13, 2026 at 11:41:02AM -0400, Steven Rostedt wrote:
> On Tue, 5 May 2026 17:09:34 +0000
> Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > Use the arch-overridable queued_spin_release(), introduced in the
> > previous commit, to ensure the tracepoint works correctly across all
>
> Remove the ", introduced in the previous commit," That's useless in git
> change logs.
Thanks for the suggestion, will do here and in other places.
[...]
> > /**
> > * queued_spin_unlock - unlock a queued spinlock
> > * @lock : Pointer to queued spinlock structure
> > + *
> > + * Generic tracing wrapper around the arch-overridable
> > + * queued_spin_release().
> > */
> > static __always_inline void queued_spin_unlock(struct qspinlock *lock)
> > {
> > + /*
> > + * Trace and release are combined in queued_spin_release_traced() so
> > + * the compiler does not need to preserve the lock pointer across the
> > + * function call, avoiding callee-saved register save/restore on the
> > + * hot path.
> > + */
> > + if (tracepoint_enabled(contended_release)) {
> > + queued_spin_release_traced(lock);
> > + return;
>
> Get rid of the "return;". What does it save you? It just makes it that you
> need to duplicate the code. Even though it's a one liner, it can cause bugs
> in the future if this changes. You could call the function:
>
> do_trace_queued_spin_release_traced(lock);
>
>
> > + }
> > queued_spin_release(lock);
> > }
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index af8d122bb649..649fdca69288 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -104,6 +104,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
> > #define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
> > #endif
> >
> > +void __lockfunc queued_spin_release_traced(struct qspinlock *lock)
> > +{
> > + if (queued_spin_is_contended(lock))
> > + trace_call__contended_release(lock);
> > + queued_spin_release(lock);
>
> And then remove the duplicate call of "queued_spin_release()" here.
This is the scenario the comment above the static branch describes.
Here's what it looks like in practice on x86_64 (defconfig, compiled
with GCC 11).
Current design (trace + unlock combined, with return):
endbr64
xchg %ax,%ax ; NOP (static branch)
movb $0x0,(%rdi) ; unlock
decl %gs:__preempt_count
je preempt
jmp __x86_return_thunk
call queued_spin_release_traced ; cold
jmp preempt_handling ; cold
call __SCT__preempt_schedule
jmp __x86_return_thunk
With the trace-only function (no return, unlock after the call):
endbr64
push %rbx ; saves callee-saved rbx (!)
mov %rdi,%rbx ; preserve lock across call (!)
xchg %ax,%ax ; NOP (static branch)
movb $0x0,(%rbx) ; unlock
decl %gs:__preempt_count
je preempt
pop %rbx ; callee-saved restore (!)
jmp __x86_return_thunk
call queued_spin_release_traced ; cold
jmp unlock ; cold
call __SCT__preempt_schedule
pop %rbx
jmp __x86_return_thunk
Three extra instructions marked by "!" on the hot path (push, mov, pop),
all wasted when the tracepoint is off. That's the main reason for
combining trace and unlock in the same out-of-line function.
^ permalink raw reply
* [RFC PATCH v2.1 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>
Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
mm/damon/core.c | 9 +++++++++
2 files changed, 47 insertions(+)
diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..2b96a03876034 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
TP_printk("sample_us=%lu", __entry->sample_us)
);
+TRACE_EVENT_CONDITION(damon_aggregated_v2,
+
+ TP_PROTO(unsigned int target_id, struct damon_region *r,
+ unsigned int nr_regions, unsigned int nr_probes),
+
+ TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+ TP_CONDITION(nr_probes > 0),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, target_id)
+ __field(unsigned long, start)
+ __field(unsigned long, end)
+ __field(unsigned int, nr_regions)
+ __field(unsigned int, nr_accesses)
+ __field(unsigned int, age)
+ __dynamic_array(unsigned char, probe_hits, nr_probes)
+ ),
+
+ TP_fast_assign(
+ __entry->target_id = target_id;
+ __entry->start = r->ar.start;
+ __entry->end = r->ar.end;
+ __entry->nr_regions = nr_regions;
+ __entry->nr_accesses = r->nr_accesses;
+ __entry->age = r->age;
+ memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+ sizeof(*r->probe_hits) * nr_probes);
+ ),
+
+ TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+ __entry->target_id, __entry->nr_regions,
+ __entry->start, __entry->end,
+ __entry->nr_accesses, __entry->age,
+ __print_hex(__get_dynamic_array(probe_hits),
+ __get_dynamic_array_len(probe_hits)))
+);
+
TRACE_EVENT(damon_aggregated,
TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fe6c789f2cecb..0ad1a0af06893 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1905,6 +1905,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
{
struct damon_target *t;
unsigned int ti = 0; /* target's index */
+ unsigned int nr_probes = 0;
+ struct damon_probe *probe;
+
+ if (trace_damon_aggregated_v2_enabled()) {
+ damon_for_each_probe(probe, c)
+ nr_probes++;
+ }
damon_for_each_target(t, c) {
struct damon_region *r;
@@ -1913,6 +1920,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
int i;
trace_damon_aggregated(ti, r, damon_nr_regions(t));
+ trace_damon_aggregated_v2(ti, r, damon_nr_regions(t),
+ nr_probes);
damon_warn_fix_nr_accesses_corruption(r);
r->last_nr_accesses = r->nr_accesses;
r->nr_accesses = 0;
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v2.1 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
linux-trace-kernel
TL; DR
======
Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring. In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and
PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.
Background: High Cost of Page Level Properties Monitoring
=========================================================
DAMON is initially introduced as a Data Access MONitor. It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS). But still the monitoring part is only for
data accesses.
Data access patterns is good information, but some users need more
holistic views. Particularly, users want to show the access pattern
information together with the types of the memory. For example, users
who work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages. Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.
For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14. Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters. Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.
This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size. It was useful for test or debugging
purposes on a small number of machines. But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches. For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet. If the runtime and the size of the
fleet is long and big enough, it should provide statistically meaningful
data.
But users are too busy to implement such controls on their own.
Data Attributes Monitoring
==========================
Extend DAMON to monitor not only data accesses, but also general data
attributes. Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.
Allow users to specify what data attributes in addition to the data
access they want to monitor. Users can install one 'data probe' per
data attribute of their interest for this purpose. The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute. E.g., if memory of physical
address 42 belongs to cgroup A. Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.
When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered.
Same to the number of access check-positive samples accounting
(nr_accesses), it accounts the number of each data probe-positive
samples in another per-region counters array, namely 'probe_hits'. When
DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.
Users can read 'probe_hits' just before the values are reset. In this
way, users can know how many hot/cold memory regions have data
attributes of their interest. E.g., 30 percent of this system's hot
memory is belonging to cgroup A, and 80 percent of the cgroup
A-belonging hot memory is backed by huge pages.
Patches Sequence
================
First eight patches implement the core feature, interface and the
working support. Patch 1 introduces data probe data structure, namely
damon_probe. Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter. Patch 4 updates damon_ctx commit function
to handle the probes. Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits. Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation. Patch 7 updates kdamond_fn() to invoke the probes
applying callback. Patch 8 finally implements the probes support on
paddr ops.
Ten changes for user interface (patches 9-18) come next. Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively. Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively. Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.
Patch 19 adds a selftest for the sysfs files.
Patches 20 and 21 documents the design and usage of the new feature,
respectively.
Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow. Depending on the feedback, this part might be separated
to another series in future. Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG. Patch 23 add the
support on paddr ops. Patch 24 updates the sysfs interface for setup of
the target memcg. Patch 25 move code for easy reuse of the filter
target memcg setup. Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.
Discussions
===========
This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads. Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimum.
Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information. When in
doubt of the sampling based information, running DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.
Plan for Dropping RFC tag
=========================
I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).
Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.
I'm currently hoping to drop the RFC tag by 7.2-rc1.
Future Works: Mid Term
========================
This version of implementation is limiting the maximum number of data
probes to four. I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.
Future Works: Long Term
=======================
There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring. For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive. Actually there is
another ongoing work [3] for extending DAMON with PMU events. The
motivation of the work is reducing the overhead, though.
In my work [2], I was introducing a new interface for access sampling
primitives control. Now I think this data probe interface can be used
for that, too. That is, data access becomes just one type of data
attribute. Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.
The regions adjustment mechanism is currently working based on the
access information. That's because DAMON is designed for data access
monitoring. That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.
Once data access becomes just one of data attributes, there is no reason
to think data access that special. There might be some users not
interested in access at all but want to know the location of memory of
specific type. Data probes interface will allow doing that. Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute. Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.
DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes. From this
stage, we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".
[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.
SeongJae Park (28):
mm/damon/core: introduce struct damon_probe
mm/damon/core: embed damon_probe objects in damon_ctx
mm/damon/core: introduce damon_filter
mm/damon/core: commit probes
mm/damon/core: introduce damon_region->probe_hits
mm/damon/core: introduce damon_ops->apply_probes
mm/damon/core: do data attributes monitoring
mm/damon/paddr: support data attributes monitoring
mm/damon/sysfs: implement probes dir
mm/damon/sysfs: implement probe dir
mm/damon/sysfs: implement filters directory
mm/damon/sysfs: implement filter dir
mm/damon/sysfs: implement filter dir files
mm/damon/sysfs: setup probes on DAMON core API parameters
mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
mm/damon/sysfs-schemes: implement probe dir
mm/damon/sysfs-schemes: implement probe/hits file
mm/damon: trace probe_hits
selftests/damon/sysfs.sh: test probes dir
Docs/mm/damon/design: document data attributes monitoring
Docs/admin-guide/mm/damon/usage: document data attributes monitoring
mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
mm/damon/sysfs: add filters/<F>/path file
mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
mm/damon/sysfs: setup damon_filter->memcg_id from path
Docs/mm/damon/design: update for memcg damon filter
Docs/admin-guide/mm/damon/usage: update for memcg damon filter
Documentation/admin-guide/mm/damon/usage.rst | 48 +-
Documentation/mm/damon/design.rst | 39 ++
include/linux/damon.h | 67 +++
include/trace/events/damon.h | 38 ++
mm/damon/core.c | 197 +++++++
mm/damon/paddr.c | 76 +++
mm/damon/sysfs-common.c | 41 ++
mm/damon/sysfs-common.h | 2 +
mm/damon/sysfs-schemes.c | 222 ++++++--
mm/damon/sysfs.c | 557 +++++++++++++++++++
tools/testing/selftests/damon/sysfs.sh | 48 ++
11 files changed, 1284 insertions(+), 51 deletions(-)
base-commit: 678b6bc7ce120b8c51d4e05fcb8eb0a92f9be3f6
--
2.47.3
^ permalink raw reply
* [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
From: Andrii Nakryiko <andrii@kernel.org>
The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte
CALL to a trampoline. The CALL pushes a return address onto the stack at
[rsp-8], clobbering whatever was stored there.
On x86-64, the red zone is the 128 bytes below rsp that user code may use
for temporary storage without adjusting rsp. Compilers can place USDT
argument operands there, generating specs like "8@-8(%rbp)" when rbp ==
rsp. With the CALL-based optimization, the return address overwrites that
argument before the BPF-side USDT argument fetch runs.
Add two tests for this case. The uprobe_syscall subtest stores known values
at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop10 uprobe,
and verifies the red-zone data is still intact. The USDT subtest triggers a
probe in a function where the compiler places three USDT operands in the
red zone and verifies that all 10 optimized invocations deliver the expected
argument values to BPF.
On an unfixed kernel, the first hit goes through the INT3 path and later
hits use the optimized CALL path, so the red-zone checks fail after
optimization.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
[ updates to use nop10 ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 75 +++++++++++++++++++
tools/testing/selftests/bpf/prog_tests/usdt.c | 49 ++++++++++++
tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++++
tools/testing/selftests/bpf/usdt_2.c | 13 ++++
4 files changed, 162 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 82b3c0ce9253..d553485e7db5 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -357,6 +357,48 @@ __nocf_check __weak void usdt_test(void)
USDT(optimized_uprobe, usdt);
}
+/*
+ * Assembly-level red zone clobbering test. Stores known values in the
+ * red zone (below RSP), executes a nop10 (uprobe site), and checks that
+ * the values survived. Returns 0 if intact, 1 if clobbered.
+ *
+ * The nop5 optimization used CALL (which pushes a return address to
+ * [rsp-8]), the value at -8(%rsp) was overwritten. The nop10 optimization
+ * should escape that by moving stackpointer below the redzone before
+ * doing the CALL.
+ */
+__attribute__((aligned(16)))
+__nocf_check __weak __naked unsigned long uprobe_red_zone_test(void)
+{
+ asm volatile (
+ "movabs $0x1111111111111111, %%rax\n"
+ "movq %%rax, -8(%%rsp)\n"
+ "movabs $0x2222222222222222, %%rax\n"
+ "movq %%rax, -16(%%rsp)\n"
+ "movabs $0x3333333333333333, %%rax\n"
+ "movq %%rax, -24(%%rsp)\n"
+
+ ".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10: uprobe site */
+
+ "movabs $0x1111111111111111, %%rax\n"
+ "cmpq %%rax, -8(%%rsp)\n"
+ "jne 1f\n"
+ "movabs $0x2222222222222222, %%rax\n"
+ "cmpq %%rax, -16(%%rsp)\n"
+ "jne 1f\n"
+ "movabs $0x3333333333333333, %%rax\n"
+ "cmpq %%rax, -24(%%rsp)\n"
+ "jne 1f\n"
+
+ "xorl %%eax, %%eax\n"
+ "retq\n"
+ "1:\n"
+ "movl $1, %%eax\n"
+ "retq\n"
+ ::: "rax", "memory"
+ );
+}
+
static int find_uprobes_trampoline(void *tramp_addr)
{
void *start, *end;
@@ -855,6 +897,37 @@ static void test_uprobe_race(void)
#define __NR_uprobe 336
#endif
+static void test_uprobe_red_zone(void)
+{
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+ void *nop10_addr;
+ size_t offset;
+ int i;
+
+ nop10_addr = find_nop10(uprobe_red_zone_test);
+ if (!ASSERT_NEQ(nop10_addr, NULL, "find_nop10"))
+ return;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ offset = get_uprobe_offset(nop10_addr);
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "attach_uprobe"))
+ goto cleanup;
+
+ for (i = 0; i < 10; i++)
+ ASSERT_EQ(uprobe_red_zone_test(), 0, "red_zone_intact");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
static void test_uprobe_error(void)
{
long err = syscall(__NR_uprobe);
@@ -881,6 +954,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_usdt();
if (test__start_subtest("uprobe_race"))
test_uprobe_race();
+ if (test__start_subtest("uprobe_red_zone"))
+ test_uprobe_red_zone();
if (test__start_subtest("uprobe_error"))
test_uprobe_error();
if (test__start_subtest("uprobe_regs_equal"))
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index be34c4087ff5..606601ccdc42 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -250,6 +250,7 @@ static void subtest_basic_usdt(bool optimized)
#ifdef __x86_64__
extern void usdt_1(void);
extern void usdt_2(void);
+extern void usdt_red_zone_trigger(void);
static unsigned char nop1[1] = { 0x90 };
static unsigned char nop1_nop10_combo[11] = { 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
@@ -340,6 +341,52 @@ static void subtest_optimized_attach(void)
cleanup:
test_usdt__destroy(skel);
}
+
+/*
+ * Test that USDT arguments survive nop10 optimization in a function where
+ * the compiler places operands in the red zone.
+ *
+ * Signal handlers are prone to having the compiler place USDT argument
+ * operands in the red zone (below rsp).
+ *
+ * The nop5 optimization used CALL (which pushes a return address to
+ * [rsp-8]), the value at -8(%rsp) was overwritten. The nop10 optimization
+ * should escape that by moving stackpointer below the redzone before
+ * doing the CALL.
+ */
+static void subtest_optimized_red_zone(void)
+{
+ struct test_usdt *skel;
+ int i;
+
+ skel = test_usdt__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ skel->bss->expected_arg[0] = 0xDEADBEEF;
+ skel->bss->expected_arg[1] = 0xCAFEBABE;
+ skel->bss->expected_arg[2] = 0xFEEDFACE;
+ skel->bss->expected_pid = getpid();
+
+ skel->links.usdt_check_arg = bpf_program__attach_usdt(
+ skel->progs.usdt_check_arg, 0, "/proc/self/exe",
+ "optimized_attach", "usdt_red_zone", NULL);
+ if (!ASSERT_OK_PTR(skel->links.usdt_check_arg, "attach_usdt_red_zone"))
+ goto cleanup;
+
+ for (i = 0; i < 10; i++)
+ usdt_red_zone_trigger();
+
+ ASSERT_EQ(skel->bss->arg_total, 10, "arg_total");
+ ASSERT_EQ(skel->bss->arg_bad, 0, "arg_bad");
+ ASSERT_EQ(skel->bss->arg_last[0], 0xDEADBEEF, "arg_last_1");
+ ASSERT_EQ(skel->bss->arg_last[1], 0xCAFEBABE, "arg_last_2");
+ ASSERT_EQ(skel->bss->arg_last[2], 0xFEEDFACE, "arg_last_3");
+
+cleanup:
+ test_usdt__destroy(skel);
+}
+
#endif
unsigned short test_usdt_100_semaphore SEC(".probes");
@@ -613,6 +660,8 @@ void test_usdt(void)
subtest_basic_usdt(true);
if (test__start_subtest("optimized_attach"))
subtest_optimized_attach();
+ if (test__start_subtest("optimized_red_zone"))
+ subtest_optimized_red_zone();
#endif
if (test__start_subtest("multispec"))
subtest_multispec_usdt();
diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
index f00cb52874e0..0ee78fb050a1 100644
--- a/tools/testing/selftests/bpf/progs/test_usdt.c
+++ b/tools/testing/selftests/bpf/progs/test_usdt.c
@@ -149,5 +149,30 @@ int usdt_executed(struct pt_regs *ctx)
executed++;
return 0;
}
+
+int arg_total;
+int arg_bad;
+long arg_last[3];
+long expected_arg[3];
+int expected_pid;
+
+SEC("usdt")
+int BPF_USDT(usdt_check_arg, long arg1, long arg2, long arg3)
+{
+ if (expected_pid != (bpf_get_current_pid_tgid() >> 32))
+ return 0;
+
+ __sync_fetch_and_add(&arg_total, 1);
+ arg_last[0] = arg1;
+ arg_last[1] = arg2;
+ arg_last[2] = arg3;
+
+ if (arg1 != expected_arg[0] ||
+ arg2 != expected_arg[1] ||
+ arg3 != expected_arg[2])
+ __sync_fetch_and_add(&arg_bad, 1);
+
+ return 0;
+}
#endif
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index b359b389f6c0..5e38f8605b02 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -13,4 +13,17 @@ void usdt_2(void)
USDT(optimized_attach, usdt_2);
}
+static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
+static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
+static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
+
+void __attribute__((noinline)) usdt_red_zone_trigger(void)
+{
+ unsigned long a1 = usdt_red_zone_arg1;
+ unsigned long a2 = usdt_red_zone_arg2;
+ unsigned long a3 = usdt_red_zone_arg3;
+
+ USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
+}
+
#endif
--
2.53.0
^ permalink raw reply related
* [PATCH 6/7] selftests/bpf: Add reattach tests for uprobe syscall
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
Adding reattach tests for uprobe syscall tests to make sure
we can re-attach and optimize same uprobe multiple times.
The reason is that optimized uprobe does not restore original
nop10 after detach, but instead it uses 'jmp 8' instruction.
Making sure we can still install and optimize uprobe on top
of the 'jmp 8' instruction.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 115 ++++++++++++++++--
1 file changed, 105 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index c2e9e549c737..82b3c0ce9253 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -431,21 +431,27 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
return tramp;
}
-static void check_detach(void *addr, void *tramp)
+static bool check_detach(void *addr, void *tramp)
{
+ bool ok = true;
+
/* [uprobes_trampoline] stays after detach */
- ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
- ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B");
+ if (!ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline"))
+ ok = false;
+ if (!ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B"))
+ ok = false;
+ return ok;
}
-static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
- trigger_t trigger, void *addr, int executed)
+static void *check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
+ trigger_t trigger, void *addr, int executed)
{
void *tramp;
tramp = check_attach(skel, trigger, addr, executed);
bpf_link__destroy(link);
check_detach(addr, tramp);
+ return tramp;
}
static void test_uprobe_legacy(void)
@@ -456,6 +462,7 @@ static void test_uprobe_legacy(void)
);
struct bpf_link *link;
unsigned long offset;
+ void *tramp;
offset = get_uprobe_offset(&uprobe_test);
if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -473,7 +480,28 @@ static void test_uprobe_legacy(void)
if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
goto cleanup;
- check(skel, link, uprobe_test, uprobe_test, 2);
+ tramp = check(skel, link, uprobe_test, uprobe_test, 2);
+
+ /* reattach and detach without triggering optimization */
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ bpf_link__destroy(link);
+ if (!check_detach(uprobe_test, tramp))
+ goto cleanup;
+
+ uprobe_test();
+ ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+ /* reattach with triggering optimization */
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 4);
/* uretprobe */
skel->bss->executed = 0;
@@ -495,6 +523,7 @@ static void test_uprobe_multi(void)
LIBBPF_OPTS(bpf_uprobe_multi_opts, opts);
struct bpf_link *link;
unsigned long offset;
+ void *tramp;
offset = get_uprobe_offset(&uprobe_test);
if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -515,7 +544,28 @@ static void test_uprobe_multi(void)
if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
goto cleanup;
- check(skel, link, uprobe_test, uprobe_test, 2);
+ tramp = check(skel, link, uprobe_test, uprobe_test, 2);
+
+ /* reattach and detach without triggering optimization */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ bpf_link__destroy(link);
+ if (!check_detach(uprobe_test, tramp))
+ goto cleanup;
+
+ uprobe_test();
+ ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+ /* reattach with triggering optimization */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 4);
/* uretprobe.multi */
skel->bss->executed = 0;
@@ -539,6 +589,7 @@ static void test_uprobe_session(void)
);
struct bpf_link *link;
unsigned long offset;
+ void *tramp;
offset = get_uprobe_offset(&uprobe_test);
if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -558,7 +609,28 @@ static void test_uprobe_session(void)
if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
goto cleanup;
- check(skel, link, uprobe_test, uprobe_test, 4);
+ tramp = check(skel, link, uprobe_test, uprobe_test, 4);
+
+ /* reattach and detach without triggering optimization */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_session,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ bpf_link__destroy(link);
+ if (!check_detach(uprobe_test, tramp))
+ goto cleanup;
+
+ uprobe_test();
+ ASSERT_EQ(skel->bss->executed, 4, "executed_no_probe");
+
+ /* reattach with triggering optimization */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_session,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 8);
cleanup:
uprobe_syscall_executed__destroy(skel);
@@ -568,7 +640,7 @@ static void test_uprobe_usdt(void)
{
struct uprobe_syscall_executed *skel;
struct bpf_link *link;
- void *addr;
+ void *addr, *tramp;
errno = 0;
addr = find_nop10(usdt_test);
@@ -587,7 +659,30 @@ static void test_uprobe_usdt(void)
if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
goto cleanup;
- check(skel, link, usdt_test, addr, 2);
+ tramp = check(skel, link, usdt_test, addr, 2);
+
+ /* reattach and detach without triggering optimization */
+ link = bpf_program__attach_usdt(skel->progs.test_usdt,
+ -1 /* all PIDs */, "/proc/self/exe",
+ "optimized_uprobe", "usdt", NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+ goto cleanup;
+
+ bpf_link__destroy(link);
+ if (!check_detach(addr, tramp))
+ goto cleanup;
+
+ usdt_test();
+ ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+ /* reattach with triggering optimization */
+ link = bpf_program__attach_usdt(skel->progs.test_usdt,
+ -1 /* all PIDs */, "/proc/self/exe",
+ "optimized_uprobe", "usdt", NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+ goto cleanup;
+
+ check(skel, link, usdt_test, addr, 4);
cleanup:
uprobe_syscall_executed__destroy(skel);
--
2.53.0
^ permalink raw reply related
* [PATCH 5/7] selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
Changing uprobe/usdt trigger bench code to use nop10 instead
of nop5. Also changing un_bench_uprobes.sh to use nop10 triggers.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/bench.c | 20 +++++------
.../selftests/bpf/benchs/bench_trigger.c | 36 +++++++++----------
.../selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
3 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 6155ce455c27..1252a1af2e84 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -539,12 +539,12 @@ extern const struct bench bench_trig_uretprobe_multi_push;
extern const struct bench bench_trig_uprobe_multi_ret;
extern const struct bench bench_trig_uretprobe_multi_ret;
#ifdef __x86_64__
-extern const struct bench bench_trig_uprobe_nop5;
-extern const struct bench bench_trig_uretprobe_nop5;
-extern const struct bench bench_trig_uprobe_multi_nop5;
-extern const struct bench bench_trig_uretprobe_multi_nop5;
+extern const struct bench bench_trig_uprobe_nop10;
+extern const struct bench bench_trig_uretprobe_nop10;
+extern const struct bench bench_trig_uprobe_multi_nop10;
+extern const struct bench bench_trig_uretprobe_multi_nop10;
extern const struct bench bench_trig_usdt_nop;
-extern const struct bench bench_trig_usdt_nop5;
+extern const struct bench bench_trig_usdt_nop10;
#endif
extern const struct bench bench_rb_libbpf;
@@ -619,12 +619,12 @@ static const struct bench *benchs[] = {
&bench_trig_uprobe_multi_ret,
&bench_trig_uretprobe_multi_ret,
#ifdef __x86_64__
- &bench_trig_uprobe_nop5,
- &bench_trig_uretprobe_nop5,
- &bench_trig_uprobe_multi_nop5,
- &bench_trig_uretprobe_multi_nop5,
+ &bench_trig_uprobe_nop10,
+ &bench_trig_uretprobe_nop10,
+ &bench_trig_uprobe_multi_nop10,
+ &bench_trig_uretprobe_multi_nop10,
&bench_trig_usdt_nop,
- &bench_trig_usdt_nop5,
+ &bench_trig_usdt_nop10,
#endif
/* ringbuf/perfbuf benchmarks */
&bench_rb_libbpf,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index bcc4820c802e..3998ea8ff9aa 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -396,15 +396,15 @@ static void *uprobe_producer_ret(void *input)
}
#ifdef __x86_64__
-__nocf_check __weak void uprobe_target_nop5(void)
+__nocf_check __weak void uprobe_target_nop10(void)
{
asm volatile (".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
}
-static void *uprobe_producer_nop5(void *input)
+static void *uprobe_producer_nop10(void *input)
{
while (true)
- uprobe_target_nop5();
+ uprobe_target_nop10();
return NULL;
}
@@ -418,7 +418,7 @@ static void *uprobe_producer_usdt_nop(void *input)
return NULL;
}
-static void *uprobe_producer_usdt_nop5(void *input)
+static void *uprobe_producer_usdt_nop10(void *input)
{
while (true)
usdt_2();
@@ -542,24 +542,24 @@ static void uretprobe_multi_ret_setup(void)
}
#ifdef __x86_64__
-static void uprobe_nop5_setup(void)
+static void uprobe_nop10_setup(void)
{
- usetup(false, false /* !use_multi */, &uprobe_target_nop5);
+ usetup(false, false /* !use_multi */, &uprobe_target_nop10);
}
-static void uretprobe_nop5_setup(void)
+static void uretprobe_nop10_setup(void)
{
- usetup(true, false /* !use_multi */, &uprobe_target_nop5);
+ usetup(true, false /* !use_multi */, &uprobe_target_nop10);
}
-static void uprobe_multi_nop5_setup(void)
+static void uprobe_multi_nop10_setup(void)
{
- usetup(false, true /* use_multi */, &uprobe_target_nop5);
+ usetup(false, true /* use_multi */, &uprobe_target_nop10);
}
-static void uretprobe_multi_nop5_setup(void)
+static void uretprobe_multi_nop10_setup(void)
{
- usetup(true, true /* use_multi */, &uprobe_target_nop5);
+ usetup(true, true /* use_multi */, &uprobe_target_nop10);
}
static void usdt_setup(const char *name)
@@ -598,7 +598,7 @@ static void usdt_nop_setup(void)
usdt_setup("usdt_1");
}
-static void usdt_nop5_setup(void)
+static void usdt_nop10_setup(void)
{
usdt_setup("usdt_2");
}
@@ -665,10 +665,10 @@ BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
#ifdef __x86_64__
-BENCH_TRIG_USERMODE(uprobe_nop5, nop5, "uprobe-nop5");
-BENCH_TRIG_USERMODE(uretprobe_nop5, nop5, "uretprobe-nop5");
-BENCH_TRIG_USERMODE(uprobe_multi_nop5, nop5, "uprobe-multi-nop5");
-BENCH_TRIG_USERMODE(uretprobe_multi_nop5, nop5, "uretprobe-multi-nop5");
+BENCH_TRIG_USERMODE(uprobe_nop10, nop10, "uprobe-nop10");
+BENCH_TRIG_USERMODE(uretprobe_nop10, nop10, "uretprobe-nop10");
+BENCH_TRIG_USERMODE(uprobe_multi_nop10, nop10, "uprobe-multi-nop10");
+BENCH_TRIG_USERMODE(uretprobe_multi_nop10, nop10, "uretprobe-multi-nop10");
BENCH_TRIG_USERMODE(usdt_nop, usdt_nop, "usdt-nop");
-BENCH_TRIG_USERMODE(usdt_nop5, usdt_nop5, "usdt-nop5");
+BENCH_TRIG_USERMODE(usdt_nop10, usdt_nop10, "usdt-nop10");
#endif
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
index 9ec59423b949..e490b337e960 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
@@ -2,7 +2,7 @@
set -eufo pipefail
-for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop5} usdt-nop usdt-nop5
+for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop10} usdt-nop usdt-nop10
do
summary=$(sudo ./bench -w2 -d5 -a trig-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
printf "%-15s: %s\n" $i "$summary"
--
2.53.0
^ permalink raw reply related
* [PATCH 4/7] selftests/bpf: Change uprobe syscall tests to use nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
Optimized uprobes are now on top of 10-bytes nop instructions,
reflect that in existing tests.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/benchs/bench_trigger.c | 2 +-
.../selftests/bpf/prog_tests/uprobe_syscall.c | 29 ++++++++++---------
tools/testing/selftests/bpf/prog_tests/usdt.c | 25 +++++++++-------
tools/testing/selftests/bpf/usdt_2.c | 2 +-
4 files changed, 33 insertions(+), 25 deletions(-)
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 2f22ec61667b..bcc4820c802e 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -398,7 +398,7 @@ static void *uprobe_producer_ret(void *input)
#ifdef __x86_64__
__nocf_check __weak void uprobe_target_nop5(void)
{
- asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
+ asm volatile (".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
}
static void *uprobe_producer_nop5(void *input)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 955a37751b52..c2e9e549c737 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -17,7 +17,7 @@
#include "uprobe_syscall_executed.skel.h"
#include "bpf/libbpf_internal.h"
-#define USDT_NOP .byte 0x0f, 0x1f, 0x44, 0x00, 0x00
+#define USDT_NOP .byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
#include "usdt.h"
#pragma GCC diagnostic ignored "-Wattributes"
@@ -26,7 +26,7 @@ __attribute__((aligned(16)))
__nocf_check __weak __naked unsigned long uprobe_regs_trigger(void)
{
asm volatile (
- ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n" /* nop5 */
+ ".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
"movq $0xdeadbeef, %rax\n"
"ret\n"
);
@@ -345,9 +345,9 @@ static void test_uretprobe_syscall_call(void)
__attribute__((aligned(16)))
__nocf_check __weak __naked void uprobe_test(void)
{
- asm volatile (" \n"
- ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00 \n"
- "ret \n"
+ asm volatile (
+ ".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
+ "ret\n"
);
}
@@ -388,14 +388,16 @@ static int find_uprobes_trampoline(void *tramp_addr)
return ret;
}
-static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char jmp2B[2] = { 0xeb, 8 };
+static unsigned char nop10[10] = { 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
+static unsigned char lea_rsp[5] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
-static void *find_nop5(void *fn)
+static void *find_nop10(void *fn)
{
int i;
- for (i = 0; i < 10; i++) {
- if (!memcmp(nop5, fn + i, 5))
+ for (i = 0; i < 128; i++) {
+ if (!memcmp(nop10, fn + i, 9))
return fn + i;
}
return NULL;
@@ -420,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
ASSERT_EQ(skel->bss->executed, executed, "executed");
/* .. and check the trampoline is as expected. */
- call = (struct __arch_relative_insn *) addr;
+ ASSERT_OK(memcmp(addr, lea_rsp, 4), "lea_rsp");
+ call = (struct __arch_relative_insn *)(addr + 5);
tramp = (void *) (call + 1) + call->raddr;
ASSERT_EQ(call->op, 0xe8, "call");
ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
@@ -432,7 +435,7 @@ static void check_detach(void *addr, void *tramp)
{
/* [uprobes_trampoline] stays after detach */
ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
- ASSERT_OK(memcmp(addr, nop5, 5), "nop5");
+ ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B");
}
static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
@@ -568,8 +571,8 @@ static void test_uprobe_usdt(void)
void *addr;
errno = 0;
- addr = find_nop5(usdt_test);
- if (!ASSERT_OK_PTR(addr, "find_nop5"))
+ addr = find_nop10(usdt_test);
+ if (!ASSERT_OK_PTR(addr, "find_nop10"))
return;
skel = uprobe_syscall_executed__open_and_load();
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 69759b27794d..be34c4087ff5 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -252,7 +252,7 @@ extern void usdt_1(void);
extern void usdt_2(void);
static unsigned char nop1[1] = { 0x90 };
-static unsigned char nop1_nop5_combo[6] = { 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char nop1_nop10_combo[11] = { 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
static void *find_instr(void *fn, unsigned char *instr, size_t cnt)
{
@@ -271,17 +271,17 @@ static void subtest_optimized_attach(void)
__u8 *addr_1, *addr_2;
/* usdt_1 USDT probe has single nop instruction */
- addr_1 = find_instr(usdt_1, nop1_nop5_combo, 6);
- if (!ASSERT_NULL(addr_1, "usdt_1_find_nop1_nop5_combo"))
+ addr_1 = find_instr(usdt_1, nop1_nop10_combo, 6);
+ if (!ASSERT_NULL(addr_1, "usdt_1_find_nop1_nop10_combo"))
return;
addr_1 = find_instr(usdt_1, nop1, 1);
if (!ASSERT_OK_PTR(addr_1, "usdt_1_find_nop1"))
return;
- /* usdt_2 USDT probe has nop,nop5 instructions combo */
- addr_2 = find_instr(usdt_2, nop1_nop5_combo, 6);
- if (!ASSERT_OK_PTR(addr_2, "usdt_2_find_nop1_nop5_combo"))
+ /* usdt_2 USDT probe has nop,nop10 instructions combo */
+ addr_2 = find_instr(usdt_2, nop1_nop10_combo, 6);
+ if (!ASSERT_OK_PTR(addr_2, "usdt_2_find_nop1_nop10_combo"))
return;
skel = test_usdt__open_and_load();
@@ -309,12 +309,12 @@ static void subtest_optimized_attach(void)
bpf_link__destroy(skel->links.usdt_executed);
- /* we expect the nop5 ip */
+ /* we expect the nop10 ip */
skel->bss->expected_ip = (unsigned long) addr_2 + 1;
/*
* Attach program on top of usdt_2 which is probe defined on top
- * of nop1,nop5 combo, so the probe gets optimized on top of nop5.
+ * of nop1,nop10 combo, so the probe gets optimized on top of nop10.
*/
skel->links.usdt_executed = bpf_program__attach_usdt(skel->progs.usdt_executed,
0 /*self*/, "/proc/self/exe",
@@ -328,8 +328,13 @@ static void subtest_optimized_attach(void)
/* nop stays on addr_2 address */
ASSERT_EQ(*addr_2, 0x90, "nop");
- /* call is on addr_2 + 1 address */
- ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
+ /*
+ * lea -0x80(%rsp), %rsp
+ * call ...
+ */
+ static unsigned char expected[] = { 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8 };
+
+ ASSERT_MEMEQ(addr_2 + 1, expected, sizeof(expected), "lea_and_call");
ASSERT_EQ(skel->bss->executed, 4, "executed");
cleanup:
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index 789883aaca4c..b359b389f6c0 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -3,7 +3,7 @@
#if defined(__x86_64__)
/*
- * Include usdt.h with default nop,nop5 instructions combo.
+ * Include usdt.h with default nop,nop10 instructions combo.
*/
#include "usdt.h"
--
2.53.0
^ permalink raw reply related
* [PATCH 3/7] selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
Syncing latest usdt.h change [1].
Now that we have nop10 optimization support in kernel, let's emit
nop,nop10 for usdt probe. We leave it up to the library to use
desirable nop instruction.
[1] TBD
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/usdt.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/usdt.h b/tools/testing/selftests/bpf/usdt.h
index c71e21df38b3..d359663b9c32 100644
--- a/tools/testing/selftests/bpf/usdt.h
+++ b/tools/testing/selftests/bpf/usdt.h
@@ -313,7 +313,7 @@ struct usdt_sema { volatile unsigned short active; };
#if defined(__ia64__) || defined(__s390__) || defined(__s390x__)
#define USDT_NOP nop 0
#elif defined(__x86_64__)
-#define USDT_NOP .byte 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x0 /* nop, nop5 */
+#define USDT_NOP .byte 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 /* nop, nop10 */
#else
#define USDT_NOP nop
#endif
--
2.53.0
^ permalink raw reply related
* [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
fixing has_nop_combo to reflect that.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/lib/bpf/usdt.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
index e3710933fd52..7e62e4d5bedd 100644
--- a/tools/lib/bpf/usdt.c
+++ b/tools/lib/bpf/usdt.c
@@ -305,7 +305,7 @@ struct usdt_manager *usdt_manager_new(struct bpf_object *obj)
/*
* Detect kernel support for uprobe() syscall, it's presence means we can
- * take advantage of faster nop5 uprobe handling.
+ * take advantage of faster nop10 uprobe handling.
* Added in: 56101b69c919 ("uprobes/x86: Add uprobe syscall to speed up uprobe")
*/
man->has_uprobe_syscall = kernel_supports(obj, FEAT_UPROBE_SYSCALL);
@@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
#if defined(__x86_64__)
static bool has_nop_combo(int fd, long off)
{
- unsigned char nop_combo[6] = {
- 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
+ unsigned char nop_combo[11] = {
+ 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
};
- unsigned char buf[6];
+ unsigned char buf[11];
- if (pread(fd, buf, 6, off) != 6)
+ if (pread(fd, buf, 11, off) != 11)
return false;
- return memcmp(buf, nop_combo, 6) == 0;
+ return memcmp(buf, nop_combo, 11) == 0;
}
#else
static bool has_nop_combo(int fd, long off)
@@ -814,8 +814,8 @@ static int collect_usdt_targets(struct usdt_manager *man, struct elf_fd *elf_fd,
memset(target, 0, sizeof(*target));
/*
- * We have uprobe syscall and usdt with nop,nop5 instructions combo,
- * so we can place the uprobe directly on nop5 (+1) and get this probe
+ * We have uprobe syscall and usdt with nop,nop10 instructions combo,
+ * so we can place the uprobe directly on nop10 (+1) and get this probe
* optimized.
*/
if (man->has_uprobe_syscall && has_nop_combo(elf_fd->fd, usdt_rel_ip)) {
--
2.53.0
^ permalink raw reply related
* [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>
Andrii reported an issue with optimized uprobes [1] that can clobber
redzone area with call instruction storing return address on stack
where user code may keep temporary data without adjusting rsp.
Fixing this by moving the optimized uprobes on top of 10-bytes nop
instruction, so we can squeeze another instruction to escape the
redzone area before doing the call, like:
lea -0x80(%rsp), %rsp
call tramp
Note the lea instruction is used to adjust the rsp register without
changing the flags.
The optimized uprobe performance stays the same:
uprobe-nop : 3.129 ± 0.013M/s
uprobe-push : 3.045 ± 0.006M/s
uprobe-ret : 1.095 ± 0.004M/s
--> uprobe-nop10 : 7.170 ± 0.020M/s
uretprobe-nop : 2.143 ± 0.021M/s
uretprobe-push : 2.090 ± 0.000M/s
uretprobe-ret : 0.942 ± 0.000M/s
--> uretprobe-nop10: 3.381 ± 0.003M/s
usdt-nop : 3.245 ± 0.004M/s
--> usdt-nop10 : 7.256 ± 0.023M/s
[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
1 file changed, 86 insertions(+), 35 deletions(-)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..f7c4101a4039 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,21 @@ struct uprobe_trampoline {
unsigned long vaddr;
};
+#define LEA_INSN_SIZE 5
+#define OPT_INSN_SIZE (LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define OPT_JMP8_OFFSET (OPT_INSN_SIZE - JMP8_INSN_SIZE)
+#define REDZONE_SIZE 0x80
+
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+ return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
{
- long delta = (long)(vaddr + 5 - vtramp);
+ long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
return delta >= INT_MIN && delta <= INT_MAX;
}
@@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
};
unsigned long low_limit, high_limit;
unsigned long low_tramp, high_tramp;
- unsigned long call_end = vaddr + 5;
+ unsigned long call_end = vaddr + OPT_INSN_SIZE;
if (check_add_overflow(call_end, INT_MIN, &low_limit))
low_limit = PAGE_SIZE;
@@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)
regs->ax = args.ax;
regs->r11 = args.r11;
regs->cx = args.cx;
- regs->ip = args.retaddr - 5;
- regs->sp += sizeof(args);
+ regs->ip = args.retaddr - OPT_INSN_SIZE;
+ regs->sp += sizeof(args) + REDZONE_SIZE;
regs->orig_ax = -1;
sp = regs->sp;
@@ -844,12 +856,12 @@ SYSCALL_DEFINE0(uprobe)
*/
if (regs->sp != sp) {
/* skip the trampoline call */
- if (args.retaddr - 5 == regs->ip)
- regs->ip += 5;
+ if (args.retaddr - OPT_INSN_SIZE == regs->ip)
+ regs->ip += OPT_INSN_SIZE;
return regs->ax;
}
- regs->sp -= sizeof(args);
+ regs->sp -= sizeof(args) + REDZONE_SIZE;
/* for the case uprobe_consumer has changed ax/r11/cx */
args.ax = regs->ax;
@@ -857,7 +869,7 @@ SYSCALL_DEFINE0(uprobe)
args.cx = regs->cx;
/* keep return address unless we are instructed otherwise */
- if (args.retaddr - 5 != regs->ip)
+ if (args.retaddr - OPT_INSN_SIZE != regs->ip)
args.retaddr = regs->ip;
if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +903,7 @@ asm (
"pop %rax\n"
"pop %r11\n"
"pop %rcx\n"
- "ret\n"
+ "ret $" __stringify(REDZONE_SIZE) "\n"
"int3\n"
".balign " __stringify(PAGE_SIZE) "\n"
".popsection\n"
@@ -909,7 +921,7 @@ late_initcall(arch_uprobes_init);
enum {
EXPECT_SWBP,
- EXPECT_CALL,
+ EXPECT_OPTIMIZED,
};
struct write_opcode_ctx {
@@ -930,17 +942,18 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
int nbytes, void *data)
{
struct write_opcode_ctx *ctx = data;
- uprobe_opcode_t old_opcode[5];
+ uprobe_opcode_t old_opcode[OPT_INSN_SIZE];
- uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+ uprobe_copy_from_page(page, ctx->base, old_opcode, OPT_INSN_SIZE);
switch (ctx->expect) {
case EXPECT_SWBP:
if (is_swbp_insn(&old_opcode[0]))
return 1;
break;
- case EXPECT_CALL:
- if (is_call_insn(&old_opcode[0]))
+ case EXPECT_OPTIMIZED:
+ if (is_lea_insn(&old_opcode[0]) &&
+ is_call_insn(&old_opcode[LEA_INSN_SIZE]))
return 1;
break;
}
@@ -963,7 +976,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
* - SMP sync all CPUs
*/
static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- unsigned long vaddr, char *insn, bool optimize)
+ unsigned long vaddr, char *insn, int size, bool optimize)
{
uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
struct write_opcode_ctx ctx = {
@@ -978,7 +991,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
* so we can skip this step for optimize == true.
*/
if (!optimize) {
- ctx.expect = EXPECT_CALL;
+ ctx.expect = EXPECT_OPTIMIZED;
err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
true /* is_register */, false /* do_update_ref_ctr */,
&ctx);
@@ -990,7 +1003,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Write all but the first byte of the patched range. */
ctx.expect = EXPECT_SWBP;
- err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+ err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
true /* is_register */, false /* do_update_ref_ctr */,
&ctx);
if (err)
@@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr, unsigned long tramp)
{
- u8 call[5];
+ u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE];
- __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
+ /*
+ * We have nop10 instruction (with first byte overwritten to int3),
+ * changing it to:
+ * lea -0x80(%rsp), %rsp
+ * call tramp
+ */
+ memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+ __text_gen_insn(call, CALL_INSN_OPCODE,
+ (const void *) (vaddr + LEA_INSN_SIZE),
(const void *) tramp, CALL_INSN_SIZE);
- return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+ return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* optimize */);
}
static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
- return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+ /*
+ * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
+ * end of the 10-byte slot instead of restoring the original nop10,
+ * because we could have thread already inside lea instruction.
+ */
+ u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET };
+
+ return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* optimize */);
}
static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1077,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
struct __packed __arch_relative_insn {
u8 op;
s32 raddr;
- } *call = (struct __arch_relative_insn *) insn;
+ } *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
- if (!is_call_insn(insn))
+ if (!is_lea_insn(insn))
+ return false;
+ if (!is_call_insn(insn + LEA_INSN_SIZE))
return false;
- return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+ return __in_uprobe_trampoline(vaddr + OPT_INSN_SIZE + call->raddr);
}
static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
{
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[OPT_INSN_SIZE];
int err;
- err = copy_from_vaddr(mm, vaddr, &insn, 5);
+ err = copy_from_vaddr(mm, vaddr, &insn, OPT_INSN_SIZE);
if (err)
return err;
return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1095,14 +1125,25 @@ int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
- int ret = is_optimized(vma->vm_mm, vaddr);
- if (ret < 0)
+ uprobe_opcode_t insn[OPT_INSN_SIZE];
+ int ret;
+
+ ret = copy_from_vaddr(vma->vm_mm, vaddr, &insn, OPT_INSN_SIZE);
+ if (ret)
return ret;
- if (ret) {
+ if (__is_optimized((uprobe_opcode_t *)&insn, vaddr)) {
ret = swbp_unoptimize(auprobe, vma, vaddr);
WARN_ON_ONCE(ret);
return ret;
}
+ /*
+ * We can have re-attached probe on top of jmp8 instruction,
+ * which did not get optimized. We need to restore the jmp8
+ * instruction, instead of the original instruction (nop10).
+ */
+ if (is_swbp_insn(&insn[0]) && insn[1] == OPT_JMP8_OFFSET)
+ return uprobe_write_opcode(auprobe, vma, vaddr, JMP8_INSN_OPCODE,
+ false /* is_register */);
}
return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
false /* is_register */);
@@ -1131,7 +1172,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
{
struct mm_struct *mm = current->mm;
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[OPT_INSN_SIZE];
if (!should_optimize(auprobe))
return;
@@ -1142,7 +1183,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
* Check if some other thread already optimized the uprobe for us,
* if it's the case just go away silently.
*/
- if (copy_from_vaddr(mm, vaddr, &insn, 5))
+ if (copy_from_vaddr(mm, vaddr, &insn, OPT_INSN_SIZE))
goto unlock;
if (!is_swbp_insn((uprobe_opcode_t*) &insn))
goto unlock;
@@ -1160,14 +1201,24 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
static bool can_optimize(struct insn *insn, unsigned long vaddr)
{
- if (!insn->x86_64 || insn->length != 5)
+ if (!insn->x86_64)
return false;
- if (!insn_is_nop(insn))
+ /* We can't do cross page atomic writes yet. */
+ if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < OPT_INSN_SIZE)
return false;
- /* We can't do cross page atomic writes yet. */
- return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+ /* We can optimize on top of nop10.. */
+ if (insn->length == OPT_INSN_SIZE && insn_is_nop(insn))
+ return true;
+
+ /* .. and JMP rel8 to end of slot — check swbp_unoptimize. */
+ if (insn->length == 2 &&
+ insn->opcode.bytes[0] == JMP8_INSN_OPCODE &&
+ insn->immediate.value == OPT_JMP8_OFFSET)
+ return true;
+
+ return false;
}
#else /* 32-bit: */
/*
--
2.53.0
^ permalink raw reply related
* [PATCH 0/7] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
hi,
Andrii reported an issue with optimized uprobes [1] that can clobber
redzone area with call instruction storing return address on stack
where user code may keep temporary data without adjusting rsp.
Fixing this by moving the optimized uprobes on top of 10-bytes nop
instruction, so we can squeeze another instruction to escape the
redzone area before doing the call.
Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
if we decide to take this change.
thanks,
jirka
[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
---
Andrii Nakryiko (1):
selftests/bpf: Add tests for uprobe nop10 red zone clobbering
Jiri Olsa (6):
uprobes/x86: Move optimized uprobe from nop5 to nop10
libbpf: Change has_nop_combo to work on top of nop10
selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
selftests/bpf: Change uprobe syscall tests to use nop10
selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
selftests/bpf: Add reattach tests for uprobe syscall
arch/x86/kernel/uprobes.c | 121 ++++++++++++++++++++++++++++------------
tools/lib/bpf/usdt.c | 16 +++---
tools/testing/selftests/bpf/bench.c | 20 +++----
tools/testing/selftests/bpf/benchs/bench_trigger.c | 38 ++++++-------
tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
tools/testing/selftests/bpf/prog_tests/usdt.c | 74 +++++++++++++++++++++----
tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++++++
tools/testing/selftests/bpf/usdt.h | 2 +-
tools/testing/selftests/bpf/usdt_2.c | 15 ++++-
10 files changed, 423 insertions(+), 107 deletions(-)
^ permalink raw reply
* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-14 13:28 UTC (permalink / raw)
To: leitao
Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-2-be2e578e61da@debian.org>
On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
>get_any_page() collapses three different failure modes into a single
>-EIO return:
>
> * the put_page race in the !count_increased path;
> * the HWPoisonHandlable() rejection that bounces out of
> __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
> * the HWPoisonHandlable() rejection that goes through the
> count_increased / put_page / shake_page retry loop.
>
>The first is transient (the page is racing with the allocator). The
>second can be either transient (a userspace folio briefly off LRU
>during migration/compaction) or stable (slab/vmalloc/page-table/
>kernel-stack pages). The third describes a stable kernel-owned page
>that the count_increased=true caller already held a reference on.
>
>Distinguish them on the return path: keep -EIO for both the put_page
>race and the -EBUSY-after-retries branch (shake_page() cannot drag a
>folio back from active migration, so we cannot prove the page is
>permanently kernel-owned from there), keep -EBUSY for the allocation
>race (unchanged), and return -ENOTRECOVERABLE only from the
>count_increased-true HWPoisonHandlable() rejection that exhausts its
>retries -- the caller's reference is structural evidence that the
>page is owned by the kernel.
>
>Extend the unhandlable-page pr_err() to fire for either errno and
>update the get_hwpoison_page() kerneldoc.
>
>memory_failure() still folds every negative return into
>MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>this patch is a no-op for users of memory_failure() and only changes
>the errno that soft_offline_page() can propagate to its callers. A
>follow-up wires the new return code through memory_failure() and
>reports MF_MSG_KERNEL for the unrecoverable cases.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---
> mm/memory-failure.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
>diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>index 49bcfbd04d213..bae883df3ccb2 100644
>--- a/mm/memory-failure.c
>+++ b/mm/memory-failure.c
>@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> shake_page(p);
> goto try_again;
> }
>+ /*
>+ * Return -EIO rather than -ENOTRECOVERABLE: this
>+ * branch is also reached for pages that are merely
>+ * off-LRU transiently (e.g. a folio in the middle
>+ * of migration or compaction), which shake_page()
>+ * cannot drag back. The caller cannot prove the
>+ * page is permanently kernel-owned from here, so
>+ * keep it on the recoverable errno.
>+ */
> ret = -EIO;
> goto out;
> }
>@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> goto try_again;
> }
> put_page(p);
>- ret = -EIO;
>+ ret = -ENOTRECOVERABLE;
> }
> out:
>- if (ret == -EIO)
>+ if (ret == -EIO || ret == -ENOTRECOVERABLE)
> pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
>
> return ret;
>@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
> * -EIO for pages on which we can not handle memory errors,
> * -EBUSY when get_hwpoison_page() has raced with page lifecycle
> * operations like allocation and free,
>- * -EHWPOISON when the page is hwpoisoned and taken off from buddy.
>+ * -EHWPOISON when the page is hwpoisoned and taken off from buddy,
>+ * -ENOTRECOVERABLE for stable kernel-owned pages the handler
>+ * cannot recover (PG_reserved, slab, vmalloc, page tables,
>+ * kernel stacks, and similar non-LRU/non-buddy pages).
Did you test this patch series? I don't see how we ever get to
-ENOTRECOVERABLE there ...
Even with MF_COUNT_INCREASED, the first pass does:
if (flags & MF_COUNT_INCREASED)
count_increased = true;
[...]
if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
ret = 1;
} else {
if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
put_page(p);
shake_page(p);
count_increased = false;
goto try_again; <-
}
put_page(p);
ret = -ENOTRECOVERABLE;
}
Then we come back with count_increased=false:
try_again:
if (!count_increased) {
ret = __get_hwpoison_page(p, flags); <-
if (!ret) {
[...]
} else if (ret == -EBUSY) { <-
[...]
ret = -EIO;
goto out; <-
}
}
For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
if (!HWPoisonHandlable(&folio->page, flags))
return -EBUSY;
so they still seem to end up as -EIO ... Am I missing something?
> */
> static int get_hwpoison_page(struct page *p, unsigned long flags)
> {
>
>--
>2.53.0-Meta
>
>
^ permalink raw reply
* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-05-14 12:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
virtualization, linux-arch, linux-mm, linux-trace-kernel,
kernel-team, Paul E. McKenney
In-Reply-To: <20260513193342.GB2545104@noisy.programming.kicks-ass.net>
On Wed, May 13, 2026 at 09:33:42PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 05:09:34PM +0000, Dmitry Ilvokhin wrote:
> > Use the arch-overridable queued_spin_release(), introduced in the
> > previous commit, to ensure the tracepoint works correctly across all
> > architectures, including those with custom unlock implementations (e.g.
> > x86 paravirt).
> >
> > When the tracepoint is disabled, the only addition to the hot path is a
> > single NOP instruction (the static branch). When enabled, the contention
> > check, trace call, and unlock are combined in an out-of-line function to
> > minimize hot path impact, avoiding the compiler needing to preserve the
> > lock pointer in a callee-saved register across the trace call.
> >
> > Binary size impact (x86_64, defconfig):
> > uninlined unlock (common case): +680 bytes (+0.00%)
> > inlined unlock (worst case): +83659 bytes (+0.21%)
> >
> > The inlined unlock case could not be achieved through Kconfig options on
> > x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
> > x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
> > inline the unlock path and estimate the worst case binary size increase.
> >
> > In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
> > opted against binary size optimization, so the inlined worst case is
> > unlikely to be a concern.
>
> This is not quite accurate. You add the (5byte) NOP for the static
> branch, but then you also add another 5 bytes for the CALL and at least
> another 2 bytes (possibly 5) for a JMP back into the previous stream.
> That is 12-15 bytes added to what was a single MOV instruction.
>
> That is quite ludicrous.
Thanks for the feedback, Peter. This is exactly the kind of feedback I
was looking for.
I understand your concerns and initially I had exactly the same
thoughts, and after I looked into the generated code more carefully the
impact on the executed path is smaller than the total size increase
suggests.
Generated code of _raw_spin_unlock() for baseline (before the patch) is
31 bytes in total (x86_64, defconfig, GCC 11).
3e0: endbr64 ; 4 bytes
3e4: movb $0x0,(%rdi) ; 3 bytes (unlock)
3e7: decl %gs:__preempt_count ; 7 bytes
3ee: je 3f5 ; 2 bytes
3f0: jmp __x86_return_thunk ; 5 bytes
3f5: call __SCT__preempt_schedule ; 5 bytes
3fa: jmp __x86_return_thunk ; 5 bytes
Generated code of _raw_spin_unlock() with tracepoint (after the patch
applied) is 40 bytes in total.
bc0: endbr64 ; 4 bytes
bc4: xchg %ax,%ax ; 2 bytes (NOP, static branch)
bc6: movb $0x0,(%rdi) ; 3 bytes (unlock)
bc9: decl %gs:__preempt_count ; 7 bytes
bd0: je bde ; 2 bytes
bd2: jmp __x86_return_thunk ; 5 bytes
bd7: call queued_spin_release_traced ; 5 bytes
bdc: jmp bc9 ; 2 bytes
bde: call __SCT__preempt_schedule ; 5 bytes
be3: jmp __x86_return_thunk ; 5 bytes
It is 40 bytes (+9 bytes compared to baseline, 2 bytes for NOP and 7
bytes for CALL and JMP).
But if we look at the executed path the picture is a bit different.
Baseline, in best case scenario of least number of executed
instructions.
3e0: endbr64 ; 4 bytes (always executed)
3e4: movb $0x0,(%rdi) ; 3 bytes (unlock,
; always executed)
3e7: decl %gs:__preempt_count ; 7 bytes (always executed)
3ee: je 3f5 ; 2 bytes (always executed)
3f0: jmp __x86_return_thunk ; 5 bytes (executed if above
; je is not taken)
; rest is not executed
3f5: call __SCT__preempt_schedule ; 5 bytes
3fa: jmp __x86_return_thunk ; 5 bytes
Tracepoint (again same case of least number of executed instructions).
bc0: endbr64 ; 4 bytes (always executed)
bc4: xchg %ax,%ax ; 2 bytes (always executed, this is an
; only addition on the execution path).
bc6: movb $0x0,(%rdi) ; 3 bytes (unlock, always executed)
bc9: decl %gs:__preempt_count ; 7 bytes (always executed)
bd0: je bde ; 2 bytes (always executed)
bd2: jmp __x86_return_thunk ; 5 bytes (executed if above
; je is not taken)
; rest is not executed
bd7: call queued_spin_release_traced ; 5 bytes
bdc: jmp bc9 ; 2 bytes
bde: call __SCT__preempt_schedule ; 5 bytes
be3: jmp __x86_return_thunk ; 5 bytes
On the execution path we are getting 21 byte worth of instructions on
baseline against 23 bytes. The only addition on any executed path is the
2-byte NOP, that has a special treatment in CPU, cheap, but not entirely
free.
From a total size perspective it's 9 bytes, but on the executed path it's
a single 2-byte NOP.
Does this change the picture for you, or is the NOP still a concern for
this path?
>
> I disagree that UNINLINE_SPIN_UNLOCK=n opts against binary size. For x86
> the unlock is smaller than a function call.
>
Fair point on the UNINLINE_SPIN_UNLOCK characterization, but
UNINLINE_SPIN_UNLOCKis always "y" on x86_64. The inlined case only
applies to s390 (unconditionally), csky and loongarch (when
!PREEMPTION). I'll remove this, thanks.
>
> I really don't see how this is worth it.
^ permalink raw reply
* Re: [RFC v7 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-05-14 11:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel
In-Reply-To: <20260513135741.12ddb97d@gandalf.local.home>
Hi Steven,
---- On Thu, 14 May 2026 01:57:41 +0800 Steven Rostedt <rostedt@goodmis.org> wrote ---
> On Mon, 11 May 2026 16:43:01 +0800
> Li Chen <me@linux.beauty> wrote:
>
> > @@ -1346,8 +1383,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
> > }
> > ext4_fc_unlock(sb, alloc_ctx);
> >
> > - ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
> > + ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
> > + &snap_inodes, &snap_ranges, &snap_err);
> > jbd2_journal_unlock_updates(journal);
> > + if (trace_ext4_fc_lock_updates_enabled()) {
> > + locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
> > + trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
> > + snap_inodes, snap_ranges, ret,
> > + snap_err);
>
> Please change this to:
>
> trace_call__ext4_fc_lock_updates(...)
>
> As the "trace_ext4_fc_lock_updates_enabled()" already has the static
> branch. No need to do it twice anymore. 7.1 introduced the
> "trace_call__foo()" that will do a direct call to the tracepoints
> registered, without the need for another static branch.
Thanks, will do it.
Regards,
Li
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox