Linux Trace Kernel
* [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-14  3:49 UTC
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Comprehensive documentation covering design, usage, tracefs
  interface, binary format, and performance characteristics.

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Basic functional selftest that verifies:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero hits
  - reset clears entries

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Supports offline symbol resolution via addr2line, JSON output,
  and top-N filtering by ref_count.

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 111 ++++++++++++++++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    |  74 +++++++++++
 tools/tracing/stackmap_dump.py                | 120 ++++++++++++++++++
 3 files changed, 305 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8f6410d4258c
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks (default: 14, range: 10-20)
+
+Usage
+=====
+
+Enable stack deduplication::
+
+    echo 1 > /sys/kernel/debug/tracing/options/stackmap
+    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+    echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+    sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+    cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+    stack_id 42 [ref 1337, depth 8]
+      [0] schedule+0x48/0xc0
+      [1] schedule_timeout+0x1c/0x30
+      ...
+
+To view statistics::
+
+    cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+    entries:    2500
+    table_size: 5000
+    hits:       148923
+    drops:      0
+    hit_rate:   98%
+
+To reset the stack map::
+
+    echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Tracefs Nodes
+=============
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries.
+
+``stack_map_stat``
+    Statistics: entry count, hits, drops, and hit rate.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    Magic: ``0x464D5342`` (ASCII 'FMSB'), Version: 2
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table
+- **Per-instance**: Each trace_array has its own stackmap, supporting
+  multiple ftrace instances
+- **Hash**: 32-bit jhash of stack IPs; full ``memcmp`` confirms matches
+
+Performance
+===========
+
+Typical results on an ARM64 Android device (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Hit rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100755
index 000000000000..3b0a7f60769f
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,74 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify stack_map has entries
+# 4. Verify stack_map_stat shows hits
+# 5. Verify trace contains <stack_id> events
+# 6. Verify reset works
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat has entries
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+# Check hits > 0
+hits=$(cat stack_map_stat | grep "^hits:" | awk '{print $2}')
+if [ "$hits" -eq 0 ]; then
+    fail "stackmap has zero hits"
+fi
+
+# Record drops for the summary line (not enforced; pool should be large enough for 1s trace)
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Check trace has stack_id events
+count=$(cat trace | grep -c "stack_id" || true)
+if [ "$count" -eq 0 ]; then
+    fail "trace has no <stack_id> events"
+fi
+
+# Test reset
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $hits hits, $drops drops"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..91ce80c681ea
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # ASCII 'FMSB'
+HEADER_FMT = '<IIII'  # magic, version, nr_stacks, reserved
+ENTRY_FMT = '<IIII'   # stack_id, nr, ref_count, reserved
+HEADER_SIZE = struct.calcsize(HEADER_FMT)
+ENTRY_SIZE = struct.calcsize(ENTRY_FMT)
+
+
+def addr2line(vmlinux, addr):
+    """Resolve address to symbol using addr2line."""
+    try:
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux, hex(addr)],
+            capture_output=True, text=True, timeout=5
+        )
+        lines = result.stdout.strip().split('\n')
+        if len(lines) >= 1 and lines[0] != '??':
+            return lines[0]
+    except (subprocess.TimeoutExpired, FileNotFoundError):
+        pass
+    return None
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    magic, version, nr_stacks, _ = struct.unpack_from(HEADER_FMT, data, 0)
+    if magic != MAGIC:
+        raise ValueError(f"Bad magic: 0x{magic:08x}, expected 0x{MAGIC:08x}")
+    if version not in (1, 2):
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(ENTRY_FMT, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'<{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [addr2line(args.vmlinux, ip) or f'0x{ip:x}'
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = ''
+                if args.vmlinux:
+                    resolved = addr2line(args.vmlinux, ip)
+                    if resolved:
+                        sym = f' {resolved}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1
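
The stack_map_bin layout described in the documentation above (a 16-byte
header followed by one record per stack) can be exercised without a
device by packing a synthetic blob and parsing it back. The following is
only a minimal sketch under the assumptions the patch itself makes
(little-endian fields, u64 instruction pointers). The helper names
build_blob() and parse_blob() and the sample addresses are hypothetical
and are not part of the patch.

```python
import struct

MAGIC = 0x464D5342      # value taken from the patch
HEADER_FMT = '<IIII'    # magic, version, nr_stacks, reserved
ENTRY_FMT = '<IIII'     # stack_id, nr, ref_count, reserved


def build_blob(stacks, version=2):
    """Pack [(stack_id, ref_count, [ips])] into the documented layout."""
    blob = struct.pack(HEADER_FMT, MAGIC, version, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        blob += struct.pack(ENTRY_FMT, stack_id, len(ips), ref_count, 0)
        blob += struct.pack(f'<{len(ips)}Q', *ips)
    return blob


def parse_blob(data):
    """Inverse of build_blob(): return [(stack_id, ref_count, [ips])]."""
    magic, _version, nr_stacks, _reserved = struct.unpack_from(HEADER_FMT, data, 0)
    if magic != MAGIC:
        raise ValueError(f'bad magic 0x{magic:08x}')
    offset = struct.calcsize(HEADER_FMT)
    stacks = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _pad = struct.unpack_from(ENTRY_FMT, data, offset)
        offset += struct.calcsize(ENTRY_FMT)
        ips = list(struct.unpack_from(f'<{nr}Q', data, offset))
        offset += 8 * nr
        stacks.append((stack_id, ref_count, ips))
    return stacks


# Made-up sample data: two stacks with two and one frames respectively.
sample = [
    (42, 1337, [0xffffffc008123450, 0xffffffc008123abc]),
    (7, 1, [0xffffffc0080000f0]),
]
blob = build_blob(sample)
assert parse_blob(blob) == sample
```

A blob built this way could also be written to a file and fed to
stackmap_dump.py as a quick sanity check of the parser, assuming the
kernel side emits the same layout.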



* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Masami Hiramatsu @ 2026-05-14  4:19 UTC
  To: Steven Rostedt
  Cc: Chen Jun, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260513124017.770e3098@gandalf.local.home>

On Wed, 13 May 2026 12:40:17 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 12 May 2026 08:47:50 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > On Fri, 8 May 2026 20:26:23 +0800
> > Chen Jun <chenjun102@huawei.com> wrote:
> > 
> > > Low-level functions have many call paths, and sometimes
> > > we only care about the calls on a specific call path.
> > > Add a new filter to filter based on the call stack.
> > > 
> > > Usage:
> > > 1. echo 'caller=="$function_name"' > events/../filter  
> > 
> > Thanks for interesting idea :)
> > 
> > BTW, we already have "stacktrace". Since this actually checks
> > stacktrace, not caller, so I think we should reuse it.
> > Also, I think OP_GLOB is more suitable for this case.
> > (and more useful)
> 
> Actually, it's not a stack trace, it's a function that is called from other
> functions. But since "caller" sounds like a direct called function (stack
> trace of the first instance), I think perhaps it should be "called_within" or
> something similar. :-/

Yeah, what about "callers"?

> 
> Also, OP_GLOB can't work because it only works for a single function. At
> the time of parsing, it finds the function (and should probably error out
> if there's more than one function with a given name). It then records the
> start and end address of the function so it only needs to find if one of
> the entries in the stack trace is between the start and end of the function.

Ah, OK. It is just comparing addresses, not names.
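
The two-phase scheme Steven describes (resolve the name once at parse
time, then only compare addresses when the event fires) can be sketched
roughly as follows. This is a toy model, not the proposed kernel code:
the symbol table, the function names, and the half-open [start, end)
range convention are all assumptions made for illustration.

```python
# Hypothetical symbol table: name -> list of (start, end) address ranges.
SYMBOLS = {
    'vfs_read': [(0x1000, 0x1200)],
    'do_work': [(0x2000, 0x2080), (0x9000, 0x9040)],  # ambiguous name
}


def resolve_caller(name, symbols=SYMBOLS):
    """Parse-time step: map a function name to exactly one address range."""
    ranges = symbols.get(name, [])
    if len(ranges) != 1:
        raise ValueError(f'{name!r}: expected one match, found {len(ranges)}')
    return ranges[0]


def stack_matches(stack_ips, func_range):
    """Event-time step: address comparison only, no symbol lookup."""
    start, end = func_range
    return any(start <= ip < end for ip in stack_ips)


rng = resolve_caller('vfs_read')
assert stack_matches([0x5000, 0x1100], rng)
assert not stack_matches([0x5000, 0x1200], rng)
```

Because the event-time check needs one fixed range, a glob that may
expand to many functions does not fit this model without doing a search
on every event.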

> 
> I don't think this is possible with GLOB. We don't want to do a search of
> the functions when the event is triggered.

Agreed.

Thanks,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v2] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Masami Hiramatsu @ 2026-05-14  4:43 UTC
  To: Rik van Riel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260513161916.04151502@fangorn>

On Wed, 13 May 2026 16:19:16 -0400
Rik van Riel <riel@surriel.com> wrote:

> perf_ftrace_function_unregister() unconditionally calls
> unregister_ftrace_function() without checking whether the ftrace_ops
> was ever successfully registered. This triggers a WARN_ON in
> __unregister_ftrace_function() when the ops doesn't have
> FTRACE_OPS_FL_ENABLED set.
> 
> This can happen during perf_event_alloc() error cleanup when
> perf_trace_destroy() is called via __free_event() on an event whose
> ftrace_ops registration failed or was already torn down by
> perf_try_init_event()'s err_destroy path.
> 
> The call path is:
>   perf_event_alloc() error cleanup
>     -> __free_event()
>       -> event->destroy() [tp_perf_event_destroy]
>         -> perf_trace_destroy()
>           -> perf_trace_event_close()
>             -> TRACE_REG_PERF_CLOSE
>               -> perf_ftrace_function_unregister()
>                 -> unregister_ftrace_function()
>                   -> __unregister_ftrace_function()
>                     -> WARN_ON(!(ops->flags & FTRACE_OPS_FL_ENABLED))
> 
> Fix this by checking FTRACE_OPS_FL_ENABLED before attempting to
> unregister. If the ops is not enabled, just free the filter and
> return success.
> 
> Assisted-by: Claude:claude-opus-4.7 syzkaller
> Signed-off-by: Rik van Riel <riel@surriel.com>

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: ced39002f5ea ("ftrace, perf: Add support to use function tracepoint in perf")

Thanks,

> ---
>  kernel/trace/trace_event_perf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a6bb7577e8c5..58e1b427b576 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -497,7 +497,11 @@ static int perf_ftrace_function_register(struct perf_event *event)
>  static int perf_ftrace_function_unregister(struct perf_event *event)
>  {
>  	struct ftrace_ops *ops = &event->ftrace_ops;
> -	int ret = unregister_ftrace_function(ops);
> +	int ret = 0;
> +
> +	if (ops->flags & FTRACE_OPS_FL_ENABLED)
> +		ret = unregister_ftrace_function(ops);
> +
>  	ftrace_free_filter(ops);
>  	return ret;
>  }
> -- 
> 2.52.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH v2] rtla: Document tests in README
From: Tomas Glozar @ 2026-05-14  7:30 UTC
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

RTLA tests are not documented anywhere. Mention both runtime and unit
tests in the README, with instructions on how to run them and a list of
dependencies and required system configuration.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
v2: Add package hints for common distros for Test::Harness (suggested by
Crystal Wood).

v1: https://lore.kernel.org/linux-trace-kernel/20260423130759.882247-1-tglozar@redhat.com

 tools/tracing/rtla/README.txt | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/tracing/rtla/README.txt b/tools/tracing/rtla/README.txt
index a9faee4dbb3a..13b4a798b487 100644
--- a/tools/tracing/rtla/README.txt
+++ b/tools/tracing/rtla/README.txt
@@ -42,4 +42,34 @@ For development, we suggest the following steps for compiling rtla:
   $ make
   $ sudo make install
 
+Running tests
+
+RTLA has two test suites: a runtime test suite and a unit test suite.
+
+The runtime test suite is available as "make check" (root required) and has
+the following dependencies, in addition to RTLA build dependencies:
+
+- Perl
+- Test::Harness (libtest-harness-perl on Debian/Ubuntu, perl-Test-Harness on Fedora/RHEL)
+- bash
+- coreutils
+- ldd
+- util-linux
+- procps(-ng)
+- bpftool (if rtla is built against libbpf)
+
+as well as the following required system configuration:
+
+- CONFIG_OSNOISE_TRACER=y
+- CONFIG_TIMERLAT_TRACER=y
+- tracefs mounted and readable at /sys/kernel/tracing
+
+The unit test suite is available as "make unit-tests" and has the following
+dependencies:
+
+- libcheck
+
+Unlike the runtime test suite, root is not required to run unit tests, nor is
+a tracefs/osnoise/timerlat-capable kernel required.
+
 For further information, please refer to the rtla man page.
-- 
2.53.0



* Re: [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Lance Yang @ 2026-05-14  9:12 UTC
  To: leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-1-be2e578e61da@debian.org>


On Wed, May 13, 2026 at 08:39:32AM -0700, Breno Leitao wrote:
>The first entry of error_states[],
>
>	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
>
>is unreachable.  identify_page_state() has two callers, and neither
>one can dispatch a PG_reserved page to me_kernel():
>
>  * memory_failure() reaches identify_page_state() only after
>    get_hwpoison_page() returned 1.  get_any_page() reaches that
>    return only via __get_hwpoison_page(), which gates the refcount
>    on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
>    pages, so they fail with -EBUSY/-EIO long before
>    identify_page_state() runs.

HWPoisonHandlable() does not test PG_reserved directly; it only lets
LRU or free buddy pages through:

return PageLRU(page) || is_free_buddy_page(page);

So this really relies on PG_reserved not being combined with either of
those states. I would not expect that to happen, though.
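
To make the dependency concrete, here is a toy model of the check quoted
above. It mirrors only the single return statement cited in this mail;
the Page class and its fields are hypothetical stand-ins for the real
page flags.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for the few page states discussed here."""
    lru: bool = False
    free_buddy: bool = False
    reserved: bool = False


def hwpoison_handlable(page):
    # Mirrors: return PageLRU(page) || is_free_buddy_page(page);
    # Note that page.reserved is never consulted.
    return page.lru or page.free_buddy


# A reserved page that is neither LRU nor free-buddy is rejected...
assert not hwpoison_handlable(Page(reserved=True))
# ...but only because of the other two states, not because of PG_reserved.
assert hwpoison_handlable(Page(reserved=True, lru=True))
```

In this model the rejection of reserved pages is a side effect of the
allowlist, which is why the commit message should say so explicitly.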

>
>  * try_memory_failure_hugetlb() reaches identify_page_state() on
>    the MF_HUGETLB_IN_USED branch, but the page is necessarily a
>    hugetlb folio there.  The first table entry that matches a
>    hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
>    they dispatch to me_huge_page() before the (now-removed)
>    reserved entry would have matched, regardless of whether
>    PG_reserved happens to be set on the head page.

As David pointed out, hugetlb setup clears PG_reserved before setting
PG_head. See hugetlb_folio_init_vmemmap():

	__folio_clear_reserved(folio);
	__folio_set_head(folio);
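
A small first-match table model shows why the clearing matters. This is
only a sketch: the bit positions, the three-entry tables, and the
identify() helper are hypothetical. The ordering (reserved entry first,
then the head entry) follows the commit message, and the catch-all last
entry is an assumption.

```python
PG_RESERVED = 1 << 0  # hypothetical bit positions
PG_HEAD = 1 << 1

# (mask, res, handler); the reserved entry is first, as the commit message states.
OLD_TABLE = [
    (PG_RESERVED, PG_RESERVED, 'me_kernel'),
    (PG_HEAD, PG_HEAD, 'me_huge_page'),
    (0, 0, 'me_unknown'),  # assumed catch-all, always matches
]
NEW_TABLE = OLD_TABLE[1:]  # table with the reserved entry dropped


def identify(flags, table):
    """Return the handler of the first entry whose masked flags match."""
    for mask, res, handler in table:
        if (flags & mask) == res:
            return handler
    return None


# With PG_reserved cleared, a hugetlb head page reaches me_huge_page either way.
assert identify(PG_HEAD, OLD_TABLE) == 'me_huge_page'
assert identify(PG_HEAD, NEW_TABLE) == 'me_huge_page'
# If PG_reserved were still set, the old first entry would win instead.
assert identify(PG_HEAD | PG_RESERVED, OLD_TABLE) == 'me_kernel'
```

So in this model the hugetlb argument holds because PG_reserved is
cleared during init, not because of the table order alone.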

>
>me_kernel() never executes and the entry exists only to be matched
>against by code that cannot see it.

identify_page_state() is reached only when get_hwpoison_page()
returns 1, but a PG_reserved page would not get that far, IIUC :)

>
>Drop the entry, the me_kernel() helper, and the now-unused
>"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
>remains part of the tracepoint and pr_err() string tables, and
>follow-on work to classify unrecoverable kernel pages can reuse it
>without churning the user-visible enum.
>
>No functional change.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---

With David's comments addressed, feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>


* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Vincent Donnefort @ 2026-05-14  9:54 UTC
  To: Steven Rostedt
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Arnd Bergmann, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260513105939.3bbdc174@gandalf.local.home>

On Wed, May 13, 2026 at 10:59:39AM -0400, Steven Rostedt wrote:
> 
> Vincent,
> 
> Is this patch needed? That is, did it fall through the cracks?

Yes, I believe it is! 

Reviewed-by: Vincent Donnefort <vdonnefort@google.com>

> 
> -- Steve
> 
> On Mon, 23 Mar 2026 11:56:41 +0100
> Arnd Bergmann <arnd@kernel.org> wrote:
> 
> > From: Arnd Bergmann <arnd@arndb.de>
> > 
> > The now more verbose check found another symbol missing from the whitelist:
> > 
> > Unexpected symbols in kernel/trace/simple_ring_buffer.o:
> >          U __aeabi_unwind_cpp_pr1
> > 
> > Add this to the Makefile.
> > 
> > Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> >  kernel/trace/Makefile | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index d662c1a64cd5..aba6a25db17b 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -169,8 +169,8 @@ targets += undefsyms_base.o
> >  # because it is not linked into vmlinux.
> >  KASAN_SANITIZE_undefsyms_base.o := y
> >  
> > -UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> > -		      __msan simple_ring_buffer \
> > +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
> > +		      __x86_indirect_thunk __aeabi_unwind_cpp simple_ring_buffer \
> >  		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
> >  
> >  quiet_cmd_check_undefined = NM      $<
> 


* Re: [PATCH v7 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-05-14 10:55 UTC
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <5712adbc-b2fd-49fd-9827-cace47eff9ad@kernel.org>

On Wed, May 13, 2026 at 10:10:27PM +0200, David Hildenbrand (Arm) wrote:
> On 5/13/26 17:39, Breno Leitao wrote:
> >   * memory_failure() reaches identify_page_state() only after
> >     get_hwpoison_page() returned 1.  get_any_page() reaches that
> >     return only via __get_hwpoison_page(), which gates the refcount
> >     on HWPoisonHandlable().  HWPoisonHandlable() rejects PG_reserved
> >     pages, so they fail with -EBUSY/-EIO long before
> >     identify_page_state() runs.
> 
> You should clarify why they are rejected. There is no explicit check for
> PG_reserved in there!

True, I meant that PG_reserved pages do not fit any of the criteria of
HWPoisonHandlable().

I will rewrite it more explicitly:

	__get_hwpoison_page() only takes a refcount when the page is
	HWPoisonHandlable()'d, and HWPoisonHandlable() is an allowlist for LRU /
	free-buddy / (soft-offline) movable_ops pages.

Is that any better?

> >   * try_memory_failure_hugetlb() reaches identify_page_state() on
> >     the MF_HUGETLB_IN_USED branch, but the page is necessarily a
> >     hugetlb folio there.  The first table entry that matches a
> >     hugetlb folio is { head, head, MF_MSG_HUGE, me_huge_page }, so
> >     they dispatch to me_huge_page() before the (now-removed)
> >     reserved entry would have matched, regardless of whether
> >     PG_reserved happens to be set on the head page.
> 
> See hugetlb_folio_init_vmemmap(): we always clear PG_reserved for hugetlb folios
> allocated from memblock.

Thanks. I clearly see a call to __folio_clear_reserved(folio), so hugetlb
folios are never reserved.

A better summary would be:

	try_memory_failure_hugetlb() reaches identify_page_state() only via the
	MF_HUGETLB_IN_USED branch, as hugetlb folios don't carry PG_reserved at
	that point (hugetlb_folio_init_vmemmap() clears it during init).

> Yes, I think this should work.
> 
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks for the review,
--breno


* Re: [PATCH v7 4/6] mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()
From: Breno Leitao @ 2026-05-14 11:06 UTC
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <511dc52e-f2af-43c8-a9cf-19321b091dbe@kernel.org>

On Wed, May 13, 2026 at 09:49:28PM +0200, David Hildenbrand (Arm) wrote:
> On 5/13/26 17:39, Breno Leitao wrote:
> > The previous patch already classifies PG_reserved pages as
> > MF_MSG_KERNEL through the long path: get_hwpoison_page() calls
> > __get_hwpoison_page() which fails HWPoisonHandlable(), get_any_page()
> > exhausts its shake_page() retry budget, and the resulting
> > -ENOTRECOVERABLE is mapped to MF_MSG_KERNEL by the switch.  The
> > outcome is correct but the work in between is wasted: shake_page()
> > cannot turn a reserved page into a handlable one.
> 
> If really required, can we just move the check right there, into get_any_page() etc?

Sure, we might move it to get_any_page(). I took this current approach
based on the following facts:

1) Lance suggested it, and it sounded like a good idea.
	https://lore.kernel.org/all/20260512124837.38883-1-lance.yang@linux.dev/

2) There is a _similar_ check close to this one in memory_failure(),
   just before this one:

  if (TestSetPageHWPoison(p)) {
  	....
	action_result()
	goto unlock_mutex;
  }

  and now

  if (PageReserved(p)) {
	...
  	action_result()
	goto unlock_mutex;
  }

3) I wanted to give it a real layering point, not handwaving.

That said, I will short-circuit reserved pages inside get_any_page(), in
an updated version.

Again, thanks for the review and direction!
--breno


* Re: [RFC v7 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-05-14 11:43 UTC
  To: Steven Rostedt
  Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel
In-Reply-To: <20260513135741.12ddb97d@gandalf.local.home>

Hi Steven,

 ---- On Thu, 14 May 2026 01:57:41 +0800  Steven Rostedt <rostedt@goodmis.org> wrote --- 
 > On Mon, 11 May 2026 16:43:01 +0800
 > Li Chen <me@linux.beauty> wrote:
 > 
 > > @@ -1346,8 +1383,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 > >      }
 > >      ext4_fc_unlock(sb, alloc_ctx);
 > >  
 > > -    ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 > > +    ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
 > > +                      &snap_inodes, &snap_ranges, &snap_err);
 > >      jbd2_journal_unlock_updates(journal);
 > > +    if (trace_ext4_fc_lock_updates_enabled()) {
 > > +        locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
 > > +        trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
 > > +                       snap_inodes, snap_ranges, ret,
 > > +                       snap_err);
 > 
 > Please change this to:
 > 
 >         trace_call__ext4_fc_lock_updates(...)
 > 
 > As the "trace_ext4_fc_lock_updates_enabled()" already has the static
 > branch. No need to do it twice anymore. 7.1 introduced the
 > "trace_call__foo()" that will do a direct call to the tracepoints
 > registered, without the need for another static branch.

Thanks, will do it.


Regards,
Li



* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-05-14 12:34 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260513193342.GB2545104@noisy.programming.kicks-ass.net>

On Wed, May 13, 2026 at 09:33:42PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2026 at 05:09:34PM +0000, Dmitry Ilvokhin wrote:
> > Use the arch-overridable queued_spin_release(), introduced in the
> > previous commit, to ensure the tracepoint works correctly across all
> > architectures, including those with custom unlock implementations (e.g.
> > x86 paravirt).
> > 
> > When the tracepoint is disabled, the only addition to the hot path is a
> > single NOP instruction (the static branch). When enabled, the contention
> > check, trace call, and unlock are combined in an out-of-line function to
> > minimize hot path impact, avoiding the compiler needing to preserve the
> > lock pointer in a callee-saved register across the trace call.
> > 
> > Binary size impact (x86_64, defconfig):
> >   uninlined unlock (common case): +680 bytes  (+0.00%)
> >   inlined unlock (worst case):    +83659 bytes (+0.21%)
> > 
> > The inlined unlock case could not be achieved through Kconfig options on
> > x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
> > x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
> > inline the unlock path and estimate the worst case binary size increase.
> > 
> > In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
> > opted against binary size optimization, so the inlined worst case is
> > unlikely to be a concern.
> 
> This is not quite accurate. You add the (5byte) NOP for the static
> branch, but then you also add another 5 bytes for the CALL and at least
> another 2 bytes (possibly 5) for a JMP back into the previous stream.
> That is 12-15 bytes added to what was a single MOV instruction.
> 
> That is quite ludicrous.

Thanks for the feedback, Peter. This is exactly the kind of feedback I
was looking for.

I understand your concerns, and initially I had exactly the same
thoughts. After looking into the generated code more carefully, though,
the impact on the executed path turns out to be smaller than the total
size increase suggests.

Generated code of _raw_spin_unlock() for baseline (before the patch) is
31 bytes in total (x86_64, defconfig, GCC 11).

    3e0:  endbr64                          ; 4 bytes
    3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock)
    3e7:  decl %gs:__preempt_count         ; 7 bytes
    3ee:  je   3f5                         ; 2 bytes
    3f0:  jmp  __x86_return_thunk          ; 5 bytes
    3f5:  call __SCT__preempt_schedule     ; 5 bytes
    3fa:  jmp  __x86_return_thunk          ; 5 bytes

Generated code of _raw_spin_unlock() with tracepoint (after the patch
applied) is 40 bytes in total.

    bc0:  endbr64                          ; 4 bytes
    bc4:  xchg %ax,%ax                     ; 2 bytes (NOP, static branch)
    bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock)
    bc9:  decl %gs:__preempt_count         ; 7 bytes
    bd0:  je   bde                         ; 2 bytes
    bd2:  jmp  __x86_return_thunk          ; 5 bytes
    bd7:  call queued_spin_release_traced  ; 5 bytes
    bdc:  jmp  bc9                         ; 2 bytes
    bde:  call __SCT__preempt_schedule     ; 5 bytes
    be3:  jmp  __x86_return_thunk          ; 5 bytes

It is 40 bytes (+9 bytes compared to baseline, 2 bytes for NOP and 7
bytes for CALL and JMP).

But if we look at the executed path the picture is a bit different.

Baseline, in the best-case scenario with the fewest executed
instructions:

    3e0:  endbr64                          ; 4 bytes (always executed)
    3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
                                           ; always executed)
    3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
    3ee:  je   3f5                         ; 2 bytes (always executed)
    3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
                                           ; je is not taken)
                                           ; rest is not executed
    3f5:  call __SCT__preempt_schedule     ; 5 bytes
    3fa:  jmp  __x86_return_thunk          ; 5 bytes

Tracepoint (same best case, fewest executed instructions):

    bc0:  endbr64                          ; 4 bytes (always executed)
    bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is the
                                           ; only addition on the execution path)
    bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
    bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
    bd0:  je   bde                         ; 2 bytes (always executed)
    bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
                                           ; je is not taken)
                                           ; rest is not executed
    bd7:  call queued_spin_release_traced  ; 5 bytes
    bdc:  jmp  bc9                         ; 2 bytes
    bde:  call __SCT__preempt_schedule     ; 5 bytes
    be3:  jmp  __x86_return_thunk          ; 5 bytes

On the executed path we get 21 bytes' worth of instructions for the
baseline against 23 bytes with the tracepoint. The only addition on any
executed path is the 2-byte NOP, which CPUs handle specially: cheap, but
not entirely free.

From a total size perspective it's 9 bytes, but on the executed path it's
a single 2-byte NOP.
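
The byte counts above can be re-derived mechanically. The following is
only a sketch: the instruction sizes are copied from the listings in
this mail, while struct insn, sum_bytes() and the on_fast_path flag are
hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * One instruction from the listings: its size in bytes and whether the
 * best-case path (tracepoint disabled, no reschedule) executes it.
 */
struct insn {
	const char *text;
	unsigned int size;
	int on_fast_path;
};

static const struct insn baseline[] = {
	{ "endbr64",                      4, 1 },
	{ "movb $0x0,(%rdi)",             3, 1 },
	{ "decl %gs:__preempt_count",     7, 1 },
	{ "je 3f5",                       2, 1 },
	{ "jmp __x86_return_thunk",       5, 1 },
	{ "call __SCT__preempt_schedule", 5, 0 },
	{ "jmp __x86_return_thunk",       5, 0 },
};

static const struct insn traced[] = {
	{ "endbr64",                         4, 1 },
	{ "xchg %ax,%ax",                    2, 1 }, /* static branch NOP */
	{ "movb $0x0,(%rdi)",                3, 1 },
	{ "decl %gs:__preempt_count",        7, 1 },
	{ "je bde",                          2, 1 },
	{ "jmp __x86_return_thunk",          5, 1 },
	{ "call queued_spin_release_traced", 5, 0 },
	{ "jmp bc9",                         2, 0 },
	{ "call __SCT__preempt_schedule",    5, 0 },
	{ "jmp __x86_return_thunk",          5, 0 },
};

/* Sum instruction sizes; with fast_only set, count only the best-case path. */
static unsigned int sum_bytes(const struct insn *v, size_t n, int fast_only)
{
	unsigned int total = 0;

	for (size_t i = 0; i < n; i++) {
		assert(v[i].size > 0);
		if (!fast_only || v[i].on_fast_path)
			total += v[i].size;
	}
	return total;
}
```

Summing everything gives 31 and 40 bytes; summing only the fast path
gives 21 and 23 bytes, matching the figures quoted above.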

Does this change the picture for you, or is the NOP still a concern for
this path?

> 
> I disagree that UNINLINE_SPIN_UNLOCK=n opts against binary size. For x86
> the unlock is smaller than a function call.
> 

Fair point on the UNINLINE_SPIN_UNLOCK characterization, but
UNINLINE_SPIN_UNLOCK is always "y" on x86_64. The inlined case only
applies to s390 (unconditionally), csky and loongarch (when
!PREEMPTION). I'll remove this, thanks.

> 
> I really don't see how this is worth it.

^ permalink raw reply

* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-14 13:28 UTC (permalink / raw)
  To: leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-2-be2e578e61da@debian.org>


On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
>get_any_page() collapses three different failure modes into a single
>-EIO return:
>
>  * the put_page race in the !count_increased path;
>  * the HWPoisonHandlable() rejection that bounces out of
>    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
>  * the HWPoisonHandlable() rejection that goes through the
>    count_increased / put_page / shake_page retry loop.
>
>The first is transient (the page is racing with the allocator).  The
>second can be either transient (a userspace folio briefly off LRU
>during migration/compaction) or stable (slab/vmalloc/page-table/
>kernel-stack pages).  The third describes a stable kernel-owned page
>that the count_increased=true caller already held a reference on.
>
>Distinguish them on the return path: keep -EIO for both the put_page
>race and the -EBUSY-after-retries branch (shake_page() cannot drag a
>folio back from active migration, so we cannot prove the page is
>permanently kernel-owned from there), keep -EBUSY for the allocation
>race (unchanged), and return -ENOTRECOVERABLE only from the
>count_increased-true HWPoisonHandlable() rejection that exhausts its
>retries -- the caller's reference is structural evidence that the
>page is owned by the kernel.
>
>Extend the unhandlable-page pr_err() to fire for either errno and
>update the get_hwpoison_page() kerneldoc.
>
>memory_failure() still folds every negative return into
>MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>this patch is a no-op for users of memory_failure() and only changes
>the errno that soft_offline_page() can propagate to its callers.  A
>follow-up wires the new return code through memory_failure() and
>reports MF_MSG_KERNEL for the unrecoverable cases.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---
> mm/memory-failure.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
>diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>index 49bcfbd04d213..bae883df3ccb2 100644
>--- a/mm/memory-failure.c
>+++ b/mm/memory-failure.c
>@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> 				shake_page(p);
> 				goto try_again;
> 			}
>+			/*
>+			 * Return -EIO rather than -ENOTRECOVERABLE: this
>+			 * branch is also reached for pages that are merely
>+			 * off-LRU transiently (e.g. a folio in the middle
>+			 * of migration or compaction), which shake_page()
>+			 * cannot drag back.  The caller cannot prove the
>+			 * page is permanently kernel-owned from here, so
>+			 * keep it on the recoverable errno.
>+			 */
> 			ret = -EIO;
> 			goto out;
> 		}
>@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> 			goto try_again;
> 		}
> 		put_page(p);
>-		ret = -EIO;
>+		ret = -ENOTRECOVERABLE;
> 	}
> out:
>-	if (ret == -EIO)
>+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> 
> 	return ret;
>@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
>  *         -EIO for pages on which we can not handle memory errors,
>  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
>  *         operations like allocation and free,
>- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
>+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
>+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
>+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
>+ *         kernel stacks, and similar non-LRU/non-buddy pages).

Did you test this patch series? I don't see how we ever get to
-ENOTRECOVERABLE there ...

Even with MF_COUNT_INCREASED, the first pass does:

	if (flags & MF_COUNT_INCREASED)
		count_increased = true;

	[...]

	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
		ret = 1;
	} else {
		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
			put_page(p);
			shake_page(p);
			count_increased = false;
			goto try_again; <-
		}
		put_page(p);
		ret = -ENOTRECOVERABLE;
	}

Then we come back with count_increased=false:

try_again:
	if (!count_increased) {
		ret = __get_hwpoison_page(p, flags); <-
		if (!ret) {
		[...]
		} else if (ret == -EBUSY) { <-
		[...]
			ret = -EIO;
			goto out; <-
		}
	}

For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:

	if (!HWPoisonHandlable(&folio->page, flags))
		return -EBUSY;

so they still seem to end up as -EIO ... Am I missing something?
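
The trace above can be replayed with a small user-space model. This is
only a sketch of the control flow quoted in this mail, not the kernel
function: model_get_any_page() and model_get_hwpoison_page() are
hypothetical names, the "handlable" flag stands in for
HWPoisonHandlable(), put_page()/shake_page() are dropped, the "!ret"
branch is elided as in the quote, and the retry limit of 3 is an
assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define GET_PAGE_MAX_RETRY_NUM 3 /* assumed value, for illustration only */

/* Models __get_hwpoison_page(): unhandlable pages bounce with -EBUSY. */
static int model_get_hwpoison_page(bool handlable)
{
	return handlable ? 1 : -EBUSY;
}

/* Models the retry structure of get_any_page() after the patch. */
static int model_get_any_page(bool count_increased, bool handlable)
{
	int ret = 0, pass = 0;

try_again:
	if (!count_increased) {
		ret = model_get_hwpoison_page(handlable);
		if (ret == -EBUSY) {
			if (pass++ < GET_PAGE_MAX_RETRY_NUM)
				goto try_again;
			ret = -EIO;
			goto out;
		}
	}

	if (handlable) {
		ret = 1;
	} else {
		/* In this model, only a caller-held reference gets here. */
		assert(count_increased);
		if (pass++ < GET_PAGE_MAX_RETRY_NUM) {
			count_increased = false; /* first retry drops the flag */
			goto try_again;
		}
		ret = -ENOTRECOVERABLE;
	}
out:
	return ret;
}
```

In this model an unhandlable page yields -EIO whether or not the caller
passed count_increased, because the first retry clears the flag and
every later pass exits through the -EBUSY branch.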

>  */
> static int get_hwpoison_page(struct page *p, unsigned long flags)
> {
>
>-- 
>2.53.0-Meta
>
>

^ permalink raw reply

* [PATCH 0/7] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel

hi,
Andrii reported an issue with optimized uprobes [1]: the call
instruction stores its return address on the stack and can clobber the
red zone, where user code may keep temporary data without adjusting rsp.

Fix this by moving optimized uprobes on top of a 10-byte nop
instruction, so we can squeeze in another instruction to escape the
red zone before doing the call.
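
The address arithmetic behind the problem can be sketched in plain C.
This assumes the x86-64 SysV red zone of 128 bytes below rsp; the
helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RED_ZONE_SIZE 128 /* x86-64 SysV ABI: 128 bytes below %rsp */

/* Address where a CALL stores its return address, given %rsp before it. */
static uint64_t call_push_addr(uint64_t rsp)
{
	return rsp - 8;
}

/* %rsp after "lea -0x80(%rsp), %rsp". */
static uint64_t skip_red_zone(uint64_t rsp)
{
	return rsp - RED_ZONE_SIZE;
}

/* True if addr lies in the red zone of user code running with user_rsp. */
static bool in_red_zone(uint64_t addr, uint64_t user_rsp)
{
	assert(user_rsp >= RED_ZONE_SIZE);
	return addr >= user_rsp - RED_ZONE_SIZE && addr < user_rsp;
}
```

A bare call writes at rsp - 8, which is inside the red zone; with the
lea in front, the write lands at rsp - 136, below it.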

Note that patch 3 needs an upstream update first (github.com/libbpf/usdt),
if we decide to take this change.

thanks,
jirka


[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
---
Andrii Nakryiko (1):
      selftests/bpf: Add tests for uprobe nop10 red zone clobbering

Jiri Olsa (6):
      uprobes/x86: Move optimized uprobe from nop5 to nop10
      libbpf: Change has_nop_combo to work on top of nop10
      selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
      selftests/bpf: Change uprobe syscall tests to use nop10
      selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
      selftests/bpf: Add reattach tests for uprobe syscall

 arch/x86/kernel/uprobes.c                               | 121 ++++++++++++++++++++++++++++------------
 tools/lib/bpf/usdt.c                                    |  16 +++---
 tools/testing/selftests/bpf/bench.c                     |  20 +++----
 tools/testing/selftests/bpf/benchs/bench_trigger.c      |  38 ++++++-------
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 tools/testing/selftests/bpf/prog_tests/usdt.c           |  74 +++++++++++++++++++++----
 tools/testing/selftests/bpf/progs/test_usdt.c           |  25 +++++++++
 tools/testing/selftests/bpf/usdt.h                      |   2 +-
 tools/testing/selftests/bpf/usdt_2.c                    |  15 ++++-
 10 files changed, 423 insertions(+), 107 deletions(-)

^ permalink raw reply

* [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

Andrii reported an issue with optimized uprobes [1]: the call
instruction stores its return address on the stack and can clobber the
red zone, where user code may keep temporary data without adjusting rsp.

Fix this by moving optimized uprobes on top of a 10-byte nop
instruction, so we can squeeze in another instruction to escape the
red zone before doing the call, like:

  lea -0x80(%rsp), %rsp
  call tramp

Note the lea instruction is used to adjust the rsp register without
changing the flags.
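
As a rough user-space illustration of the resulting 10-byte sequence
(not the kernel code, which uses __text_gen_insn()): the lea bytes and
the size macros are taken from the diff below, while encode_optimized()
is a hypothetical helper. It assumes a little-endian host, as on x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LEA_INSN_SIZE	5
#define CALL_INSN_SIZE	5
#define OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)

/* lea -0x80(%rsp), %rsp */
static const uint8_t lea_rsp[LEA_INSN_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/*
 * Fill out[] with "lea -0x80(%rsp), %rsp; call tramp" for a probe at
 * vaddr. Returns false if tramp is not reachable with a rel32 call.
 */
static bool encode_optimized(uint8_t out[OPT_INSN_SIZE], uint64_t vaddr,
			     uint64_t tramp)
{
	/* rel32 is relative to the end of the call, i.e. vaddr + 10 */
	int64_t delta = (int64_t)(tramp - (vaddr + OPT_INSN_SIZE));
	int32_t rel;

	assert(out != NULL);
	if (delta < INT32_MIN || delta > INT32_MAX)
		return false;

	rel = (int32_t)delta;
	memcpy(out, lea_rsp, LEA_INSN_SIZE);
	out[LEA_INSN_SIZE] = 0xe8;	/* call rel32 opcode */
	memcpy(out + LEA_INSN_SIZE + 1, &rel, sizeof(rel));
	return true;
}
```

The displacement is measured from vaddr + 10 rather than vaddr + 5,
which is why the constant 5 is replaced by OPT_INSN_SIZE throughout the
patch.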

The optimized uprobe performance stays the same:

        uprobe-nop     :    3.129 ± 0.013M/s
        uprobe-push    :    3.045 ± 0.006M/s
        uprobe-ret     :    1.095 ± 0.004M/s
  -->   uprobe-nop10   :    7.170 ± 0.020M/s
        uretprobe-nop  :    2.143 ± 0.021M/s
        uretprobe-push :    2.090 ± 0.000M/s
        uretprobe-ret  :    0.942 ± 0.000M/s
  -->   uretprobe-nop10:    3.381 ± 0.003M/s
        usdt-nop       :    3.245 ± 0.004M/s
  -->   usdt-nop10     :    7.256 ± 0.023M/s

[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
 1 file changed, 86 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..f7c4101a4039 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,21 @@ struct uprobe_trampoline {
 	unsigned long		vaddr;
 };
 
+#define LEA_INSN_SIZE		5
+#define OPT_INSN_SIZE		(LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define OPT_JMP8_OFFSET		(OPT_INSN_SIZE - JMP8_INSN_SIZE)
+#define REDZONE_SIZE		0x80
+
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
 static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
 {
-	long delta = (long)(vaddr + 5 - vtramp);
+	long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
 
 	return delta >= INT_MIN && delta <= INT_MAX;
 }
@@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
 	};
 	unsigned long low_limit, high_limit;
 	unsigned long low_tramp, high_tramp;
-	unsigned long call_end = vaddr + 5;
+	unsigned long call_end = vaddr + OPT_INSN_SIZE;
 
 	if (check_add_overflow(call_end, INT_MIN, &low_limit))
 		low_limit = PAGE_SIZE;
@@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)
 	regs->ax  = args.ax;
 	regs->r11 = args.r11;
 	regs->cx  = args.cx;
-	regs->ip  = args.retaddr - 5;
-	regs->sp += sizeof(args);
+	regs->ip  = args.retaddr - OPT_INSN_SIZE;
+	regs->sp += sizeof(args) + REDZONE_SIZE;
 	regs->orig_ax = -1;
 
 	sp = regs->sp;
@@ -844,12 +856,12 @@ SYSCALL_DEFINE0(uprobe)
 	 */
 	if (regs->sp != sp) {
 		/* skip the trampoline call */
-		if (args.retaddr - 5 == regs->ip)
-			regs->ip += 5;
+		if (args.retaddr - OPT_INSN_SIZE == regs->ip)
+			regs->ip += OPT_INSN_SIZE;
 		return regs->ax;
 	}
 
-	regs->sp -= sizeof(args);
+	regs->sp -= sizeof(args) + REDZONE_SIZE;
 
 	/* for the case uprobe_consumer has changed ax/r11/cx */
 	args.ax  = regs->ax;
@@ -857,7 +869,7 @@ SYSCALL_DEFINE0(uprobe)
 	args.cx  = regs->cx;
 
 	/* keep return address unless we are instructed otherwise */
-	if (args.retaddr - 5 != regs->ip)
+	if (args.retaddr - OPT_INSN_SIZE != regs->ip)
 		args.retaddr = regs->ip;
 
 	if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +903,7 @@ asm (
 	"pop %rax\n"
 	"pop %r11\n"
 	"pop %rcx\n"
-	"ret\n"
+	"ret $" __stringify(REDZONE_SIZE) "\n"
 	"int3\n"
 	".balign " __stringify(PAGE_SIZE) "\n"
 	".popsection\n"
@@ -909,7 +921,7 @@ late_initcall(arch_uprobes_init);
 
 enum {
 	EXPECT_SWBP,
-	EXPECT_CALL,
+	EXPECT_OPTIMIZED,
 };
 
 struct write_opcode_ctx {
@@ -930,17 +942,18 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 		       int nbytes, void *data)
 {
 	struct write_opcode_ctx *ctx = data;
-	uprobe_opcode_t old_opcode[5];
+	uprobe_opcode_t old_opcode[OPT_INSN_SIZE];
 
-	uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+	uprobe_copy_from_page(page, ctx->base, old_opcode, OPT_INSN_SIZE);
 
 	switch (ctx->expect) {
 	case EXPECT_SWBP:
 		if (is_swbp_insn(&old_opcode[0]))
 			return 1;
 		break;
-	case EXPECT_CALL:
-		if (is_call_insn(&old_opcode[0]))
+	case EXPECT_OPTIMIZED:
+		if (is_lea_insn(&old_opcode[0]) &&
+		    is_call_insn(&old_opcode[LEA_INSN_SIZE]))
 			return 1;
 		break;
 	}
@@ -963,7 +976,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
  *   - SMP sync all CPUs
  */
 static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
-		       unsigned long vaddr, char *insn, bool optimize)
+		       unsigned long vaddr, char *insn, int size, bool optimize)
 {
 	uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
 	struct write_opcode_ctx ctx = {
@@ -978,7 +991,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 	 * so we can skip this step for optimize == true.
 	 */
 	if (!optimize) {
-		ctx.expect = EXPECT_CALL;
+		ctx.expect = EXPECT_OPTIMIZED;
 		err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
 				   true /* is_register */, false /* do_update_ref_ctr */,
 				   &ctx);
@@ -990,7 +1003,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 
 	/* Write all but the first byte of the patched range. */
 	ctx.expect = EXPECT_SWBP;
-	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
 			   true /* is_register */, false /* do_update_ref_ctr */,
 			   &ctx);
 	if (err)
@@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			 unsigned long vaddr, unsigned long tramp)
 {
-	u8 call[5];
+	u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE];
 
-	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
+	/*
+	 * We have nop10 instruction (with first byte overwritten to int3),
+	 * changing it to:
+	 *   lea -0x80(%rsp), %rsp
+	 *   call tramp
+	 */
+	memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+	__text_gen_insn(call, CALL_INSN_OPCODE,
+			(const void *) (vaddr + LEA_INSN_SIZE),
 			(const void *) tramp, CALL_INSN_SIZE);
-	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+	return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* optimize */);
 }
 
 static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			   unsigned long vaddr)
 {
-	return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+	/*
+	 * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
+	 * end of the 10-byte slot instead of restoring the original nop10,
+	 * because we could have thread already inside lea instruction.
+	 */
+	u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET };
+
+	return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* optimize */);
 }
 
 static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1077,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 	struct __packed __arch_relative_insn {
 		u8 op;
 		s32 raddr;
-	} *call = (struct __arch_relative_insn *) insn;
+	} *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
 
-	if (!is_call_insn(insn))
+	if (!is_lea_insn(insn))
+		return false;
+	if (!is_call_insn(insn + LEA_INSN_SIZE))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+	return __in_uprobe_trampoline(vaddr + OPT_INSN_SIZE + call->raddr);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
 {
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[OPT_INSN_SIZE];
 	int err;
 
-	err = copy_from_vaddr(mm, vaddr, &insn, 5);
+	err = copy_from_vaddr(mm, vaddr, &insn, OPT_INSN_SIZE);
 	if (err)
 		return err;
 	return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1095,14 +1125,25 @@ int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 		  unsigned long vaddr)
 {
 	if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
-		int ret = is_optimized(vma->vm_mm, vaddr);
-		if (ret < 0)
+		uprobe_opcode_t insn[OPT_INSN_SIZE];
+		int ret;
+
+		ret = copy_from_vaddr(vma->vm_mm, vaddr, &insn, OPT_INSN_SIZE);
+		if (ret)
 			return ret;
-		if (ret) {
+		if (__is_optimized((uprobe_opcode_t *)&insn, vaddr)) {
 			ret = swbp_unoptimize(auprobe, vma, vaddr);
 			WARN_ON_ONCE(ret);
 			return ret;
 		}
+		/*
+		 * We can have re-attached probe on top of jmp8 instruction,
+		 * which did not get optimized. We need to restore the jmp8
+		 * instruction, instead of the original instruction (nop10).
+		 */
+		if (is_swbp_insn(&insn[0]) && insn[1] == OPT_JMP8_OFFSET)
+			return uprobe_write_opcode(auprobe, vma, vaddr, JMP8_INSN_OPCODE,
+						   false /* is_register */);
 	}
 	return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
 				   false /* is_register */);
@@ -1131,7 +1172,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
 void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[OPT_INSN_SIZE];
 
 	if (!should_optimize(auprobe))
 		return;
@@ -1142,7 +1183,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 	 * Check if some other thread already optimized the uprobe for us,
 	 * if it's the case just go away silently.
 	 */
-	if (copy_from_vaddr(mm, vaddr, &insn, 5))
+	if (copy_from_vaddr(mm, vaddr, &insn, OPT_INSN_SIZE))
 		goto unlock;
 	if (!is_swbp_insn((uprobe_opcode_t*) &insn))
 		goto unlock;
@@ -1160,14 +1201,24 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 
 static bool can_optimize(struct insn *insn, unsigned long vaddr)
 {
-	if (!insn->x86_64 || insn->length != 5)
+	if (!insn->x86_64)
 		return false;
 
-	if (!insn_is_nop(insn))
+	/* We can't do cross page atomic writes yet. */
+	if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < OPT_INSN_SIZE)
 		return false;
 
-	/* We can't do cross page atomic writes yet. */
-	return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+	/* We can optimize on top of nop10.. */
+	if (insn->length == OPT_INSN_SIZE && insn_is_nop(insn))
+		return true;
+
+	/* .. and JMP rel8 to end of slot — check swbp_unoptimize. */
+	if (insn->length == 2 &&
+	    insn->opcode.bytes[0] == JMP8_INSN_OPCODE &&
+	    insn->immediate.value == OPT_JMP8_OFFSET)
+		return true;
+
+	return false;
 }
 #else /* 32-bit: */
 /*
-- 
2.53.0


^ permalink raw reply related

* [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

We now expect the nop combo to use a 10-byte nop instead of a 5-byte
nop; fix has_nop_combo to reflect that.
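
For illustration only, the same check on an in-memory buffer instead of
a file descriptor; the byte pattern is the one used in the diff below,
and buf_has_nop_combo() is a hypothetical name, not a libbpf function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* nop (1 byte) followed by nop10 (10 bytes) */
static const unsigned char nop_combo[11] = {
	0x90,
	0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* True if buf starts with the nop,nop10 combo. */
static bool buf_has_nop_combo(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	if (len < sizeof(nop_combo))
		return false;
	return memcmp(buf, nop_combo, sizeof(nop_combo)) == 0;
}
```

A binary built with the old nop,nop5 combo no longer matches, so its
probes fall back to the regular (non-optimized) placement.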

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/bpf/usdt.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
index e3710933fd52..7e62e4d5bedd 100644
--- a/tools/lib/bpf/usdt.c
+++ b/tools/lib/bpf/usdt.c
@@ -305,7 +305,7 @@ struct usdt_manager *usdt_manager_new(struct bpf_object *obj)
 
 	/*
 	 * Detect kernel support for uprobe() syscall, it's presence means we can
-	 * take advantage of faster nop5 uprobe handling.
+	 * take advantage of faster nop10 uprobe handling.
 	 * Added in: 56101b69c919 ("uprobes/x86: Add uprobe syscall to speed up uprobe")
 	 */
 	man->has_uprobe_syscall = kernel_supports(obj, FEAT_UPROBE_SYSCALL);
@@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
 #if defined(__x86_64__)
 static bool has_nop_combo(int fd, long off)
 {
-	unsigned char nop_combo[6] = {
-		0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
+	unsigned char nop_combo[11] = {
+		0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
 	};
-	unsigned char buf[6];
+	unsigned char buf[11];
 
-	if (pread(fd, buf, 6, off) != 6)
+	if (pread(fd, buf, 11, off) != 11)
 		return false;
-	return memcmp(buf, nop_combo, 6) == 0;
+	return memcmp(buf, nop_combo, 11) == 0;
 }
 #else
 static bool has_nop_combo(int fd, long off)
@@ -814,8 +814,8 @@ static int collect_usdt_targets(struct usdt_manager *man, struct elf_fd *elf_fd,
 		memset(target, 0, sizeof(*target));
 
 		/*
-		 * We have uprobe syscall and usdt with nop,nop5 instructions combo,
-		 * so we can place the uprobe directly on nop5 (+1) and get this probe
+		 * We have uprobe syscall and usdt with nop,nop10 instructions combo,
+		 * so we can place the uprobe directly on nop10 (+1) and get this probe
 		 * optimized.
 		 */
 		if (man->has_uprobe_syscall && has_nop_combo(elf_fd->fd, usdt_rel_ip)) {
-- 
2.53.0


^ permalink raw reply related

* [PATCH 3/7] selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

Sync the latest usdt.h change [1].

Now that we have nop10 optimization support in the kernel, let's emit
nop,nop10 for usdt probes. We leave it up to the library to use the
desirable nop instruction.

[1] TBD
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/usdt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/usdt.h b/tools/testing/selftests/bpf/usdt.h
index c71e21df38b3..d359663b9c32 100644
--- a/tools/testing/selftests/bpf/usdt.h
+++ b/tools/testing/selftests/bpf/usdt.h
@@ -313,7 +313,7 @@ struct usdt_sema { volatile unsigned short active; };
 #if defined(__ia64__) || defined(__s390__) || defined(__s390x__)
 #define USDT_NOP			nop 0
 #elif defined(__x86_64__)
-#define USDT_NOP                       .byte 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x0 /* nop, nop5 */
+#define USDT_NOP                       .byte 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 /* nop, nop10 */
 #else
 #define USDT_NOP			nop
 #endif
-- 
2.53.0


^ permalink raw reply related

* [PATCH 4/7] selftests/bpf: Change uprobe syscall tests to use nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

Optimized uprobes are now placed on top of 10-byte nop instructions;
reflect that in the existing tests.
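
A sketch of the three states the 10-byte slot can be in, as the updated
tests expect them: the original nop10, lea + call after optimization,
and a 2-byte jmp to the end of the slot after detach. The byte values
come from the diff below; enum slot_state and classify_slot() are
hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const uint8_t nop10[10] = {
	0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
};
static const uint8_t lea_rsp[5] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

enum slot_state {
	SLOT_NOP10,	/* untouched probe site */
	SLOT_OPTIMIZED,	/* lea -0x80(%rsp), %rsp; call tramp */
	SLOT_JMP8,	/* jmp +8, left behind after detach */
	SLOT_OTHER,
};

/* Classify the 10 bytes found at a probe site. */
static enum slot_state classify_slot(const uint8_t slot[10])
{
	assert(slot != NULL);
	if (!memcmp(slot, nop10, sizeof(nop10)))
		return SLOT_NOP10;
	if (!memcmp(slot, lea_rsp, sizeof(lea_rsp)) && slot[5] == 0xe8)
		return SLOT_OPTIMIZED;
	if (slot[0] == 0xeb && slot[1] == 8)
		return SLOT_JMP8;
	return SLOT_OTHER;
}
```

The jmp offset is 8 because the 2-byte jmp has to skip the remaining 8
bytes of the slot.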

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/benchs/bench_trigger.c      |  2 +-
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 29 ++++++++++---------
 tools/testing/selftests/bpf/prog_tests/usdt.c | 25 +++++++++-------
 tools/testing/selftests/bpf/usdt_2.c          |  2 +-
 4 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 2f22ec61667b..bcc4820c802e 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -398,7 +398,7 @@ static void *uprobe_producer_ret(void *input)
 #ifdef __x86_64__
 __nocf_check __weak void uprobe_target_nop5(void)
 {
-	asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
+	asm volatile (".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
 }
 
 static void *uprobe_producer_nop5(void *input)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 955a37751b52..c2e9e549c737 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -17,7 +17,7 @@
 #include "uprobe_syscall_executed.skel.h"
 #include "bpf/libbpf_internal.h"
 
-#define USDT_NOP .byte 0x0f, 0x1f, 0x44, 0x00, 0x00
+#define USDT_NOP .byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
 #include "usdt.h"
 
 #pragma GCC diagnostic ignored "-Wattributes"
@@ -26,7 +26,7 @@ __attribute__((aligned(16)))
 __nocf_check __weak __naked unsigned long uprobe_regs_trigger(void)
 {
 	asm volatile (
-		".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n" /* nop5 */
+		".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
 		"movq $0xdeadbeef, %rax\n"
 		"ret\n"
 	);
@@ -345,9 +345,9 @@ static void test_uretprobe_syscall_call(void)
 __attribute__((aligned(16)))
 __nocf_check __weak __naked void uprobe_test(void)
 {
-	asm volatile ("					\n"
-		".byte 0x0f, 0x1f, 0x44, 0x00, 0x00	\n"
-		"ret					\n"
+	asm volatile (
+		".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
+		"ret\n"
 	);
 }
 
@@ -388,14 +388,16 @@ static int find_uprobes_trampoline(void *tramp_addr)
 	return ret;
 }
 
-static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char jmp2B[2]   = { 0xeb, 8 };
+static unsigned char nop10[10]  = { 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
+static unsigned char lea_rsp[5] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
 
-static void *find_nop5(void *fn)
+static void *find_nop10(void *fn)
 {
 	int i;
 
-	for (i = 0; i < 10; i++) {
-		if (!memcmp(nop5, fn + i, 5))
+	for (i = 0; i < 128; i++) {
+		if (!memcmp(nop10, fn + i, 9))
 			return fn + i;
 	}
 	return NULL;
@@ -420,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
 	ASSERT_EQ(skel->bss->executed, executed, "executed");
 
 	/* .. and check the trampoline is as expected. */
-	call = (struct __arch_relative_insn *) addr;
+	ASSERT_OK(memcmp(addr, lea_rsp, 4), "lea_rsp");
+	call = (struct __arch_relative_insn *)(addr + 5);
 	tramp = (void *) (call + 1) + call->raddr;
 	ASSERT_EQ(call->op, 0xe8, "call");
 	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
@@ -432,7 +435,7 @@ static void check_detach(void *addr, void *tramp)
 {
 	/* [uprobes_trampoline] stays after detach */
 	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
-	ASSERT_OK(memcmp(addr, nop5, 5), "nop5");
+	ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B");
 }
 
 static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
@@ -568,8 +571,8 @@ static void test_uprobe_usdt(void)
 	void *addr;
 
 	errno = 0;
-	addr = find_nop5(usdt_test);
-	if (!ASSERT_OK_PTR(addr, "find_nop5"))
+	addr = find_nop10(usdt_test);
+	if (!ASSERT_OK_PTR(addr, "find_nop10"))
 		return;
 
 	skel = uprobe_syscall_executed__open_and_load();
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 69759b27794d..be34c4087ff5 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -252,7 +252,7 @@ extern void usdt_1(void);
 extern void usdt_2(void);
 
 static unsigned char nop1[1] = { 0x90 };
-static unsigned char nop1_nop5_combo[6] = { 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char nop1_nop10_combo[11] = { 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
 
 static void *find_instr(void *fn, unsigned char *instr, size_t cnt)
 {
@@ -271,17 +271,17 @@ static void subtest_optimized_attach(void)
 	__u8 *addr_1, *addr_2;
 
 	/* usdt_1 USDT probe has single nop instruction */
-	addr_1 = find_instr(usdt_1, nop1_nop5_combo, 6);
-	if (!ASSERT_NULL(addr_1, "usdt_1_find_nop1_nop5_combo"))
+	addr_1 = find_instr(usdt_1, nop1_nop10_combo, 6);
+	if (!ASSERT_NULL(addr_1, "usdt_1_find_nop1_nop10_combo"))
 		return;
 
 	addr_1 = find_instr(usdt_1, nop1, 1);
 	if (!ASSERT_OK_PTR(addr_1, "usdt_1_find_nop1"))
 		return;
 
-	/* usdt_2 USDT probe has nop,nop5 instructions combo */
-	addr_2 = find_instr(usdt_2, nop1_nop5_combo, 6);
-	if (!ASSERT_OK_PTR(addr_2, "usdt_2_find_nop1_nop5_combo"))
+	/* usdt_2 USDT probe has nop,nop10 instructions combo */
+	addr_2 = find_instr(usdt_2, nop1_nop10_combo, 6);
+	if (!ASSERT_OK_PTR(addr_2, "usdt_2_find_nop1_nop10_combo"))
 		return;
 
 	skel = test_usdt__open_and_load();
@@ -309,12 +309,12 @@ static void subtest_optimized_attach(void)
 
 	bpf_link__destroy(skel->links.usdt_executed);
 
-	/* we expect the nop5 ip */
+	/* we expect the nop10 ip */
 	skel->bss->expected_ip = (unsigned long) addr_2 + 1;
 
 	/*
 	 * Attach program on top of usdt_2 which is probe defined on top
-	 * of nop1,nop5 combo, so the probe gets optimized on top of nop5.
+	 * of nop1,nop10 combo, so the probe gets optimized on top of nop10.
 	 */
 	skel->links.usdt_executed = bpf_program__attach_usdt(skel->progs.usdt_executed,
 						     0 /*self*/, "/proc/self/exe",
@@ -328,8 +328,13 @@ static void subtest_optimized_attach(void)
 	/* nop stays on addr_2 address */
 	ASSERT_EQ(*addr_2, 0x90, "nop");
 
-	/* call is on addr_2 + 1 address */
-	ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
+	/*
+	 * lea -0x80(%rsp), %rsp
+	 * call ...
+	 */
+	static unsigned char expected[] = { 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8 };
+
+	ASSERT_MEMEQ(addr_2 + 1, expected, sizeof(expected), "lea_and_call");
 	ASSERT_EQ(skel->bss->executed, 4, "executed");
 
 cleanup:
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index 789883aaca4c..b359b389f6c0 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -3,7 +3,7 @@
 #if defined(__x86_64__)
 
 /*
- * Include usdt.h with default nop,nop5 instructions combo.
+ * Include usdt.h with default nop,nop10 instructions combo.
  */
 #include "usdt.h"
 
-- 
2.53.0


* [PATCH 5/7] selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

Changing uprobe/usdt trigger bench code to use nop10 instead
of nop5. Also changing run_bench_uprobes.sh to use nop10 triggers.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/bench.c           | 20 +++++------
 .../selftests/bpf/benchs/bench_trigger.c      | 36 +++++++++----------
 .../selftests/bpf/benchs/run_bench_uprobes.sh |  2 +-
 3 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 6155ce455c27..1252a1af2e84 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -539,12 +539,12 @@ extern const struct bench bench_trig_uretprobe_multi_push;
 extern const struct bench bench_trig_uprobe_multi_ret;
 extern const struct bench bench_trig_uretprobe_multi_ret;
 #ifdef __x86_64__
-extern const struct bench bench_trig_uprobe_nop5;
-extern const struct bench bench_trig_uretprobe_nop5;
-extern const struct bench bench_trig_uprobe_multi_nop5;
-extern const struct bench bench_trig_uretprobe_multi_nop5;
+extern const struct bench bench_trig_uprobe_nop10;
+extern const struct bench bench_trig_uretprobe_nop10;
+extern const struct bench bench_trig_uprobe_multi_nop10;
+extern const struct bench bench_trig_uretprobe_multi_nop10;
 extern const struct bench bench_trig_usdt_nop;
-extern const struct bench bench_trig_usdt_nop5;
+extern const struct bench bench_trig_usdt_nop10;
 #endif
 
 extern const struct bench bench_rb_libbpf;
@@ -619,12 +619,12 @@ static const struct bench *benchs[] = {
 	&bench_trig_uprobe_multi_ret,
 	&bench_trig_uretprobe_multi_ret,
 #ifdef __x86_64__
-	&bench_trig_uprobe_nop5,
-	&bench_trig_uretprobe_nop5,
-	&bench_trig_uprobe_multi_nop5,
-	&bench_trig_uretprobe_multi_nop5,
+	&bench_trig_uprobe_nop10,
+	&bench_trig_uretprobe_nop10,
+	&bench_trig_uprobe_multi_nop10,
+	&bench_trig_uretprobe_multi_nop10,
 	&bench_trig_usdt_nop,
-	&bench_trig_usdt_nop5,
+	&bench_trig_usdt_nop10,
 #endif
 	/* ringbuf/perfbuf benchmarks */
 	&bench_rb_libbpf,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index bcc4820c802e..3998ea8ff9aa 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -396,15 +396,15 @@ static void *uprobe_producer_ret(void *input)
 }
 
 #ifdef __x86_64__
-__nocf_check __weak void uprobe_target_nop5(void)
+__nocf_check __weak void uprobe_target_nop10(void)
 {
 	asm volatile (".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
 }
 
-static void *uprobe_producer_nop5(void *input)
+static void *uprobe_producer_nop10(void *input)
 {
 	while (true)
-		uprobe_target_nop5();
+		uprobe_target_nop10();
 	return NULL;
 }
 
@@ -418,7 +418,7 @@ static void *uprobe_producer_usdt_nop(void *input)
 	return NULL;
 }
 
-static void *uprobe_producer_usdt_nop5(void *input)
+static void *uprobe_producer_usdt_nop10(void *input)
 {
 	while (true)
 		usdt_2();
@@ -542,24 +542,24 @@ static void uretprobe_multi_ret_setup(void)
 }
 
 #ifdef __x86_64__
-static void uprobe_nop5_setup(void)
+static void uprobe_nop10_setup(void)
 {
-	usetup(false, false /* !use_multi */, &uprobe_target_nop5);
+	usetup(false, false /* !use_multi */, &uprobe_target_nop10);
 }
 
-static void uretprobe_nop5_setup(void)
+static void uretprobe_nop10_setup(void)
 {
-	usetup(true, false /* !use_multi */, &uprobe_target_nop5);
+	usetup(true, false /* !use_multi */, &uprobe_target_nop10);
 }
 
-static void uprobe_multi_nop5_setup(void)
+static void uprobe_multi_nop10_setup(void)
 {
-	usetup(false, true /* use_multi */, &uprobe_target_nop5);
+	usetup(false, true /* use_multi */, &uprobe_target_nop10);
 }
 
-static void uretprobe_multi_nop5_setup(void)
+static void uretprobe_multi_nop10_setup(void)
 {
-	usetup(true, true /* use_multi */, &uprobe_target_nop5);
+	usetup(true, true /* use_multi */, &uprobe_target_nop10);
 }
 
 static void usdt_setup(const char *name)
@@ -598,7 +598,7 @@ static void usdt_nop_setup(void)
 	usdt_setup("usdt_1");
 }
 
-static void usdt_nop5_setup(void)
+static void usdt_nop10_setup(void)
 {
 	usdt_setup("usdt_2");
 }
@@ -665,10 +665,10 @@ BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
 BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
 BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
 #ifdef __x86_64__
-BENCH_TRIG_USERMODE(uprobe_nop5, nop5, "uprobe-nop5");
-BENCH_TRIG_USERMODE(uretprobe_nop5, nop5, "uretprobe-nop5");
-BENCH_TRIG_USERMODE(uprobe_multi_nop5, nop5, "uprobe-multi-nop5");
-BENCH_TRIG_USERMODE(uretprobe_multi_nop5, nop5, "uretprobe-multi-nop5");
+BENCH_TRIG_USERMODE(uprobe_nop10, nop10, "uprobe-nop10");
+BENCH_TRIG_USERMODE(uretprobe_nop10, nop10, "uretprobe-nop10");
+BENCH_TRIG_USERMODE(uprobe_multi_nop10, nop10, "uprobe-multi-nop10");
+BENCH_TRIG_USERMODE(uretprobe_multi_nop10, nop10, "uretprobe-multi-nop10");
 BENCH_TRIG_USERMODE(usdt_nop, usdt_nop, "usdt-nop");
-BENCH_TRIG_USERMODE(usdt_nop5, usdt_nop5, "usdt-nop5");
+BENCH_TRIG_USERMODE(usdt_nop10, usdt_nop10, "usdt-nop10");
 #endif
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
index 9ec59423b949..e490b337e960 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
@@ -2,7 +2,7 @@
 
 set -eufo pipefail
 
-for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop5} usdt-nop usdt-nop5
+for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop10} usdt-nop usdt-nop10
 do
 	summary=$(sudo ./bench -w2 -d5 -a trig-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
 	printf "%-15s: %s\n" $i "$summary"
-- 
2.53.0


* [PATCH 6/7] selftests/bpf: Add reattach tests for uprobe syscall
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

Adding reattach tests for uprobe syscall tests to make sure
we can re-attach and optimize the same uprobe multiple times.

The reason is that an optimized uprobe does not restore the original
nop10 after detach, but leaves a 'jmp 8' instruction in its place.

Making sure we can still install and optimize a uprobe on top
of the 'jmp 8' instruction.
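
For illustration only, a minimal sketch of why a two-byte jump can stand
in for the 10-byte nop. It assumes 'jmp 8' is the x86 short JMP encoding
(opcode 0xeb followed by an 8-bit displacement of 8); the actual jmp2B
bytes are defined elsewhere in the test file, and short_jmp_target() is a
made-up helper, not part of the selftest:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode an x86 short JMP (0xeb rel8) located at
 * byte offset 'off' and return the offset it jumps to. The displacement
 * is signed and relative to the end of the 2-byte instruction.
 * Returns -1 if the bytes are not a short JMP.
 */
static long short_jmp_target(const uint8_t *insn, long off)
{
	assert(insn != NULL);

	if (insn[0] != 0xeb)
		return -1;
	return off + 2 + (int8_t)insn[1];
}
```

With the assumed bytes { 0xeb, 0x08 } at offset N, the helper returns
N + 10, which is the first byte after the original nop10 slot, so the
remaining eight bytes are never executed.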

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 115 ++++++++++++++++--
 1 file changed, 105 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index c2e9e549c737..82b3c0ce9253 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -431,21 +431,27 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
 	return tramp;
 }
 
-static void check_detach(void *addr, void *tramp)
+static bool check_detach(void *addr, void *tramp)
 {
+	bool ok = true;
+
 	/* [uprobes_trampoline] stays after detach */
-	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
-	ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B");
+	if (!ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline"))
+		ok = false;
+	if (!ASSERT_OK(memcmp(addr, jmp2B, 2), "jmp2B"))
+		ok = false;
+	return ok;
 }
 
-static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
-		  trigger_t trigger, void *addr, int executed)
+static void *check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
+		   trigger_t trigger, void *addr, int executed)
 {
 	void *tramp;
 
 	tramp = check_attach(skel, trigger, addr, executed);
 	bpf_link__destroy(link);
 	check_detach(addr, tramp);
+	return tramp;
 }
 
 static void test_uprobe_legacy(void)
@@ -456,6 +462,7 @@ static void test_uprobe_legacy(void)
 	);
 	struct bpf_link *link;
 	unsigned long offset;
+	void *tramp;
 
 	offset = get_uprobe_offset(&uprobe_test);
 	if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -473,7 +480,28 @@ static void test_uprobe_legacy(void)
 	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
 		goto cleanup;
 
-	check(skel, link, uprobe_test, uprobe_test, 2);
+	tramp = check(skel, link, uprobe_test, uprobe_test, 2);
+
+	/* reattach and detach without triggering optimization */
+	link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+					       0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+		goto cleanup;
+
+	bpf_link__destroy(link);
+	if (!check_detach(uprobe_test, tramp))
+		goto cleanup;
+
+	uprobe_test();
+	ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+	/* reattach with triggering optimization */
+	link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+				0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+		goto cleanup;
+
+	check(skel, link, uprobe_test, uprobe_test, 4);
 
 	/* uretprobe */
 	skel->bss->executed = 0;
@@ -495,6 +523,7 @@ static void test_uprobe_multi(void)
 	LIBBPF_OPTS(bpf_uprobe_multi_opts, opts);
 	struct bpf_link *link;
 	unsigned long offset;
+	void *tramp;
 
 	offset = get_uprobe_offset(&uprobe_test);
 	if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -515,7 +544,28 @@ static void test_uprobe_multi(void)
 	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
 		goto cleanup;
 
-	check(skel, link, uprobe_test, uprobe_test, 2);
+	tramp = check(skel, link, uprobe_test, uprobe_test, 2);
+
+	/* reattach and detach without triggering optimization */
+	link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+				0, "/proc/self/exe", NULL, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	bpf_link__destroy(link);
+	if (!check_detach(uprobe_test, tramp))
+		goto cleanup;
+
+	uprobe_test();
+	ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+	/* reattach with triggering optimization */
+	link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+				0, "/proc/self/exe", NULL, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	check(skel, link, uprobe_test, uprobe_test, 4);
 
 	/* uretprobe.multi */
 	skel->bss->executed = 0;
@@ -539,6 +589,7 @@ static void test_uprobe_session(void)
 	);
 	struct bpf_link *link;
 	unsigned long offset;
+	void *tramp;
 
 	offset = get_uprobe_offset(&uprobe_test);
 	if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
@@ -558,7 +609,28 @@ static void test_uprobe_session(void)
 	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
 		goto cleanup;
 
-	check(skel, link, uprobe_test, uprobe_test, 4);
+	tramp = check(skel, link, uprobe_test, uprobe_test, 4);
+
+	/* reattach and detach without triggering optimization */
+	link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_session,
+				0, "/proc/self/exe", NULL, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	bpf_link__destroy(link);
+	if (!check_detach(uprobe_test, tramp))
+		goto cleanup;
+
+	uprobe_test();
+	ASSERT_EQ(skel->bss->executed, 4, "executed_no_probe");
+
+	/* reattach with triggering optimization */
+	link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_session,
+				0, "/proc/self/exe", NULL, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	check(skel, link, uprobe_test, uprobe_test, 8);
 
 cleanup:
 	uprobe_syscall_executed__destroy(skel);
@@ -568,7 +640,7 @@ static void test_uprobe_usdt(void)
 {
 	struct uprobe_syscall_executed *skel;
 	struct bpf_link *link;
-	void *addr;
+	void *addr, *tramp;
 
 	errno = 0;
 	addr = find_nop10(usdt_test);
@@ -587,7 +659,30 @@ static void test_uprobe_usdt(void)
 	if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
 		goto cleanup;
 
-	check(skel, link, usdt_test, addr, 2);
+	tramp = check(skel, link, usdt_test, addr, 2);
+
+	/* reattach and detach without triggering optimization */
+	link = bpf_program__attach_usdt(skel->progs.test_usdt,
+				-1 /* all PIDs */, "/proc/self/exe",
+				"optimized_uprobe", "usdt", NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+		goto cleanup;
+
+	bpf_link__destroy(link);
+	if (!check_detach(addr, tramp))
+		goto cleanup;
+
+	usdt_test();
+	ASSERT_EQ(skel->bss->executed, 2, "executed_no_probe");
+
+	/* reattach with triggering optimization */
+	link = bpf_program__attach_usdt(skel->progs.test_usdt,
+				-1 /* all PIDs */, "/proc/self/exe",
+				"optimized_uprobe", "usdt", NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+		goto cleanup;
+
+	check(skel, link, usdt_test, addr, 4);
 
 cleanup:
 	uprobe_syscall_executed__destroy(skel);
-- 
2.53.0


* [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: Jiri Olsa @ 2026-05-14 13:53 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-1-jolsa@kernel.org>

From: Andrii Nakryiko <andrii@kernel.org>

The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte
CALL to a trampoline. The CALL pushes a return address onto the stack at
[rsp-8], clobbering whatever was stored there.

On x86-64, the red zone is the 128 bytes below rsp that user code may use
for temporary storage without adjusting rsp. Compilers can place USDT
argument operands there, generating specs like "8@-8(%rbp)" when rbp ==
rsp. With the CALL-based optimization, the return address overwrites that
argument before the BPF-side USDT argument fetch runs.

Add two tests for this case. The uprobe_syscall subtest stores known values
at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop10 uprobe,
and verifies the red-zone data is still intact. The USDT subtest triggers a
probe in a function where the compiler places three USDT operands in the
red zone and verifies that all 10 optimized invocations deliver the expected
argument values to BPF.

On an unfixed kernel, the first hit goes through the INT3 path and later
hits use the optimized CALL path, so the red-zone checks fail after
optimization.
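
The effect can be modelled in plain user-space C, without any uprobe
involved. The sketch below is only a simulation under stated assumptions:
struct sim_stack, sim_call() and red_zone_clobbered() are made-up names,
the stack is a small array, and the "lea -0x80(%rsp), %rsp" step is
modelled as subtracting 128 from the simulated rsp before the CALL:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_SLOTS 64	/* 64 * 8 = 512 bytes of simulated stack */
#define RED_ZONE    128	/* x86-64 red zone size in bytes */

struct sim_stack {
	uint64_t mem[STACK_SLOTS];
	size_t rsp;	/* byte offset into mem, 8-byte aligned */
};

/* CALL pushes the return address: rsp -= 8, then store at [rsp]. */
static void sim_call(struct sim_stack *s, uint64_t ret_addr)
{
	assert(s->rsp >= 8 && s->rsp % 8 == 0);
	s->rsp -= 8;
	s->mem[s->rsp / 8] = ret_addr;
}

/*
 * Store a marker at -8(%rsp), then simulate the optimized probe.
 * Returns 1 if the marker was overwritten, 0 if it survived.
 */
static int red_zone_clobbered(int skip_red_zone)
{
	const uint64_t marker = 0x1111111111111111ULL;
	const size_t rsp0 = 256;	/* rsp of the probed code */
	struct sim_stack s;

	memset(&s, 0, sizeof(s));
	s.rsp = rsp0;
	s.mem[(rsp0 - 8) / 8] = marker;	/* movq %rax, -8(%rsp) */

	if (skip_red_zone)
		s.rsp -= RED_ZONE;	/* lea -0x80(%rsp), %rsp */
	sim_call(&s, 0xdeadbeefULL);	/* call <trampoline> */

	return s.mem[(rsp0 - 8) / 8] != marker;
}
```

In this model red_zone_clobbered(0) reports the marker as overwritten
(the bare CALL case), while red_zone_clobbered(1) leaves it intact.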

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
[ updates to use nop10 ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 75 +++++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/usdt.c | 49 ++++++++++++
 tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++++
 tools/testing/selftests/bpf/usdt_2.c          | 13 ++++
 4 files changed, 162 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 82b3c0ce9253..d553485e7db5 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -357,6 +357,48 @@ __nocf_check __weak void usdt_test(void)
 	USDT(optimized_uprobe, usdt);
 }
 
+/*
+ * Assembly-level red zone clobbering test. Stores known values in the
+ * red zone (below RSP), executes a nop10 (uprobe site), and checks that
+ * the values survived. Returns 0 if intact, 1 if clobbered.
+ *
+ * The nop5 optimization used CALL (which pushes a return address to
+ * [rsp-8]), so the value at -8(%rsp) was overwritten. The nop10 optimization
+ * should escape that by moving the stack pointer below the red zone before
+ * doing the CALL.
+ */
+__attribute__((aligned(16)))
+__nocf_check __weak __naked unsigned long uprobe_red_zone_test(void)
+{
+	asm volatile (
+		"movabs $0x1111111111111111, %%rax\n"
+		"movq   %%rax, -8(%%rsp)\n"
+		"movabs $0x2222222222222222, %%rax\n"
+		"movq   %%rax, -16(%%rsp)\n"
+		"movabs $0x3333333333333333, %%rax\n"
+		"movq   %%rax, -24(%%rsp)\n"
+
+		".byte 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10: uprobe site */
+
+		"movabs $0x1111111111111111, %%rax\n"
+		"cmpq   %%rax, -8(%%rsp)\n"
+		"jne    1f\n"
+		"movabs $0x2222222222222222, %%rax\n"
+		"cmpq   %%rax, -16(%%rsp)\n"
+		"jne    1f\n"
+		"movabs $0x3333333333333333, %%rax\n"
+		"cmpq   %%rax, -24(%%rsp)\n"
+		"jne    1f\n"
+
+		"xorl   %%eax, %%eax\n"
+		"retq\n"
+		"1:\n"
+		"movl   $1, %%eax\n"
+		"retq\n"
+		::: "rax", "memory"
+	);
+}
+
 static int find_uprobes_trampoline(void *tramp_addr)
 {
 	void *start, *end;
@@ -855,6 +897,37 @@ static void test_uprobe_race(void)
 #define __NR_uprobe 336
 #endif
 
+static void test_uprobe_red_zone(void)
+{
+	struct uprobe_syscall_executed *skel;
+	struct bpf_link *link;
+	void *nop10_addr;
+	size_t offset;
+	int i;
+
+	nop10_addr = find_nop10(uprobe_red_zone_test);
+	if (!ASSERT_NEQ(nop10_addr, NULL, "find_nop10"))
+		return;
+
+	skel = uprobe_syscall_executed__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	offset = get_uprobe_offset(nop10_addr);
+	link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+			0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_uprobe"))
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		ASSERT_EQ(uprobe_red_zone_test(), 0, "red_zone_intact");
+
+	bpf_link__destroy(link);
+
+cleanup:
+	uprobe_syscall_executed__destroy(skel);
+}
+
 static void test_uprobe_error(void)
 {
 	long err = syscall(__NR_uprobe);
@@ -881,6 +954,8 @@ static void __test_uprobe_syscall(void)
 		test_uprobe_usdt();
 	if (test__start_subtest("uprobe_race"))
 		test_uprobe_race();
+	if (test__start_subtest("uprobe_red_zone"))
+		test_uprobe_red_zone();
 	if (test__start_subtest("uprobe_error"))
 		test_uprobe_error();
 	if (test__start_subtest("uprobe_regs_equal"))
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index be34c4087ff5..606601ccdc42 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -250,6 +250,7 @@ static void subtest_basic_usdt(bool optimized)
 #ifdef __x86_64__
 extern void usdt_1(void);
 extern void usdt_2(void);
+extern void usdt_red_zone_trigger(void);
 
 static unsigned char nop1[1] = { 0x90 };
 static unsigned char nop1_nop10_combo[11] = { 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
@@ -340,6 +341,52 @@ static void subtest_optimized_attach(void)
 cleanup:
 	test_usdt__destroy(skel);
 }
+
+/*
+ * Test that USDT arguments survive nop10 optimization in a function where
+ * the compiler places operands in the red zone.
+ *
+ * Signal handlers are prone to having the compiler place USDT argument
+ * operands in the red zone (below rsp).
+ *
+ * The nop5 optimization used CALL (which pushes a return address to
+ * [rsp-8]), so the value at -8(%rsp) was overwritten. The nop10 optimization
+ * should escape that by moving the stack pointer below the red zone before
+ * doing the CALL.
+ */
+static void subtest_optimized_red_zone(void)
+{
+	struct test_usdt *skel;
+	int i;
+
+	skel = test_usdt__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	skel->bss->expected_arg[0] = 0xDEADBEEF;
+	skel->bss->expected_arg[1] = 0xCAFEBABE;
+	skel->bss->expected_arg[2] = 0xFEEDFACE;
+	skel->bss->expected_pid = getpid();
+
+	skel->links.usdt_check_arg = bpf_program__attach_usdt(
+		skel->progs.usdt_check_arg, 0, "/proc/self/exe",
+		"optimized_attach", "usdt_red_zone", NULL);
+	if (!ASSERT_OK_PTR(skel->links.usdt_check_arg, "attach_usdt_red_zone"))
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		usdt_red_zone_trigger();
+
+	ASSERT_EQ(skel->bss->arg_total, 10, "arg_total");
+	ASSERT_EQ(skel->bss->arg_bad, 0, "arg_bad");
+	ASSERT_EQ(skel->bss->arg_last[0], 0xDEADBEEF, "arg_last_1");
+	ASSERT_EQ(skel->bss->arg_last[1], 0xCAFEBABE, "arg_last_2");
+	ASSERT_EQ(skel->bss->arg_last[2], 0xFEEDFACE, "arg_last_3");
+
+cleanup:
+	test_usdt__destroy(skel);
+}
+
 #endif
 
 unsigned short test_usdt_100_semaphore SEC(".probes");
@@ -613,6 +660,8 @@ void test_usdt(void)
 		subtest_basic_usdt(true);
 	if (test__start_subtest("optimized_attach"))
 		subtest_optimized_attach();
+	if (test__start_subtest("optimized_red_zone"))
+		subtest_optimized_red_zone();
 #endif
 	if (test__start_subtest("multispec"))
 		subtest_multispec_usdt();
diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
index f00cb52874e0..0ee78fb050a1 100644
--- a/tools/testing/selftests/bpf/progs/test_usdt.c
+++ b/tools/testing/selftests/bpf/progs/test_usdt.c
@@ -149,5 +149,30 @@ int usdt_executed(struct pt_regs *ctx)
 		executed++;
 	return 0;
 }
+
+int arg_total;
+int arg_bad;
+long arg_last[3];
+long expected_arg[3];
+int expected_pid;
+
+SEC("usdt")
+int BPF_USDT(usdt_check_arg, long arg1, long arg2, long arg3)
+{
+	if (expected_pid != (bpf_get_current_pid_tgid() >> 32))
+		return 0;
+
+	__sync_fetch_and_add(&arg_total, 1);
+	arg_last[0] = arg1;
+	arg_last[1] = arg2;
+	arg_last[2] = arg3;
+
+	if (arg1 != expected_arg[0] ||
+	    arg2 != expected_arg[1] ||
+	    arg3 != expected_arg[2])
+		__sync_fetch_and_add(&arg_bad, 1);
+
+	return 0;
+}
 #endif
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index b359b389f6c0..5e38f8605b02 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -13,4 +13,17 @@ void usdt_2(void)
 	USDT(optimized_attach, usdt_2);
 }
 
+static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
+static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
+static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
+
+void __attribute__((noinline)) usdt_red_zone_trigger(void)
+{
+	unsigned long a1 = usdt_red_zone_arg1;
+	unsigned long a2 = usdt_red_zone_arg2;
+	unsigned long a3 = usdt_red_zone_arg3;
+
+	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
+}
+
 #endif
-- 
2.53.0


* [RFC PATCH v2.1 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
for multiple access event capturing primitives (e.g., page faults and
PMU), and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on making huge pages efficient want to know how much of the
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of the
DAMON-found hot/cold regions belong to specific cgroups.

For this user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be always enabled on all machines running real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines of their fleet.  If the monitoring runs long
enough and the fleet is big enough, it should provide statistically
meaningful data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes, in addition to the data
access, they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be applicable to any memory, and determine if the given
memory has the appropriate data attribute.  E.g., whether the memory at
physical address 42 belongs to cgroup A.  Each 'data probe' is configured
with filters that are very similar to the DAMOS filters.

When DAMON checks whether the memory at each region's sampling address
has been accessed since the last check, it also applies the data probes,
if any are registered.  Similar to the accounting of the number of access
check-positive samples (nr_accesses), it accounts the number of samples
that are positive for each data probe in another per-region counters
array, namely 'probe_hits'.  When DAMON resets nr_accesses every
aggregation interval, it resets 'probe_hits' together.
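
The accounting described above can be sketched in user-space C as below.
This is only an illustration, not the code of this series: the sketch_*
names and the function pointer based probe type are made up for the
example, and the array size of four follows the current limit on the
number of data probes mentioned later in this letter:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_PROBES 4	/* assumed from the current four probes limit */

/* Per-region counters; a stand-in for the real damon_region fields. */
struct sketch_region {
	unsigned int nr_accesses;
	unsigned int probe_hits[SKETCH_MAX_PROBES];
};

/* Hypothetical probe: tells if the sampled address has the attribute. */
typedef bool (*sketch_probe_fn)(unsigned long addr);

/* Called once per sampling interval with the region's sampling address. */
static void sketch_sample(struct sketch_region *r, unsigned long addr,
			  bool accessed, const sketch_probe_fn *probes,
			  int nr_probes)
{
	int i;

	assert(nr_probes >= 0 && nr_probes <= SKETCH_MAX_PROBES);
	if (accessed)
		r->nr_accesses++;
	/* The same sample is reused for every registered probe. */
	for (i = 0; i < nr_probes; i++)
		if (probes[i](addr))
			r->probe_hits[i]++;
}

/* Called once per aggregation interval: both counters reset together. */
static void sketch_reset(struct sketch_region *r)
{
	r->nr_accesses = 0;
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}

/* Two toy probes, purely illustrative. */
static bool sketch_probe_even_page(unsigned long addr)
{
	return ((addr >> 12) & 1) == 0;
}

static bool sketch_probe_always(unsigned long addr)
{
	(void)addr;
	return true;
}
```

The point of the sketch is that the probes add no extra sampling: each
probe only does one more check on the address that is already being
checked for the access.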

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how much of the hot/cold memory regions have the data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of that cgroup A hot memory
is backed by huge pages.
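
One possible way for user space to turn such a snapshot into a percentage
like the above is sketched below.  The estimation rule is an assumption
of this example, not something this series defines: each region's bytes
are weighted by the fraction of its samples that hit the probe, and a
region is called hot if its nr_accesses reaches a caller-chosen
threshold.  All sketch_* names are hypothetical:

```c
#include <assert.h>

/* One region as read by user space just before the counters reset. */
struct sketch_snapshot {
	unsigned long sz;		/* region size in bytes */
	unsigned int nr_accesses;
	unsigned int probe_hits;	/* hits of the one probe of interest */
};

/*
 * Estimate the share of hot memory having the probed attribute.
 * 'samples_per_aggr' is the number of samples taken per aggregation
 * interval, 'hot_thres' the minimum nr_accesses for calling a region hot.
 * Returns a fraction in [0, 1], or 0 if there is no hot memory.
 */
static double sketch_hot_share(const struct sketch_snapshot *regs, int nr,
			       unsigned int samples_per_aggr,
			       unsigned int hot_thres)
{
	double hot = 0.0, hot_with_attr = 0.0;
	int i;

	assert(samples_per_aggr > 0);
	for (i = 0; i < nr; i++) {
		if (regs[i].nr_accesses < hot_thres)
			continue;
		hot += (double)regs[i].sz;
		hot_with_attr += (double)regs[i].sz * regs[i].probe_hits /
			samples_per_aggr;
	}
	return hot > 0.0 ? hot_with_attr / hot : 0.0;
}
```

Being sampling based, the result is an estimate; the DAMOS-based page
level monitoring [1] remains the deterministic reference.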

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces the data probe data structure,
namely damon_probe.  Patch 2 extends damon_ctx for installing data
probes.  Patch 3 introduces another data structure for the filters of
each data probe, namely damon_filter.  Patch 4 updates the damon_ctx
commit function to handle the probes.  Patch 5 extends damon_region for
the per-region per-probe positive samples counter, namely probe_hits.
Patch 6 extends damon_operations for applying probes on the underlying
DAMON operations implementation.  Patch 7 updates kdamond_fn() to invoke
the probes applying callback.  Patch 8 finally implements the probes
support on paddr ops.

Ten changes for the user interface (patches 9-18) come next.  Patches
9-13 implement sysfs directories and files for setting data probes,
namely probes directory, probe directory, filters directory, filter
directory and filter directory internal files, respectively.  Patch 14
connects the user inputs that are made via the sysfs files to DAMON core.
The following three patches (patches 15-17) implement sysfs directories
and files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
to another series in future.  Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 adds the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 moves code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows page properties monitoring with overhead that is low
enough to be always enabled on real world workloads.  Because the
sampling time for the access check is reused for the data attributes
check, the upper-bounded and best-effort minimum overhead of DAMON is
kept.  Because the sampling memory for the access check is reused for the
data attributes check, the additional overhead is minimal.

DAMOS-based page level properties monitoring should still be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.

Plan for Dropping RFC tag
=========================

I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPU/thread/read/write monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of that work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism currently works based on the access
information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as that special.  There might be some users
who are not interested in access at all but want to know the location of
memory of a specific type.  The data probes interface will allow doing
that.  Further, we could extend the interface to let users set any data
attribute as the 'primary' attribute.  Then, DAMON will split and merge
regions in a way that can best present the 'primary' attribute.

DAMOS will also be extended to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  48 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  67 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 197 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 222 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1284 insertions(+), 51 deletions(-)


base-commit: 678b6bc7ce120b8c51d4e05fcb8eb0a92f9be3f6
-- 
2.47.3


* [RFC PATCH v2.1 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  9 +++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..2b96a03876034 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT_CONDITION(damon_aggregated_v2,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_CONDITION(nr_probes > 0),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fe6c789f2cecb..0ad1a0af06893 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1905,6 +1905,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	if (trace_damon_aggregated_v2_enabled()) {
+		damon_for_each_probe(probe, c)
+			nr_probes++;
+	}
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1913,6 +1920,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_aggregated_v2(ti, r, damon_nr_regions(t),
+					nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3
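
For readers who want to consume the new event from user space, the
following is a minimal sketch of parsing the payload that the TP_printk()
format above produces.  It is not part of the patch.  It assumes that
__print_hex() renders the array as space-separated two-digit hex bytes,
and it borrows the four-probe limit from the cover letter; struct
agg_v2_record and parse_agg_v2() are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_PROBES 4	/* cover letter: at most four data probes */

/* Hypothetical user-space mirror of one damon_aggregated_v2 event. */
struct agg_v2_record {
	unsigned long target_id, start, end;
	unsigned int nr_regions, nr_accesses, age;
	unsigned char probe_hits[MAX_PROBES];
	int nr_probes;
};

/* Returns 0 on success, -1 if the payload does not match the format. */
int parse_agg_v2(const char *payload, struct agg_v2_record *rec)
{
	int off = -1;	/* set by %n only if "probe_hits=" was matched */
	const char *p;
	char *endp;

	if (sscanf(payload,
		   "target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%n",
		   &rec->target_id, &rec->nr_regions, &rec->start, &rec->end,
		   &rec->nr_accesses, &rec->age, &off) != 6 || off < 0)
		return -1;

	rec->nr_probes = 0;
	for (p = payload + off; rec->nr_probes < MAX_PROBES; p = endp) {
		unsigned long byte = strtoul(p, &endp, 16);

		if (endp == p)		/* no more hex bytes */
			break;
		if (byte > 0xff)	/* not a single byte: unexpected layout */
			return -1;
		rec->probe_hits[rec->nr_probes++] = (unsigned char)byte;
	}
	return 0;
}
```

The function takes only the event payload, that is, the text after the
event name in a trace line.  If the hex bytes turn out to be rendered
without separators, the byte range check makes the parser fail rather
than return wrong counts.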


* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-05-14 14:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
	linux-arch, linux-mm, linux-trace-kernel, kernel-team,
	Paul E. McKenney
In-Reply-To: <20260513114102.50f4ca68@gandalf.local.home>

On Wed, May 13, 2026 at 11:41:02AM -0400, Steven Rostedt wrote:
> On Tue,  5 May 2026 17:09:34 +0000
> Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> > Use the arch-overridable queued_spin_release(), introduced in the
> > previous commit, to ensure the tracepoint works correctly across all
> 
> Remove the ", introduced in the previous commit," That's useless in git
> change logs.

Thanks for the suggestion, will do here and in other places.

[...]

> >  /**
> >   * queued_spin_unlock - unlock a queued spinlock
> >   * @lock : Pointer to queued spinlock structure
> > + *
> > + * Generic tracing wrapper around the arch-overridable
> > + * queued_spin_release().
> >   */
> >  static __always_inline void queued_spin_unlock(struct qspinlock *lock)
> >  {
> > +	/*
> > +	 * Trace and release are combined in queued_spin_release_traced() so
> > +	 * the compiler does not need to preserve the lock pointer across the
> > +	 * function call, avoiding callee-saved register save/restore on the
> > +	 * hot path.
> > +	 */
> > +	if (tracepoint_enabled(contended_release)) {
> > +		queued_spin_release_traced(lock);
> > +		return;
> 
> Get rid of the "return;". What does it save you? It just makes it that you
> need to duplicate the code. Even though it's a one liner, it can cause bugs
> in the future if this changes. You could call the function:
> 
>   do_trace_queued_spin_release_traced(lock);
> 
> 
> > +	}
> >  	queued_spin_release(lock);
> >  }
> >  
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index af8d122bb649..649fdca69288 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -104,6 +104,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
> >  #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
> >  #endif
> >  
> > +void __lockfunc queued_spin_release_traced(struct qspinlock *lock)
> > +{
> > +	if (queued_spin_is_contended(lock))
> > +		trace_call__contended_release(lock);
> > +	queued_spin_release(lock);
> 
> And then remove the duplicate call of "queued_spin_release()" here.

This is the scenario the comment above the static branch describes.
Here's what it looks like in practice on x86_64 (defconfig, compiled
with GCC 11).

Current design (trace + unlock combined, with return):
  
    endbr64
    xchg %ax,%ax                     ; NOP (static branch)
    movb $0x0,(%rdi)                 ; unlock
    decl %gs:__preempt_count
    je   preempt
    jmp  __x86_return_thunk
    call queued_spin_release_traced  ; cold
    jmp  preempt_handling            ; cold
    call __SCT__preempt_schedule
    jmp  __x86_return_thunk

With the trace-only function (no return, unlock after the call):
  
    endbr64
    push %rbx                        ; saves callee-saved rbx (!)
    mov  %rdi,%rbx                   ; preserve lock across call (!)
    xchg %ax,%ax                     ; NOP (static branch)
    movb $0x0,(%rbx)                 ; unlock
    decl %gs:__preempt_count
    je   preempt
    pop  %rbx                        ; callee-saved restore (!)
    jmp  __x86_return_thunk
    call queued_spin_release_traced  ; cold
    jmp  unlock                      ; cold
    call __SCT__preempt_schedule
    pop  %rbx
    jmp  __x86_return_thunk

Three extra instructions marked by "!" on the hot path (push, mov, pop),
all wasted when the tracepoint is off. That's the main reason for
combining trace and unlock in the same out-of-line function.
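
The shape being defended can be modelled in plain user-space C.  This is
only an illustrative sketch: struct model_lock, tracing_on, nr_traced and
the model_*() functions are hypothetical stand-ins for the qspinlock, the
static branch and the tracepoint.  It shows the control flow only and
cannot reproduce the register allocation difference shown above.

```c
#include <stdbool.h>

/* Hypothetical lock model: locked flag plus a count of waiters. */
struct model_lock {
	int locked;
	int waiters;
};

bool tracing_on;	/* stands in for the tracepoint static branch */
int nr_traced;		/* counts emitted "contended_release" events */

void model_release(struct model_lock *lock)
{
	lock->locked = 0;
}

/*
 * Out-of-line slow path: trace and release together, so the caller
 * never needs the lock pointer again after this call.
 */
__attribute__((noinline))
void model_release_traced(struct model_lock *lock)
{
	if (lock->waiters)
		nr_traced++;
	model_release(lock);
}

/* The kernel version is __always_inline; a plain function is used here. */
void model_unlock(struct model_lock *lock)
{
	if (__builtin_expect(tracing_on, 0)) {
		model_release_traced(lock);
		return;		/* lock is dead here: nothing to preserve */
	}
	model_release(lock);
}
```

With the early return, the traced call is the last use of the lock
pointer on that path, which is what lets the compiler avoid keeping it
in a callee-saved register.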


* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-14 14:37 UTC (permalink / raw)
  To: Lance Yang
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260514132830.25622-1-lance.yang@linux.dev>

On Thu, May 14, 2026 at 09:28:30PM +0800, Lance Yang wrote:
> 
> On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
> >get_any_page() collapses three different failure modes into a single
> >-EIO return:
> >
> >  * the put_page race in the !count_increased path;
> >  * the HWPoisonHandlable() rejection that bounces out of
> >    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
> >  * the HWPoisonHandlable() rejection that goes through the
> >    count_increased / put_page / shake_page retry loop.
> >
> >The first is transient (the page is racing with the allocator).  The
> >second can be either transient (a userspace folio briefly off LRU
> >during migration/compaction) or stable (slab/vmalloc/page-table/
> >kernel-stack pages).  The third describes a stable kernel-owned page
> >that the count_increased=true caller already held a reference on.
> >
> >Distinguish them on the return path: keep -EIO for both the put_page
> >race and the -EBUSY-after-retries branch (shake_page() cannot drag a
> >folio back from active migration, so we cannot prove the page is
> >permanently kernel-owned from there), keep -EBUSY for the allocation
> >race (unchanged), and return -ENOTRECOVERABLE only from the
> >count_increased-true HWPoisonHandlable() rejection that exhausts its
> >retries -- the caller's reference is structural evidence that the
> >page is owned by the kernel.
> >
> >Extend the unhandlable-page pr_err() to fire for either errno and
> >update the get_hwpoison_page() kerneldoc.
> >
> >memory_failure() still folds every negative return into
> >MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> >this patch is a no-op for users of memory_failure() and only changes
> >the errno that soft_offline_page() can propagate to its callers.  A
> >follow-up wires the new return code through memory_failure() and
> >reports MF_MSG_KERNEL for the unrecoverable cases.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Signed-off-by: Breno Leitao <leitao@debian.org>
> >---
> > mm/memory-failure.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >index 49bcfbd04d213..bae883df3ccb2 100644
> >--- a/mm/memory-failure.c
> >+++ b/mm/memory-failure.c
> >@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> > 				shake_page(p);
> > 				goto try_again;
> > 			}
> >+			/*
> >+			 * Return -EIO rather than -ENOTRECOVERABLE: this
> >+			 * branch is also reached for pages that are merely
> >+			 * off-LRU transiently (e.g. a folio in the middle
> >+			 * of migration or compaction), which shake_page()
> >+			 * cannot drag back.  The caller cannot prove the
> >+			 * page is permanently kernel-owned from here, so
> >+			 * keep it on the recoverable errno.
> >+			 */
> > 			ret = -EIO;
> > 			goto out;
> > 		}
> >@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> > 			goto try_again;
> > 		}
> > 		put_page(p);
> >-		ret = -EIO;
> >+		ret = -ENOTRECOVERABLE;
> > 	}
> > out:
> >-	if (ret == -EIO)
> >+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> > 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> > 
> > 	return ret;
> >@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
> >  *         -EIO for pages on which we can not handle memory errors,
> >  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
> >  *         operations like allocation and free,
> >- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> >+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
> >+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
> >+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
> >+ *         kernel stacks, and similar non-LRU/non-buddy pages).
> 
> Did you test this patch series? I don't see how we ever get to
> -ENOTRECOVERABLE there ...

Yes, I did. I am using the following test case:

https://github.com/leitao/linux/commit/cfebe84ddeab5ac34ed456331db980d57e7025dc

	# RUN_DESTRUCTIVE=1 tools/testing/selftests/mm/hwpoison-panic.sh
	# enabling /proc/sys/vm/panic_on_unrecoverable_memory_failure
	# injecting hwpoison at phys 0x2a00000 (Kernel rodata)
	# expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'
	[  501.113256] Memory failure: 0x2a00: recovery action for reserved kernel page: Ignored
	[  501.113956] Kernel panic - not syncing: Memory failure: 0x2a00: unrecoverable page


> Even with MF_COUNT_INCREASED, the first pass does:
> 
> 	if (flags & MF_COUNT_INCREASED)
> 		count_increased = true;
> 
> 	[...]
> 
> 	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
> 		ret = 1;
> 	} else {
> 		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
> 			put_page(p);
> 			shake_page(p);
> 			count_increased = false;
> 			goto try_again; <-
> 		}
> 		put_page(p);
> 		ret = -ENOTRECOVERABLE;
> 	}
> 
> Then we come back with count_increased=false:
> 
> try_again:
> 	if (!count_increased) {
> 		ret = __get_hwpoison_page(p, flags); <-
> 		if (!ret) {
> 		[...]
> 		} else if (ret == -EBUSY) { <-
> 		[...]
> 			ret = -EIO;
> 			goto out; <-
> 		}
> 	}
> 
> For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
> 
> 	if (!HWPoisonHandlable(&folio->page, flags))
> 		return -EBUSY;
> 
> so they still seem to end up as -EIO ... Am I missing something?

You are not, and thanks for catching this. I traced it again and the
-ENOTRECOVERABLE branch is unreachable for slab/vmalloc/page-table pages
exactly as you described. The __get_hwpoison_page() → -EBUSY → shake → retry
loop catches them first and they exit as -EIO.

The selftest I am using (link above) only validated the PageReserved
short-circuit added in patch 3, which lives in memory_failure() and never
reaches get_any_page().

I did think about this code path and was not convinced we should return
-ENOTRECOVERABLE there, so I documented the following (as in the current
patch):

	@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
			shake_page(p);
			goto try_again;
		}
	+            /*
	+             * Return -EIO rather than -ENOTRECOVERABLE: this
	+             * branch is also reached for pages that are merely
	+             * off-LRU transiently (e.g. a folio in the middle
	+             * of migration or compaction), which shake_page()
	+             * cannot drag back.  The caller cannot prove the
	+             * page is permanently kernel-owned from here, so
	+             * keep it on the recoverable errno.
	+             */
		ret = -EIO;
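
The control flow discussed in this thread can be checked with a small
user-space model.  This is a simplified sketch, not the kernel function:
it only covers a page that HWPoisonHandlable() always rejects, and
MODEL_MAX_RETRY and model_get_any_page() are hypothetical names (a single
retry limit stands in for the limits used by the real branches).

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_RETRY 3	/* hypothetical stand-in for the retry limits */

/*
 * Models get_any_page() for a page that is never handlable
 * (for example slab, vmalloc or page-table pages).
 */
int model_get_any_page(bool mf_count_increased)
{
	bool count_increased = mf_count_increased;
	int pass = 0;

try_again:
	if (!count_increased) {
		/* __get_hwpoison_page() returns -EBUSY for such pages. */
		if (pass++ < MODEL_MAX_RETRY)
			goto try_again;		/* shake_page() and retry */
		return -EIO;
	}

	/* count_increased path: HWPoisonHandlable() rejects the page. */
	if (pass++ < MODEL_MAX_RETRY) {
		/* put_page(), shake_page(), then drop to the other path */
		count_increased = false;
		goto try_again;
	}
	return -ENOTRECOVERABLE;
}
```

Because pass starts at zero, the first rejection on the count_increased
path always clears count_increased, so in this model the function
returns -EIO whether or not the caller passed MF_COUNT_INCREASED.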


* Re: [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: bot+bpf-ci @ 2026-05-14 14:55 UTC (permalink / raw)
  To: jolsa, oleg, peterz, mingo, mhiramat, andrii
  Cc: bpf, linux-trace-kernel, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260514135342.22130-3-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 630 bytes --]

>     libbpf: Change has_nop_combo to work on top of nop10
>
>     We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
>     fixing has_nop_combo to reflect that.
>
>     Signed-off-by: Jiri Olsa <jolsa@kernel.org>

This looks like a bug fix for kernel/libbpf integration. Should this
include:

  Fixes: 9734c3ac2f23 ("uprobes/x86: Move optimized uprobe from nop5 to nop10")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25864669093


* Re: [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: bot+bpf-ci @ 2026-05-14 14:55 UTC (permalink / raw)
  To: jolsa, oleg, peterz, mingo, mhiramat, andrii
  Cc: bpf, linux-trace-kernel, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260514135342.22130-8-jolsa@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 82b3c0ce9253..d553485e7db5 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> index be34c4087ff5..606601ccdc42 100644
> --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
> index f00cb52874e0..0ee78fb050a1 100644
> --- a/tools/testing/selftests/bpf/progs/test_usdt.c
> +++ b/tools/testing/selftests/bpf/progs/test_usdt.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> index b359b389f6c0..5e38f8605b02 100644
> --- a/tools/testing/selftests/bpf/usdt_2.c
> +++ b/tools/testing/selftests/bpf/usdt_2.c
> @@ -13,4 +13,17 @@ void usdt_2(void)
>  	USDT(optimized_attach, usdt_2);
>  }
>
> +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> +
> +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> +{
> +	unsigned long a1 = usdt_red_zone_arg1;
> +	unsigned long a2 = usdt_red_zone_arg2;
> +	unsigned long a3 = usdt_red_zone_arg3;
> +
> +	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
> +}
> +
>  #endif

Is this test reliable across different compiler versions and optimizations?

The USDT macro uses the "nor" inline assembly constraint, which allows the
compiler to allocate these operands to CPU registers. Because there is no
register pressure in usdt_red_zone_trigger(), modern compilers might allocate
a1, a2, and a3 directly to registers rather than the stack's red zone.

If the operands are placed in registers, the uprobe CALL optimization
clobbering [rsp-8] will not corrupt them. This could cause the test to pass
even on buggy kernels, creating a false positive.

Would it be safer to explicitly force these operands into the red zone using
inline assembly constraints, rather than depending on the compiler's register
allocator?
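
One possible direction, shown only as a hedged sketch: a GNU C "m"
constraint requires a memory operand, so the compiler must give each
local a stack slot instead of keeping it in a register.  Whether that
slot lands in the red zone still depends on the target ABI and on the
compiler's frame layout, so this does not by itself guarantee red zone
placement.  The function name red_zone_candidate() is hypothetical and
is not part of the selftest or of the USDT macro.

```c
/*
 * Hypothetical helper: pins a1 and a2 to memory with "m" constraints,
 * then returns their sum.  The empty asm emits no instructions.
 */
__attribute__((noinline))
unsigned long red_zone_candidate(unsigned long in1, unsigned long in2)
{
	unsigned long a1 = in1;
	unsigned long a2 = in2;

	/* "m" forces memory operands, so a1/a2 cannot live only in registers. */
	__asm__ volatile("" : : "m"(a1), "m"(a2) : "memory");

	return a1 + a2;
}
```

In a leaf function like this one, compilers for x86-64 commonly place
such locals below the stack pointer, but inspecting the generated code
is the only way to confirm it for a given compiler and option set.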

Reference: https://lore.kernel.org/bpf/20260509021238.AE98FC2BCB0@smtp.kernel.org/
Raised by: sashiko-bot@kernel.org on Sat, 09 May 2026


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25864669093


