Linux Trace Kernel

* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
From: Nico Pache @ 2026-05-26 12:00 UTC
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Bagas Sanjaya
In-Reply-To: <94f759f8-e2ed-4f22-b9e7-4693ad005509@kernel.org>

On Fri, May 22, 2026 at 3:59 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
>
> >
> >  process THP controls
> > @@ -264,11 +265,6 @@ support the following arguments::
> >  Khugepaged controls
> >  -------------------
> >
> > -.. note::
> > -   khugepaged currently only searches for opportunities to collapse to
> > -   PMD-sized THP and no attempt is made to collapse to other THP
> > -   sizes.
>
> Should we maybe leave this here and clarify that for file/shmem, it will still
> only collapse to PMD-sized THPs?

Ah yes, that would be a good idea. I'll send a fixup!

Thank you :)

>
> --
> Cheers,
>
> David
>



* [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-26 11:52 UTC
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li,
	kernel test robot
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst. Documents:
  - Reset requires tracing to be stopped first
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin
  - Counter naming: successes (events served), drops, success_rate
  - Gravestone amplification when the pool is exhausted

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes and zero drops
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active
  Test reads trace contents BEFORE switching back to the nop tracer
  (tracer_init() unconditionally calls tracing_reset_online_cpus(),
  which would empty the ring buffer). The function:tracer dependency
  is declared in '# requires:' so ftracetest skips on kernels without
  CONFIG_FUNCTION_TRACER instead of failing spuriously. An EXIT trap
  restores options/stackmap and options/stacktrace on any exit path.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode
  - Top-N filtering by ref_count

Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x464D5342, ASCII 'FMSB' when
read most significant byte first).

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 162 ++++++++++++++++++
 Documentation/trace/index.rst                 |   1 +
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++++++++++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +++++
 tools/tracing/stackmap_dump.py                | 150 ++++++++++++++++
 5 files changed, 458 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py
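
Note on the binary format: the layout described in the commit message
can be exercised without a kernel. The following is a minimal sketch and
not part of the patch. pack_stackmap(), detect_endian() and
unpack_stackmap() are hypothetical helper names; the field order is taken
from the documentation added below (header: magic, version, nr_stacks,
reserved; per stack: stack_id, nr, ref_count, reserved, then nr 64-bit
addresses). It builds a blob in either byte order and reads it back by
probing the magic value.

```python
import struct

MAGIC = 0x464D5342   # value used by the patch
VERSION = 2          # format version stated in the documentation
HEADER_SIZE = 16     # 4 x u32
ENTRY_SIZE = 16      # 4 x u32


def pack_stackmap(stacks, endian='<'):
    """Build a blob from a list of (stack_id, ref_count, [ips]) tuples."""
    out = struct.pack(f'{endian}IIII', MAGIC, VERSION, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        out += struct.pack(f'{endian}IIII', stack_id, len(ips), ref_count, 0)
        out += struct.pack(f'{endian}{len(ips)}Q', *ips)
    return out


def detect_endian(blob):
    """Return '<' or '>' depending on which byte order yields the magic."""
    for endian in ('<', '>'):
        if struct.unpack_from(f'{endian}I', blob, 0)[0] == MAGIC:
            return endian
    raise ValueError("bad magic")


def unpack_stackmap(blob):
    """Parse a blob back into a list of (stack_id, ref_count, [ips])."""
    endian = detect_endian(blob)
    _, version, nr_stacks, _ = struct.unpack_from(f'{endian}IIII', blob, 0)
    if version != VERSION:
        raise ValueError(f"unsupported version: {version}")
    stacks = []
    offset = HEADER_SIZE
    for _ in range(nr_stacks):
        if offset + ENTRY_SIZE > len(blob):
            break  # truncated snapshot: keep what was read so far
        stack_id, nr, ref_count, _ = struct.unpack_from(
            f'{endian}IIII', blob, offset)
        offset += ENTRY_SIZE
        if offset + nr * 8 > len(blob):
            break
        ips = struct.unpack_from(f'{endian}{nr}Q', blob, offset)
        offset += nr * 8
        stacks.append((stack_id, ref_count, list(ips)))
    return stacks
```

Because the four bytes of the magic value do not read the same in both
directions, exactly one of the two byte orders can match, which is what
makes the detection unambiguous.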

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..191347be3664
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+  (default: 14 → 16384 stacks; valid range: 10-18).
+
+  At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+  for the element pool. Each ``open()`` of ``stack_map_bin`` may
+  briefly allocate a similar amount for a snapshot. The cap is set
+  intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+    echo 1 > /sys/kernel/debug/tracing/options/stackmap
+    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+    echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+    sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+    cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+    stack_id 42 [ref 1337, depth 8]
+      [0] schedule+0x48/0xc0
+      [1] schedule_timeout+0x1c/0x30
+      ...
+
+To view statistics::
+
+    cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+    entries:      2500 / 16384
+    table_size:   32768
+    successes:    148923
+    drops:        0
+    success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+    echo 0 > /sys/kernel/debug/tracing/tracing_on
+    echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+    trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries (only when tracing
+    is stopped).
+
+``stack_map_stat``
+    Statistics: entries (allocated unique stacks), table_size,
+    successes (events served), drops (events that fell back to
+    full-stack recording), and success_rate. Drops accumulate when
+    the element pool is exhausted; once that happens, slots that
+    won the cmpxchg but failed to allocate an element remain
+    "claimed but empty" and increase probe pressure for any future
+    insert hashing to the same bucket. Reset (when tracing is
+    stopped) clears these gravestones.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    All fields are written in the kernel's native byte order.
+    Userspace tools detect endianness by reading the magic value.
+    Magic: ``0x464D5342`` (ASCII 'FMSB', most significant byte first), Version: 2.
+
+    The export is a best-effort snapshot allocated at ``open()``;
+    stacks inserted while the snapshot is being taken may be omitted
+    from it. A bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+  length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+  confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race to insert
+the same stack (and therefore compute the same ``key_hash``), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
    ftrace
    ftrace-design
    ftrace-uses
+   ftrace-stackmap
    kprobes
    kprobetrace
    fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..18fa998ae460
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+#    tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and zero drops
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map      || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin  || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+    fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(awk '/^entries:/ {print $2}' stack_map_stat)
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(awk '/^successes:/ {print $2}' stack_map_stat)
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+    fail "stackmap has zero successes"
+fi
+
+drops=$(awk '/^drops:/ {print $2}' stack_map_stat)
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+    fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(awk '/^stack_id/ {print $2; exit}' stack_map)
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+    disable_tracing
+    fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(awk '/^entries:/ {print $2}' stack_map_stat)
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..49848eac2624
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,42 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+#    options/stackmap or stack_map* nodes.
+
+fail() {
+    echo "FAIL: $1"
+    rmdir instances/test_stackmap_gate 2>/dev/null
+    exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map        || fail "stack_map missing on global instance"
+test -e stack_map_stat   || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin    || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+#    or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+    fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+    if [ -e instances/test_stackmap_gate/$f ]; then
+        fail "secondary instance unexpectedly has $f"
+    fi
+done
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fcd8ddcd97de
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # ASCII 'FMSB', most significant byte first
+HEADER_SIZE = 16  # 4 x u32
+ENTRY_SIZE = 16   # 4 x u32
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version != 2:
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ips)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1
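
Note on the Design section of the documentation above: the interplay
between the 2x over-provisioned table, the bounded probe length and the
"claimed but empty" slots is easier to follow with a toy model. The
sketch below is single-threaded Python and not the kernel code: it does
not simulate cmpxchg races, Python's hash() stands in for the seeded
32-bit jhash, the MAX_PROBE value is an assumption (the series only
names FTRACE_STACKMAP_MAX_PROBE), and counting a probe-limit miss as a
drop is likewise an assumption. StackMapModel is a hypothetical name.

```python
from typing import List, Sequence

MAX_PROBE = 8          # assumed value; the real constant is not shown here
GRAVESTONE = object()  # slot claimed, but no pool element was available


class StackMapModel:
    """Single-threaded model of the stack map described in the docs."""

    def __init__(self, bits: int) -> None:
        self.capacity = 1 << bits            # element pool: 2^bits stacks
        self.table_size = 2 * self.capacity  # 2x over-provisioned table
        self.slots: List[object] = [None] * self.table_size
        self.entries = 0
        self.successes = 0
        self.drops = 0

    def get_id(self, stack: Sequence[int]) -> int:
        """Return a stack_id, or -1 if the caller must record a full stack."""
        key = tuple(stack)
        mask = self.table_size - 1
        idx = hash(key) & mask
        for _ in range(MAX_PROBE):
            slot = self.slots[idx]
            if slot is None:
                if self.entries == self.capacity:
                    # Pool exhausted: the slot stays claimed but empty.
                    self.slots[idx] = GRAVESTONE
                    break
                stack_id = self.entries
                self.slots[idx] = (key, stack_id)
                self.entries += 1
                self.successes += 1
                return stack_id
            if isinstance(slot, tuple) and slot[0] == key:
                # Full comparison of the stack confirms the match.
                self.successes += 1
                return slot[1]
            idx = (idx + 1) & mask           # linear probing
        self.drops += 1
        return -1
```

In this model a gravestone lengthens the probe sequence for later
inserts that hash nearby, which is the amplification effect the
documentation mentions; only a reset would clear it.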



* [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-26 11:52 UTC
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
  is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
  TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
  used by TRACE_ITER_PROF_TEXT_OFFSET)
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
  so it is only exposed under the top-level trace instance, matching
  the convention already used for global-only options such as 'printk'
  and 'record-cmd'. Secondary instances under tracing/instances/*/
  do not see the option at all, avoiding a confusing no-op.
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
  when the stackmap option is active. If reserving a TRACE_STACK_ID
  ring-buffer slot fails after a successful get_id(), the path falls
  through to the full-stack recording so the event still gets a stack
  trace recorded.
- Stackmap pointer read with smp_load_acquire(), published with
  smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap is retained as defense-in-depth: events
  that fire before fs_initcall (when the map is created) or after a
  failed ftrace_stackmap_create() observe a NULL pointer and fall back
  to full stack recording without dereferencing it
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c
- Added TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c
  so ftrace startup selftests don't reject the new entry type when
  the stackmap option is enabled

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before -- no new failure modes introduced.

Per-instance stackmap support is left as a follow-up; gating the
option via TOP_LEVEL_TRACE_FLAGS makes the global-only scope
explicit at the tracefs interface rather than relying on a silent
runtime fallback.

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/trace.c          | 78 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h          | 16 +++++++
 kernel/trace/trace_entries.h  | 15 +++++++
 kernel/trace/trace_output.c   | 23 +++++++++++
 kernel/trace/trace_selftest.c |  1 +
 5 files changed, 131 insertions(+), 2 deletions(-)
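
Note on the fallback behavior: the decision made in
__ftrace_trace_stack() can be summarized as a small truth table. The
following is a rough Python model of the control flow in the hunk below,
not kernel code; choose_record() and its parameter names are
hypothetical, and the model ignores the case where the full-stack
reservation itself fails.

```python
from typing import Optional


def choose_record(option_enabled: bool,
                  map_published: bool,
                  stack_id: Optional[int],
                  reserve_ok: bool) -> str:
    """Return which ring-buffer entry type the stack path would emit.

    option_enabled: the 'stackmap' trace option is set
    map_published:  tr->stackmap is non-NULL when read
    stack_id:       result of the map lookup (negative or None on failure)
    reserve_ok:     reserving the TRACE_STACK_ID slot succeeded
    """
    if not option_enabled:
        return "TRACE_STACK"
    if not map_published:
        return "TRACE_STACK"      # map not created yet, or creation failed
    if stack_id is None or stack_id < 0:
        return "TRACE_STACK"      # pool exhausted or map being reset
    if not reserve_ok:
        return "TRACE_STACK"      # fall through to full-stack recording
    return "TRACE_STACK_ID"
```

Every failure branch ends in the pre-existing full-stack path, which is
why the commit message can claim that no new failure modes are
introduced.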

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..36120355e549 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) |			\
 	       TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) |	\
-	       TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+	       TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) |	\
+	       FPROFILE_DEFAULT_FLAGS)
 
 /* trace_flags that are default zero for instances */
 #define ZEROED_TRACE_FLAGS \
 	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
-	 TRACE_ITER(COPY_MARKER))
+	 TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
 
 /*
  * The global_trace is the descriptor that holds the top-level tracing
@@ -2184,6 +2186,49 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid >= 0) {
+			event = __trace_buffer_lock_reserve(buffer,
+					TRACE_STACK_ID,
+					sizeof(*sid_entry), trace_ctx);
+			if (!event) {
+				/*
+				 * Could not reserve a TRACE_STACK_ID slot;
+				 * fall back to the full-stack path so the
+				 * event still gets a stack trace recorded.
+				 */
+				goto full_stack;
+			}
+			sid_entry = ring_buffer_event_data(event);
+			sid_entry->stack_id = sid;
+			/*
+			 * stack_id is a synthetic side-event attached to a
+			 * primary trace event that was already subject to
+			 * filtering. No per-event filter is defined for
+			 * TRACE_STACK_ID, so commit unconditionally.
+			 */
+			__buffer_unlock_commit(buffer, event);
+			goto out;
+		}
+		/* On stackmap failure, record the full stack instead. */
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -9222,6 +9267,35 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Use smp_store_release to ensure the stackmap
+			 * structure is fully initialized before publishing
+			 * the pointer to concurrent trace event readers.
+			 */
+			smp_store_release(&global_trace.stackmap, smap);
+			trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+					smap, &ftrace_stackmap_fops);
+			trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+					smap, &ftrace_stackmap_stat_fops);
+			trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+					smap, &ftrace_stackmap_bin_fops);
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace is statically defined; its stackmap
+			 * field is zero-initialized via BSS, so leaving it
+			 * NULL ensures the smp_load_acquire() in
+			 * __ftrace_trace_stack() falls back to full stack.
+			 */
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap		*stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET);		\
 		IF_ASSIGN(var, ent, struct func_repeats_entry,		\
 			  TRACE_FUNC_REPEATS);				\
+		IF_ASSIGN(var, ent, struct stack_id_entry,		\
+			  TRACE_STACK_ID);				\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS				\
+			C(STACKMAP,		"stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT	-1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS					\
 		C(PROF_TEXT_OFFSET,	"prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
+		STACKMAP_FLAGS					\
 		BRANCH_FLAGS					\
 		PROFILER_FLAGS					\
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field(	int,		stack_id	)
+	),
+
+	F_printk("<stack_id %d>", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs		= &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace		= trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type		= TRACE_STACK_ID,
+	.funcs		= &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
 	case TRACE_CTX:
 	case TRACE_WAKE:
 	case TRACE_STACK:
+	case TRACE_STACK_ID:
 	case TRACE_PRINT:
 	case TRACE_BRANCH:
 	case TRACE_GRAPH_ENT:
-- 
2.34.1


^ permalink raw reply related

* [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
  bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
  is O(1) even when the table is heavily loaded with claimed-but-
  empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
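
The insert path described by the bullets above can be sketched in
userspace C11. This is only an illustration under stated assumptions,
not the patch's code: the demo_* names, the tiny table size, and the
FNV-1a hash (standing in for jhash2) are all hypothetical, and the
pool allocator uses a plain fetch-add rather than the patch's
atomic_fetch_add_unless().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BITS	4				/* 16 pooled elements */
#define DEMO_MAX_ELTS	(1u << DEMO_BITS)
#define DEMO_MAP_SIZE	(1u << (DEMO_BITS + 1))	/* 2x over-provisioned */
#define DEMO_MAX_PROBE	8				/* bounded probe length */
#define DEMO_MAX_DEPTH	4

struct demo_elt {
	uint32_t nr;
	uint64_t ips[DEMO_MAX_DEPTH];
};

struct demo_entry {
	_Atomic uint32_t key;		/* 0 = free slot */
	struct demo_elt *_Atomic val;	/* NULL until published */
};

struct demo_map {
	struct demo_entry entries[DEMO_MAP_SIZE];
	struct demo_elt elts[DEMO_MAX_ELTS];
	atomic_uint next_elt;
};

/* FNV-1a as a stand-in hash (assumption); 0 is remapped to 1. */
static uint32_t demo_hash(const uint64_t *ips, uint32_t nr)
{
	const unsigned char *p = (const unsigned char *)ips;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < nr * sizeof(*ips); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h ? h : 1;
}

/* Returns the slot index used as stack id, or -1 when dropped. */
int demo_get_id(struct demo_map *m, const uint64_t *ips, uint32_t nr)
{
	uint32_t key, idx;

	assert(nr > 0 && nr <= DEMO_MAX_DEPTH);
	key = demo_hash(ips, nr);
	idx = key >> (32 - (DEMO_BITS + 1));	/* top bits pick the slot */

	for (int probes = 0; probes < DEMO_MAX_PROBE; probes++, idx++) {
		struct demo_entry *e;
		uint32_t seen;

		idx &= DEMO_MAP_SIZE - 1;
		e = &m->entries[idx];
		seen = atomic_load_explicit(&e->key, memory_order_relaxed);

		if (seen == key) {
			/* Acquire pairs with the release store below. */
			struct demo_elt *v = atomic_load_explicit(&e->val,
						memory_order_acquire);

			if (v && v->nr == nr &&
			    !memcmp(v->ips, ips, nr * sizeof(*ips)))
				return (int)idx;
			/* NULL or mismatch: keep probing. */
		} else if (seen == 0) {
			uint32_t expected = 0;

			if (atomic_compare_exchange_strong(&e->key,
							   &expected, key)) {
				unsigned int slot;
				struct demo_elt *elt;

				slot = atomic_fetch_add(&m->next_elt, 1);
				if (slot >= DEMO_MAX_ELTS)
					return -1; /* claimed but empty */
				elt = &m->elts[slot];
				elt->nr = nr;
				memcpy(elt->ips, ips, nr * sizeof(*ips));
				/* Publish only after elt is filled in. */
				atomic_store_explicit(&e->val, elt,
						      memory_order_release);
				return (int)idx;
			}
			/* Lost the race for this slot: keep probing. */
		}
	}
	return -1;
}
```

Calling demo_get_id() twice with the same IP array on a zeroed
struct demo_map should return the same index both times, while a
different array lands in a different slot.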

The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.

The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, table_size, successes, drops,
  success_rate)
- stack_map_bin: binary export (all fields native-endian)
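
A userspace consumer of stack_map_bin might walk the buffer roughly
as follows. This is a hedged sketch: the record layout is taken from
struct ftrace_stackmap_bin_header and struct ftrace_stackmap_bin_entry
in trace_stackmap.h, but the fsmb_* names and the validation policy
are assumptions made for illustration, and the reader is assumed to
run on the same machine (native-endian data).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSMB_MAGIC	0x464D5342u	/* FTRACE_STACKMAP_BIN_MAGIC */
#define FSMB_VERSION	2u		/* FTRACE_STACKMAP_BIN_VERSION */
#define FSMB_MAX_DEPTH	64u		/* FTRACE_STACKMAP_MAX_DEPTH */

struct fsmb_header { uint32_t magic, version, nr_stacks, reserved; };
struct fsmb_entry  { uint32_t stack_id, nr, ref_count, reserved; };

/*
 * Walk every record in buf and sum the number of IPs seen.
 * Returns the number of stacks, or -1 if the buffer is malformed
 * or truncated.
 */
long fsmb_walk(const void *buf, size_t len, uint64_t *total_ips)
{
	struct fsmb_header hdr;
	size_t off = sizeof(hdr);
	uint64_t ips = 0;

	assert(buf != NULL);
	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != FSMB_MAGIC || hdr.version != FSMB_VERSION)
		return -1;

	for (uint32_t i = 0; i < hdr.nr_stacks; i++) {
		struct fsmb_entry e;
		size_t need;

		/* off <= len holds here, so len - off cannot wrap. */
		if (len - off < sizeof(e))
			return -1;
		memcpy(&e, (const char *)buf + off, sizeof(e));
		off += sizeof(e);

		if (e.nr > FSMB_MAX_DEPTH)
			return -1;
		need = (size_t)e.nr * sizeof(uint64_t);
		if (len - off < need)
			return -1;
		off += need;		/* skip u64 ips[nr] */
		ips += e.nr;
	}

	if (total_ips)
		*total_ips = ips;
	return (long)hdr.nr_stacks;
}
```

The memcpy() calls avoid unaligned loads when the file contents were
read into an arbitrary byte buffer.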

Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.

Reset semantics:
- Reset is a control-path operation only allowed when tracing is
  stopped on the owning trace_array. Online reset (with tracing
  active) is intentionally not supported.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
  verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
  ftrace callback path (which runs preempt-disabled).
- A reader_sem (rw_semaphore) serializes the clearing memset
  against tracefs readers (seq_file iteration and stack_map_bin
  snapshot), which run in process context and aren't covered by
  synchronize_rcu(). The hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
  subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
  loads of entry->key/val are properly ordered after the check
  (control dependencies only order stores per LKMM).
- A concurrent reset, or a reset attempted while tracing is active,
  returns -EBUSY.
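
The reset handshake listed above can be modelled in userspace C11 as
a rough sketch. The reset_demo_* names are hypothetical, tracing_on
stands in for tracer_tracing_is_on(), and the RCU grace period and
rwsem are reduced to comments because they have no direct userspace
equivalent here.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

struct reset_demo {
	atomic_int resetting;	/* 1 while a reset owns the map */
	atomic_bool tracing_on;	/* stand-in for tracer_tracing_is_on() */
	int table[8];		/* stand-in for entries[] and elts[] */
};

int reset_demo_reset(struct reset_demo *s)
{
	int expected = 0;

	assert(s != NULL);

	/* Step 1: claim reset rights; a second resetter sees 1. */
	if (!atomic_compare_exchange_strong(&s->resetting, &expected, 1))
		return -EBUSY;

	/* Step 2: refuse while tracing is active, dropping the claim. */
	if (atomic_load(&s->tracing_on)) {
		atomic_store(&s->resetting, 0);
		return -EBUSY;
	}

	/* Kernel: synchronize_rcu(), then down_write(&reader_sem). */
	memset(s->table, 0, sizeof(s->table));
	/* Kernel: up_write(&reader_sem). */

	/* Release store: a later acquire load sees the cleared table. */
	atomic_store_explicit(&s->resetting, 0, memory_order_release);
	return 0;
}

/* Hot-path side: bail out with -EINVAL while a reset is in flight. */
int reset_demo_get_id(struct reset_demo *s, int value)
{
	assert(s != NULL);
	if (atomic_load_explicit(&s->resetting, memory_order_acquire))
		return -EINVAL;
	s->table[0] = value;
	return 0;
}
```

The acquire load in the hot-path function is what orders the later
table accesses after the flag check; a relaxed load would leave only
a control dependency.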

Concurrency notes:
- entry->val publication uses smp_store_release() paired with
  smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
  READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
  use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
  atomic_read() before the contended atomic RMW, avoiding cacheline
  contention once the pool is full. Slots that win cmpxchg but
  cannot get an elt are left 'claimed but empty'; subsequent
  lookups treat val==NULL as a miss and probe past them.

Hash key:
- Per-instance random seed stored in the stackmap struct (no
  global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
  free-slot sentinel). Full memcmp confirms matches.

Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
  element pool is ~135 MB and a stack_map_bin snapshot may briefly
  allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
  field so data[] is 8-byte aligned on both 32-bit and 64-bit
  architectures, avoiding alignment faults when writing u64 IPs
  on strict-alignment architectures.
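
The sizes quoted above can be checked with a little arithmetic. The
sketch below assumes a 64-bit kernel (8-byte unsigned long, 4-byte
atomic_t) and mirrors the layouts of struct stackmap_elt and struct
ftrace_stackmap_bin_entry with fixed-width types; the sizing_*
helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIZING_MAX_DEPTH	64	/* FTRACE_STACKMAP_MAX_DEPTH */
#define SIZING_BITS_MIN		10
#define SIZING_BITS_MAX		18

/* struct stackmap_elt as laid out on a 64-bit kernel (assumption). */
struct sizing_elt {
	uint32_t nr;
	uint32_t ref_count;		/* atomic_t wraps a 32-bit int */
	uint64_t ips[SIZING_MAX_DEPTH];
};

/* Fixed part of one stack_map_bin record. */
struct sizing_bin_entry { uint32_t stack_id, nr, ref_count, reserved; };

/* Same clamping rule as the bits parameter: [10, 18]. */
unsigned int sizing_clamp_bits(unsigned long bits)
{
	if (bits < SIZING_BITS_MIN)
		return SIZING_BITS_MIN;
	if (bits > SIZING_BITS_MAX)
		return SIZING_BITS_MAX;
	return (unsigned int)bits;
}

/* Element pool: 2^bits elements of fixed size. */
uint64_t sizing_pool_bytes(unsigned int bits)
{
	assert(bits >= SIZING_BITS_MIN && bits <= SIZING_BITS_MAX);
	return (uint64_t)sizeof(struct sizing_elt) << bits;
}

/*
 * Worst-case stack_map_bin snapshot: a 16-byte header plus
 * (2^bits + 1) full-depth records, matching the allocation made
 * in stackmap_bin_open() when the pool is full.
 */
uint64_t sizing_bin_worst_bytes(unsigned int bits)
{
	uint64_t per_entry = sizeof(struct sizing_bin_entry) +
			     SIZING_MAX_DEPTH * sizeof(uint64_t);

	assert(bits >= SIZING_BITS_MIN && bits <= SIZING_BITS_MAX);
	return 16 + (((uint64_t)1 << bits) + 1) * per_entry;
}
```

With these definitions, bits=18 gives a 136,314,880-byte pool
(130 MiB) and bits=14 gives 8,519,680 bytes (about 8 MiB).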

Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
  range 10-18, default 14)

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/Kconfig          |  22 +
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  57 +++
 4 files changed, 860 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
 
 	  Say N if unsure.
 
+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing "0" or "reset" resets the stack map while tracing is stopped.
+	  Reading lists each unique stack with its stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c89f6d527c96
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,780 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ *   bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *   cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ *     resetting flag itself prevents new insertions even if userspace
+ *     re-enables tracing immediately after the check.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val) and are serialized against the
+ * clearing phase of a reset by reader_sem. The rwsem is dropped
+ * between read() passes, so a reset that lands between passes can
+ * leave stack_map output truncated or partial but never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE	64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+	struct trace_array	*tr;		/* owning trace_array */
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	u32			hash_seed;	/* per-instance jhash seed */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	*elts;		/* flat element pool */
+	atomic_t		resetting;
+	/*
+	 * Reader/reset serialization. Held in shared mode (read lock)
+	 * across seq_file iteration and binary snapshot construction;
+	 * held in exclusive mode (write lock) by reset's clearing
+	 * phase. The hot path (get_id) does not take this lock — it
+	 * uses smp_load_acquire/smp_store_release on entry->val and
+	 * the resetting flag for the lock-free protocol.
+	 */
+	struct rw_semaphore	reader_sem;
+	/*
+	 * Per-CPU counters using local_t. local_t increments are NMI-
+	 * safe on all architectures (single-instruction or interrupt-
+	 * masked) and avoid the raw_spinlock_t fallback that
+	 * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+	 * deadlock if an NMI hit while the spinlock was held.
+	 */
+	local_t __percpu	*successes;	/* events served (hits + new inserts) */
+	local_t __percpu	*drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ *   bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ *             export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN	10
+#define FTRACE_STACKMAP_BITS_MAX	18
+#define FTRACE_STACKMAP_BITS_DEFAULT	14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+	 * struct is 520 B on 64-bit (4 + 4 + 64*8), so total ~135 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->successes = alloc_percpu(local_t);
+	if (!smap->successes) {
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+	smap->drops = alloc_percpu(local_t);
+	if (!smap->drops) {
+		free_percpu(smap->successes);
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	init_rwsem(&smap->reader_sem);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	free_percpu(smap->drops);
+	free_percpu(smap->successes);
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, or -EBUSY if another reset is already in
+ * progress or tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller runs in process context (typically the tracefs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   3. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   4. memset entries, elts, and counters.
+ *   5. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+		atomic_set(&smap->resetting, 0);
+		return -EBUSY;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * Take the reader_sem in exclusive mode. This serializes the
+	 * memset against any tracefs reader (seq_file iteration or
+	 * stack_map_bin snapshot) that may currently hold the rwsem
+	 * for read. synchronize_rcu() already drained the hot path;
+	 * this rwsem covers process-context readers that aren't
+	 * preempt-disabled.
+	 */
+	down_write(&smap->reader_sem);
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	{
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			local_set(per_cpu_ptr(smap->successes, cpu), 0);
+			local_set(per_cpu_ptr(smap->drops, cpu), 0);
+		}
+	}
+
+	up_write(&smap->reader_sem);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	/*
+	 * atomic_read_acquire() pairs with atomic_set_release() in the
+	 * reset path. This ensures that subsequent reads of entry->key
+	 * and entry->val are ordered after this check; without acquire,
+	 * the CPU would only have a control dependency, which orders
+	 * subsequent stores but not loads (per LKMM).
+	 */
+	if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+		return -EINVAL;
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		/*
+		 * READ_ONCE() to avoid LKMM data race with concurrent
+		 * cmpxchg(&entry->key, 0, key_hash) on this slot.
+		 */
+		test_key = READ_ONCE(entry->key);
+
+		if (test_key == key_hash) {
+			/*
+			 * smp_load_acquire pairs with smp_store_release in
+			 * the publisher below; ensures we see fully-formed
+			 * elt fields (nr, ips, ref_count) before dereference.
+			 */
+			val = smp_load_acquire(&entry->val);
+			/*
+			 * READ_ONCE(val->nr) keeps style consistent with
+			 * the seq_show / bin_open readers. nr is write-once
+			 * (set before publish, never modified afterwards),
+			 * so the load is data-race-free, but READ_ONCE
+			 * silences any analysis tool that flags a plain
+			 * read of a field that is also read under acquire
+			 * elsewhere.
+			 */
+			if (val && READ_ONCE(val->nr) == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				atomic_inc(&val->ref_count);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/*
+			 * Free slot: try to claim it.
+			 *
+			 * If two CPUs race here with the same key_hash
+			 * (same stack), one loses the cmpxchg, advances,
+			 * and may insert the same stack at a later slot.
+			 * This can produce a small number of duplicate
+			 * entries under heavy contention. The trade-off
+			 * is accepted to keep the hot path lock-free;
+			 * ref_count is split across the duplicates and
+			 * total memory cost is bounded by the element
+			 * pool size.
+			 */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty" — future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					local_inc(this_cpu_ptr(smap->drops));
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot. */
+		}
+
+		idx++;
+		probes++;
+	}
+
+	local_inc(this_cpu_ptr(smap->drops));
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap	*smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	/*
+	 * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which holds it for write while clearing the table. Released in
+	 * stackmap_seq_stop(), which seq_file calls even when start()
+	 * returned NULL. Documentation/filesystems/seq_file.rst: "the
+	 * iterator value returned by start() or next() is guaranteed to
+	 * be passed to a subsequent next() or stop()".
+	 */
+	down_read(&smap->reader_sem);
+	for (i = *pos; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    smp_load_acquire(&smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    smp_load_acquire(&smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	/*
+	 * Advance *pos past the end so that on the next read() the
+	 * subsequent stackmap_seq_start() call returns NULL and the
+	 * iteration terminates. Without this, seq_read() would loop
+	 * on the last element.
+	 */
+	*pos = smap->map_size;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+
+	/*
+	 * seq_file invokes stop() unconditionally after each iteration
+	 * pass (see seq_read_iter / traverse), even when start() returned
+	 * NULL. Always release here, balanced against the down_read in
+	 * stackmap_seq_start().
+	 */
+	if (smap)
+		up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+	struct stackmap_seq_private *priv = m->private;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++)
+		seq_printf(m, "  [%u] %pS\n", i, (void *)elt->ips[i]);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u64 successes = 0, drops = 0;
+	u32 entries;
+	int cpu;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	for_each_possible_cpu(cpu) {
+		successes += local_read(per_cpu_ptr(smap->successes, cpu));
+		drops += local_read(per_cpu_ptr(smap->drops, cpu));
+	}
+
+	seq_printf(m, "entries:      %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size:   %u\n", smap->map_size);
+	seq_printf(m, "successes:    %llu\n", successes);
+	seq_printf(m, "drops:        %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64	size;
+	char	data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=18 this caps at ~135 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		     (sizeof(struct ftrace_stackmap_bin_entry) +
+		      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	/*
+	 * Take reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which clears the table and elt pool under the write lock.
+	 */
+	down_read(&smap->reader_sem);
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!READ_ONCE(entry->key))
+			continue;
+		elt = smp_load_acquire(&entry->val);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++)
+			ips_out[k] = (u64)elt->ips[k];
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	up_read(&smap->reader_sem);
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open		= stackmap_bin_open,
+	.read		= stackmap_bin_read,
+	.llseek		= default_llseek,
+	.release	= stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..2e82bd6fb1c3
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH	64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC	0x464D5342	/* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION	2
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1



* [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-26 11:52 UTC
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Masami, Steven, all,

This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.

[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
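
As a rough illustration of the idea only (a toy userspace model in
Python, not the lock-free kernel implementation; the ToyStackMap class
and its fields are made up for this example), deduplication amounts to
mapping each distinct stack to a small ID and recording just that ID
per event:

```python
class ToyStackMap:
    """Toy model: map each distinct stack to a small integer ID."""

    def __init__(self):
        self.ids = {}          # tuple of return addresses -> stack_id
        self.ref_counts = {}   # stack_id -> number of events that used it

    def get_id(self, ips):
        key = tuple(ips)
        stack_id = self.ids.get(key)
        if stack_id is None:
            # First time this stack is seen: store it once, hand out a new ID.
            stack_id = len(self.ids) + 1
            self.ids[key] = stack_id
            self.ref_counts[stack_id] = 0
        self.ref_counts[stack_id] += 1
        return stack_id


smap = ToyStackMap()
events = [(0x1000, 0x2000), (0x1000, 0x2000), (0x3000,), (0x1000, 0x2000)]
# What the ring buffer would carry in this model: one small ID per event.
recorded = [smap.get_id(ips) for ips in events]
print(recorded)
print(smap.ref_counts)
```

The real map differs in the ways that matter for tracing: it is a
fixed-size, preallocated hash table claimed with cmpxchg, so it can be
used from any context without taking locks or allocating.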

Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.

Changes since v2
================

Patch 1 (lock-free stackmap):
  - Hot-path counters changed from atomic64_t to per-CPU local_t.
    This avoids the raw_spinlock_t fallback that atomic64_t uses on
    32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
  - reset() now serializes against tracefs readers via an
    rw_semaphore (held for write during the clearing memset, held
    for read by seq_file iteration and bin snapshot construction).
    synchronize_rcu() alone was insufficient because seq_file/bin
    readers are in process context, not preempt-disabled.
  - get_id() uses atomic_read_acquire() on smap->resetting so
    subsequent loads of entry->key/val are properly ordered after
    the check (LKMM control dependencies only order stores).
  - All plain reads of entry->key now use READ_ONCE() to avoid
    LKMM data races with the cmpxchg writer.
  - val->nr in the hot path now uses READ_ONCE() to keep style
    consistent with the seq_show / bin_open readers.
  - stackmap_seq_next() now updates *pos past map_size on EOF so
    seq_read() terminates instead of looping on the last element.
  - Added a comment in the cmpxchg-claim path documenting that
    two CPUs racing with the same key_hash may produce a small
    number of duplicate entries; this is an accepted trade-off
    for keeping the hot path lock-free.
  - Removed BUG_ON in create path (the constraint is satisfied by
    construction; no runtime check needed).

Patch 2 (integration):
  - 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
    ZEROED_TRACE_FLAGS so the option is only exposed under the
    top-level trace instance, matching the convention used for
    other global-only options such as 'printk' and 'record-cmd'.
    Secondary instances under tracing/instances/*/ no longer see
    the option at all, instead of seeing it as a silent no-op.
  - TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
    so ftrace startup selftests don't reject the entry type.
  - Corrected a comment about how global_trace.stackmap is
    zero-initialized (BSS, not kzalloc).

Patch 3 (docs / selftest / tooling):
  - Selftest now reads trace contents BEFORE switching back to the
    nop tracer (tracer_init() calls tracing_reset_online_cpus()
    which would have emptied the ring buffer).
  - Added 'function:tracer' to the selftest '# requires:' line so
    ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
    instead of failing spuriously.
  - Selftest grep tightened to '<stack_id' to avoid future
    false-positives if any other tracepoint name contains
    "stack_id".
  - New stackmap-instance-gate.tc selftest asserts the option and
    stack_map* nodes are present on the global instance and absent
    on a freshly-created secondary instance, locking in the
    TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
  - Documentation Performance section made vendor-neutral
    ("aarch64 SMP system" instead of a specific device name) and
    the term "Hit rate" replaced with "Dedup rate" to match the
    actual stat field name (success_rate).
  - Documentation Design section now states that deduplication is
    best-effort under heavy contention (cmpxchg races may produce
    a small number of duplicate entries for the same stack), so
    users observing entries > unique-stacks have a documented
    explanation.

Test results
============

Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger

Functional tests (all PASS):
  - tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
  - options/stackmap writable, trace shows <stack_id N>
  - stack_map text export with correct symbols
  - reset clears entries when tracing stopped
  - reset rejected (-EBUSY) while tracing active
  - per-event trigger: only specified events get stacks

Performance (sched_switch, 5s):
  entries:       466 / 16384
  successes:     9159
  drops:         0
  success_rate:  100%
  dedup rate:    95.2% (466 unique stacks / 9625 total events)

Performance (kmem_cache_alloc, 5s):
  entries:       1177 / 16384
  successes:     60078
  drops:         0
  success_rate:  100%
  dedup rate:    98.1% (1177 unique stacks / 61255 total events)

Ring buffer space savings:
  Event             Full stack            Stackmap                      Saving
  ----------------  --------------------  ----------------------------  ------
  sched_switch      9625 × 88B = 847KB    12B×9625 + 88B×466 = 156KB    82%
  kmem_cache_alloc  61255 × 88B = 5.4MB   12B×61255 + 88B×1177 = 839KB  84%
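
The arithmetic behind these figures can be re-derived with a short
script. This is a sketch under the table's own assumptions: 88 bytes per
full stack record and 12 bytes per stack_id record are the numbers
quoted above, and the function names are only illustrative.

```python
FULL_STACK_BYTES = 88   # per-event full stack record, as quoted in the table
STACK_ID_BYTES = 12     # per-event stack_id record, as quoted in the table


def ring_buffer_saving(total_events, unique_stacks):
    """Return (full_bytes, stackmap_bytes, saving_fraction)."""
    full = total_events * FULL_STACK_BYTES
    # Every event carries a small stack_id record; each unique stack is
    # counted once at full size.
    dedup = total_events * STACK_ID_BYTES + unique_stacks * FULL_STACK_BYTES
    return full, dedup, 1 - dedup / full


def dedup_rate(total_events, unique_stacks):
    """Fraction of events whose stack was already present in the map."""
    return 1 - unique_stacks / total_events


for name, total, unique in (("sched_switch", 9625, 466),
                            ("kmem_cache_alloc", 61255, 1177)):
    full, dedup, saving = ring_buffer_saving(total, unique)
    print(f"{name}: {full} B -> {dedup} B, saving {saving:.1%}, "
          f"dedup rate {dedup_rate(total, unique):.1%}")
```

Sizes here use decimal units (1 KB = 1000 bytes), which is what the
table's rounded values correspond to.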

QEMU validation (v3 base: v7.1-rc5)
===================================

The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
  (e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
  secondary instances under tracing/instances/*/

Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.

Known limitations
=================

- Per-instance stackmap support is not included in this series.
  Following the convention used for other global-only options
  (PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
  top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
  not exposed under tracing/instances/*/options/. Per-instance
  maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
  CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
  ever enable the option. At the default bits=14 this is roughly
  8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
  allocation keeps the hot path entirely allocation-free and
  avoids any allocation-failure path under tracing pressure.
  Lazy allocation on first 'echo 1 > options/stackmap' is a
  reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
  concurrent contention two CPUs racing in the insert path with
  the same stack hash may each succeed in claiming a different
  slot, producing a small number of duplicate entries for the
  same stack. ref_count is then split across the duplicates.
  This is intentional: it keeps the hot path lock-free and
  bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
  the binary format settles.
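
For anyone who wants to poke at stack_map_bin before that integration
exists, the layout in trace_stackmap.h (a 16-byte header, then for each
stack a 16-byte entry followed by nr u64 addresses) can be parsed with a
few lines of Python. This is a minimal sketch, not the stackmap_dump.py
tool from patch 3; it assumes a little-endian kernel, and the function
name and the synthetic sample buffer are made up for the example.

```python
import struct

MAGIC = 0x464D5342               # FTRACE_STACKMAP_BIN_MAGIC, 'FSMB'
VERSION = 2                      # FTRACE_STACKMAP_BIN_VERSION
HEADER = struct.Struct("<4I")    # magic, version, nr_stacks, reserved
ENTRY = struct.Struct("<4I")     # stack_id, nr, ref_count, reserved


def parse_stack_map_bin(buf):
    """Parse a snapshot into {stack_id: (ref_count, [ips])}."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, nr_stacks, _reserved = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic {magic:#x}")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")

    stacks = {}
    off = HEADER.size
    for _ in range(nr_stacks):
        if off + ENTRY.size > len(buf):
            raise ValueError("truncated entry")
        stack_id, nr, ref_count, _reserved = ENTRY.unpack_from(buf, off)
        off += ENTRY.size
        end = off + 8 * nr       # entry is followed by u64 ips[nr]
        if end > len(buf):
            raise ValueError("truncated ips array")
        stacks[stack_id] = (ref_count,
                            list(struct.unpack_from(f"<{nr}Q", buf, off)))
        off = end
    return stacks


# Synthetic snapshot: one stack with ID 7, two frames, ref_count 3.
sample = (HEADER.pack(MAGIC, VERSION, 1, 0)
          + ENTRY.pack(7, 2, 3, 0)
          + struct.pack("<2Q", 0xffff800010001000, 0xffff800010002000))
print(parse_stack_map_bin(sample))
```

On a real system the buffer would come from reading the stack_map_bin
file in one go; because the export is a best-effort snapshot, a reader
should treat ref_count values as approximate.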

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace


Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 162 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  78 +-
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 780 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +
 tools/tracing/stackmap_dump.py                | 150 ++++
 14 files changed, 1449 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1



* [PATCH RFC 2/3] mm: new migrate_mode flag for async using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
	Johannes Weiner
  Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
	Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
	linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>

From: Alirad Malek <alirad.malek@zptcorp.com>

In preparation for the following patch, add a new migrate_mode,
MIGRATE_ASYNC_NON_TEMPORAL_STORES, which is still async but uses
non-temporal stores. Add a helper, migrate_mode_is_async(), that checks
for both async modes, and replace all plain checks of MIGRATE_ASYNC
with it.

Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
 fs/nfs/write.c                 |  2 +-
 include/linux/migrate_mode.h   |  9 +++++++++
 include/trace/events/migrate.h |  1 +
 mm/compaction.c                | 18 +++++++++---------
 mm/migrate.c                   | 12 ++++++------
 5 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1ed4b3590b1a..beae4441e080 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2119,7 +2119,7 @@ int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
 	}
 
 	if (folio_test_private_2(src)) { /* [DEPRECATED] */
-		if (mode == MIGRATE_ASYNC)
+		if (migrate_mode_is_async(mode))
 			return -EBUSY;
 		folio_wait_private_2(src);
 	}
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 265c4328b36a..f7186e705b48 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -3,6 +3,8 @@
 #define MIGRATE_MODE_H_INCLUDED
 /*
  * MIGRATE_ASYNC means never block
+ * MIGRATE_ASYNC_NON_TEMPORAL_STORES means never block and use non-temporal
+ * stores if supported by the architecture
  * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
@@ -10,10 +12,17 @@
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
+	MIGRATE_ASYNC_NON_TEMPORAL_STORES,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
 };
 
+static inline bool migrate_mode_is_async(enum migrate_mode mode)
+{
+	return mode == MIGRATE_ASYNC ||
+		mode == MIGRATE_ASYNC_NON_TEMPORAL_STORES;
+}
+
 enum migrate_reason {
 	MR_COMPACTION,
 	MR_MEMORY_FAILURE,
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index cd01dd7b3640..e493207a3f46 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -9,6 +9,7 @@
 
 #define MIGRATE_MODE						\
 	EM( MIGRATE_ASYNC,	"MIGRATE_ASYNC")		\
+	EM(MIGRATE_ASYNC_NON_TEMPORAL_STORES,		"MIGRATE_ASYNC_NON_TEMPORAL_STORES")	\
 	EM( MIGRATE_SYNC_LIGHT,	"MIGRATE_SYNC_LIGHT")		\
 	EMe(MIGRATE_SYNC,	"MIGRATE_SYNC")
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..cd26781b7376 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -444,7 +444,7 @@ static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 	/* Update where async and sync compaction should restart */
 	if (pfn > zone->compact_cached_migrate_pfn[0])
 		zone->compact_cached_migrate_pfn[0] = pfn;
-	if (cc->mode != MIGRATE_ASYNC &&
+	if (!migrate_mode_is_async(cc->mode) &&
 	    pfn > zone->compact_cached_migrate_pfn[1])
 		zone->compact_cached_migrate_pfn[1] = pfn;
 }
@@ -507,7 +507,7 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 	__acquires(lock)
 {
 	/* Track if the lock is contended in async mode */
-	if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
+	if (migrate_mode_is_async(cc->mode) && !cc->contended) {
 		if (spin_trylock_irqsave(lock, *flags))
 			return true;
 
@@ -864,7 +864,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			return -EAGAIN;
 
 		/* async migration should just abort */
-		if (cc->mode == MIGRATE_ASYNC)
+		if (migrate_mode_is_async(cc->mode))
 			return -EAGAIN;
 
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
@@ -875,7 +875,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	cond_resched();
 
-	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
+	if (cc->direct_compaction && migrate_mode_is_async(cc->mode)) {
 		skip_on_failure = true;
 		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
 	}
@@ -1364,7 +1364,7 @@ static bool suitable_migration_source(struct compact_control *cc,
 	if (pageblock_skip_persistent(page))
 		return false;
 
-	if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
+	if (!migrate_mode_is_async(cc->mode) || !cc->direct_compaction)
 		return true;
 
 	block_mt = get_pageblock_migratetype(page);
@@ -1465,7 +1465,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
 		return;
 
 	/* Minimise scanning during async compaction */
-	if (cc->direct_compaction && cc->mode == MIGRATE_ASYNC)
+	if (cc->direct_compaction && migrate_mode_is_async(cc->mode))
 		return;
 
 	/* Pageblock boundaries */
@@ -1705,7 +1705,7 @@ static void isolate_freepages(struct compact_control *cc)
 	block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
 						zone_end_pfn(zone));
 	low_pfn = pageblock_end_pfn(cc->migrate_pfn);
-	stride = cc->mode == MIGRATE_ASYNC ? COMPACT_CLUSTER_MAX : 1;
+	stride = migrate_mode_is_async(cc->mode) ? COMPACT_CLUSTER_MAX : 1;
 
 	/*
 	 * Isolate free pages until enough are available to migrate the
@@ -2514,7 +2514,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	unsigned long start_pfn = cc->zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(cc->zone);
 	unsigned long last_migrated_pfn;
-	const bool sync = cc->mode != MIGRATE_ASYNC;
+	const bool sync = !migrate_mode_is_async(cc->mode);
 	bool update_cached;
 	unsigned int nr_succeeded = 0, nr_migratepages;
 	int order;
@@ -2537,7 +2537,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
 						       cc->alloc_flags,
-						       cc->mode == MIGRATE_ASYNC,
+						       migrate_mode_is_async(cc->mode),
 						       !cc->direct_compaction);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c3d489ecf51..ff6cf50e7b0b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -907,7 +907,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 
 	do {
 		if (!trylock_buffer(bh)) {
-			if (mode == MIGRATE_ASYNC)
+			if (migrate_mode_is_async(mode))
 				goto unlock;
 			if (mode == MIGRATE_SYNC_LIGHT && !buffer_uptodate(bh))
 				goto unlock;
@@ -1220,7 +1220,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 	dst->private = NULL;
 
 	if (!folio_trylock(src)) {
-		if (mode == MIGRATE_ASYNC)
+		if (migrate_mode_is_async(mode))
 			goto out;
 
 		/*
@@ -1325,7 +1325,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 		/* Establish migration ptes */
 		VM_BUG_ON_FOLIO(folio_test_anon(src) &&
 			       !folio_test_ksm(src) && !anon_vma, src);
-		try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
+		try_to_migrate(src, migrate_mode_is_async(mode) ? TTU_BATCH_FLUSH : 0);
 		old_page_state |= PAGE_WAS_MAPPED;
 	}
 
@@ -1565,7 +1565,7 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 {
 	int rc;
 
-	if (mode == MIGRATE_ASYNC) {
+	if (migrate_mode_is_async(mode)) {
 		if (!folio_trylock(folio))
 			return -EAGAIN;
 	} else {
@@ -1799,7 +1799,7 @@ static int migrate_pages_batch(struct list_head *from,
 	LIST_HEAD(dst_folios);
 	bool nosplit = (reason == MR_NUMA_MISPLACED);
 
-	VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
+	VM_WARN_ON_ONCE(!migrate_mode_is_async(mode) &&
 			!list_empty(from) && !list_is_singular(from));
 
 	for (pass = 0; pass < nr_pass && retry; pass++) {
@@ -2107,7 +2107,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 		list_cut_before(&folios, from, &folio2->lru);
 	else
 		list_splice_init(from, &folios);
-	if (mode == MIGRATE_ASYNC)
+	if (migrate_mode_is_async(mode))
 		rc = migrate_pages_batch(&folios, get_new_folio, put_new_folio,
 				private, mode, reason, &ret_folios,
 				&split_folios, &stats,

-- 
2.43.0



* [PATCH RFC 3/3] mm: use non-temporal stores for demotion
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
	Johannes Weiner
  Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
	Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
	linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>

From: Alirad Malek <alirad.malek@zptcorp.com>

Memory demoted to a lower tier is assumed to be cold and most likely out of
the CPU's last-level cache. Additionally, for certain demotion targets (e.g.
CXL devices with compressed memory) the bandwidth can be negatively
impacted by the eviction patterns of the last-level cache when standard
memcpy is used. When CONFIG_DEMOTION_WITH_NON_TEMPORAL_STORES is enabled,
switch MIGRATE_ASYNC demotions to MIGRATE_ASYNC_NON_TEMPORAL_STORES so
that the folio copy path uses non-temporal stores.

Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
 mm/Kconfig   | 8 ++++++++
 mm/migrate.c | 9 ++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687..4b7a75b57f6e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -645,6 +645,14 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config DEMOTION_WITH_NON_TEMPORAL_STORES
+	bool "Use non-temporal stores for demotion"
+	default n
+	depends on MIGRATION
+	help
+	  Enable non-temporal stores when migrating pages due to demotion.
+	  If disabled, demotion uses regular migration copy paths.
+
 config DEVICE_MIGRATION
 	def_bool MIGRATION && ZONE_DEVICE
 
diff --git a/mm/migrate.c b/mm/migrate.c
index ff6cf50e7b0b..368d40dc8772 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -862,7 +862,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;
 
-	rc = folio_mc_copy(dst, src);
+	if (mode == MIGRATE_ASYNC_NON_TEMPORAL_STORES)
+		rc = folio_mc_copy_nt(dst, src);
+	else
+		rc = folio_mc_copy(dst, src);
 	if (unlikely(rc))
 		return rc;
 
@@ -2081,6 +2084,10 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 	LIST_HEAD(split_folios);
 	struct migrate_pages_stats stats;
 
+	if (IS_ENABLED(CONFIG_DEMOTION_WITH_NON_TEMPORAL_STORES) &&
+		reason == MR_DEMOTION && mode == MIGRATE_ASYNC)
+		mode = MIGRATE_ASYNC_NON_TEMPORAL_STORES;
+
 	trace_mm_migrate_pages_start(mode, reason);
 
 	memset(&stats, 0, sizeof(stats));

-- 
2.43.0



* [PATCH RFC 1/3] mm, x86: support copying a folio using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
	Johannes Weiner
  Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
	Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
	linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>

From: Alirad Malek <alirad.malek@zptcorp.com>

On x86, use memcpy_flushcache() (which uses non-temporal store
instructions) to copy a folio. To achieve that, create a series of
helpers with an _nt suffix, from folio_mc_copy() down to
copy_mc_to_kernel(), that behave like their original counterparts.

Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
 arch/x86/include/asm/uaccess.h |  4 ++++
 arch/x86/lib/copy_mc.c         | 26 ++++++++++++++++++++++++++
 include/linux/highmem.h        | 32 ++++++++++++++++++++++++++++++++
 include/linux/mm.h             |  1 +
 mm/util.c                      | 17 +++++++++++++++++
 5 files changed, 80 insertions(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 367297b188c3..2d0938d3e372 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -494,6 +494,10 @@ unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
 #define copy_mc_to_kernel copy_mc_to_kernel
 
+unsigned long __must_check
+copy_mc_to_kernel_nt(void *to, const void *from, unsigned len);
+#define copy_mc_to_kernel_nt copy_mc_to_kernel_nt
+
 unsigned long __must_check
 copy_mc_to_user(void __user *to, const void *from, unsigned len);
 #endif
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 97e88e58567b..5a2ee5c2211e 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -81,6 +81,32 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
 }
 EXPORT_SYMBOL_GPL(copy_mc_to_kernel);
 
+/**
+ * copy_mc_to_kernel_nt - memory copy that handles source exceptions
+ * if enabled, otherwise uses non-temporal stores
+ * @dst: destination address
+ * @src: source address
+ * @len: number of bytes to copy
+ *
+ * Return 0 for success, or number of bytes not copied if there was an
+ * exception.
+ */
+unsigned long __must_check copy_mc_to_kernel_nt(void *dst, const void *src, unsigned len)
+{
+	unsigned long ret;
+
+	if (copy_mc_fragile_enabled) {
+		instrument_memcpy_before(dst, src, len);
+		ret = copy_mc_fragile(dst, src, len);
+		instrument_memcpy_after(dst, src, len, ret);
+		return ret;
+	}
+
+	memcpy_flushcache(dst, src, len);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_mc_to_kernel_nt);
+
 unsigned long __must_check copy_mc_to_user(void __user *dst, const void *src, unsigned len)
 {
 	unsigned long ret;
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..a5cb435b9ffe 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -468,6 +468,32 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 
 	return ret;
 }
+
+#ifdef copy_mc_to_kernel_nt
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+	unsigned long ret;
+	char *vfrom, *vto;
+
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
+	ret = copy_mc_to_kernel_nt(vto, vfrom, PAGE_SIZE);
+	if (!ret)
+		kmsan_copy_page_meta(to, from);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
+
+	if (ret)
+		memory_failure_queue(page_to_pfn(from), 0);
+
+	return ret;
+}
+#else
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+	return copy_mc_highpage(to, from);
+}
+#endif
 #else
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 					unsigned long vaddr, struct vm_area_struct *vma)
@@ -481,6 +507,12 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 	copy_highpage(to, from);
 	return 0;
 }
+
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+	copy_highpage(to, from);
+	return 0;
+}
 #endif
 
 static inline void memcpy_page(struct page *dst_page, size_t dst_off,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5be3d8a8f806..d07ce478582d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1644,6 +1644,7 @@ void __folio_put(struct folio *folio);
 void split_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
 int folio_mc_copy(struct folio *dst, struct folio *src);
+int folio_mc_copy_nt(struct folio *dst, struct folio *src);
 
 unsigned long nr_free_buffer_pages(void);
 
diff --git a/mm/util.c b/mm/util.c
index b05ab6f97e11..e09e9b5f8eee 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -749,6 +749,23 @@ int folio_mc_copy(struct folio *dst, struct folio *src)
 }
 EXPORT_SYMBOL(folio_mc_copy);
 
+int folio_mc_copy_nt(struct folio *dst, struct folio *src)
+{
+	long nr = folio_nr_pages(src);
+	long i = 0;
+
+	for (;;) {
+		if (copy_mc_highpage_nt(folio_page(dst, i), folio_page(src, i)))
+			return -EHWPOISON;
+		if (++i == nr)
+			break;
+		cond_resched();
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(folio_mc_copy_nt);
+
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
 static int sysctl_overcommit_ratio __read_mostly = 50;
 static unsigned long sysctl_overcommit_kbytes __read_mostly;

-- 
2.43.0



* [PATCH RFC 0/3] Demote to lower tier using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
	Johannes Weiner
  Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
	Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
	linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek

In most memory tiering scenarios, the memory to be demoted is expected
to be cold and most likely out of the node's last-level cache, as are
the destination pages in the target node. Using non-temporal stores
instead of a standard memcpy path can reduce cache pollution in the
local node and the bandwidth overhead to the target node. Furthermore,
for certain types of CXL devices that support in-line memory
compression, the last-level cache eviction patterns can negatively
affect the bandwidth of the device. Non-temporal stores can mitigate
this.

This patch set introduces a new migrate_mode for non-temporal stores
that is used only in the demotion path. Patch 1 adds helpers in x86 and
mm that provide a non-temporal variant of folio_mc_copy(). Patch 2 adds
the new mode and the changes needed to keep the existing async behavior
intact. Patch 3 uses the new mode for demotions, guarded by a Kconfig
option.
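
The mode handling in patches 2 and 3 can be modelled in a few lines. The
following is only a userspace sketch in Python of the logic in the
diffs, not kernel code: the enum mirrors the order in migrate_mode.h,
while effective_mode(), its config_enabled parameter and the string
stand-ins for the migrate reasons are hypothetical names used for
illustration.

```python
from enum import IntEnum


class MigrateMode(IntEnum):
    # Same order as enum migrate_mode after patch 2.
    MIGRATE_ASYNC = 0
    MIGRATE_ASYNC_NON_TEMPORAL_STORES = 1
    MIGRATE_SYNC_LIGHT = 2
    MIGRATE_SYNC = 3


# Stand-ins for enum migrate_reason values (illustrative only).
MR_COMPACTION = "MR_COMPACTION"
MR_DEMOTION = "MR_DEMOTION"


def migrate_mode_is_async(mode):
    """Model of the helper added in patch 2: true for both async modes."""
    return mode in (MigrateMode.MIGRATE_ASYNC,
                    MigrateMode.MIGRATE_ASYNC_NON_TEMPORAL_STORES)


def effective_mode(mode, reason, config_enabled):
    """Model of the check patch 3 adds at the top of migrate_pages()."""
    if (config_enabled and reason == MR_DEMOTION
            and mode == MigrateMode.MIGRATE_ASYNC):
        return MigrateMode.MIGRATE_ASYNC_NON_TEMPORAL_STORES
    return mode


mode = effective_mode(MigrateMode.MIGRATE_ASYNC, MR_DEMOTION, True)
print(mode.name, migrate_mode_is_async(mode))
```

The model also shows why the plain comparisons had to go: after the
upgrade, a check such as mode == MIGRATE_ASYNC would no longer match,
and the demotion would be treated as synchronous.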

Experimental data: on a CXL system with one memory expander, a
microbenchmark that allocates N = 64 GB of memory in the local node and
then triggers demotion via memory.reclaim shows read traffic on the
device almost completely eliminated: write traffic is N GB with and
without the patches, while read traffic drops from N GB to almost 0
with them.

Opens:
1. There is some duplication in the x86 tree and a bit in mm. Can we do
   something better there? As it stands, copy_mc_to_kernel_nt()
   duplicates the machine-check handling, which, if enabled, overrides
   the non-temporal path. We were not sure how to prioritize the two or
   what the best approach is here. Can we skip the machine-check
   handling completely for this path?
2. We've hidden the use of the new mode behind a Kconfig option. Is this
   the right way to add it?
3. We have not carefully considered how this should be structured so that
   it is easily adopted by other architecture trees (e.g. aarch64). Arm
   support is currently out of our scope, but any input is appreciated.

Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
Alirad Malek (3):
      mm, x86: support copying a folio using non-temporal stores
      mm: new migrate_mode flag for async using non-temporal stores
      mm: use non-temporal stores for demotion

 arch/x86/include/asm/uaccess.h |  4 ++++
 arch/x86/lib/copy_mc.c         | 26 ++++++++++++++++++++++++++
 fs/nfs/write.c                 |  2 +-
 include/linux/highmem.h        | 32 ++++++++++++++++++++++++++++++++
 include/linux/migrate_mode.h   |  9 +++++++++
 include/linux/mm.h             |  1 +
 include/trace/events/migrate.h |  1 +
 mm/Kconfig                     |  8 ++++++++
 mm/compaction.c                | 18 +++++++++---------
 mm/migrate.c                   | 21 ++++++++++++++-------
 mm/util.c                      | 17 +++++++++++++++++
 11 files changed, 122 insertions(+), 17 deletions(-)
---
base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
change-id: 20260526-rfc-nt-demote-0fafadfdd006

Best regards,
--  
Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>

^ permalink raw reply

* Re: [PATCHv3 07/12] selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
From: Jakub Sitnicki @ 2026-05-26 10:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260521124411.31133-8-jolsa@kernel.org>

On Thu, May 21, 2026 at 02:44 PM +02, Jiri Olsa wrote:
> Syncing latest usdt.h change [1].
>
> Now that we have nop10 optimization support in kernel, let's emit
> nop,nop10 for usdt probe. We leave it up to the library to use
> desirable nop instruction.
>
> [1] TBD
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply

* Re: [PATCH v2 10/17] landlock: Set audit_net.sk for socket access checks
From: Mickaël Salaün @ 2026-05-26 10:42 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Steven Rostedt
  Cc: Jann Horn, Jeff Xu, Justin Suess, Kees Cook, Masami Hiramatsu,
	Mathieu Desnoyers, Matthieu Buffet, Mikhail Ivanov, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-security-module,
	linux-trace-kernel, stable
In-Reply-To: <20260406143717.1815792-11-mic@digikod.net>

I merged this fix in the -next branch.

On Mon, Apr 06, 2026 at 04:37:08PM +0200, Mickaël Salaün wrote:
> Set audit_net.sk in current_check_access_socket() to provide the socket
> object to audit_log_lsm_data().  This makes Landlock consistent with
> AppArmor, which always sets .sk for socket operations, and with
> SELinux's generic socket permission checks.
> 
> The socket's local and foreign address information (laddr, lport, faddr,
> fport) is logged by the shared lsm_audit.c infrastructure when the
> socket has bound or connected state.  Fields with zero values are
> suppressed by print_ipv4_addr()/print_ipv6_addr(), so the audit output
> is unchanged for the common case of bind denials on unbound sockets.
> For connect denials after a prior bind, the bound local address (laddr,
> lport) appears before the existing sockaddr fields (daddr, dest).
> 
> No existing fields are removed or reordered, and the new field names
> (laddr, lport, faddr, fport) are standard audit fields already emitted
> by other LSMs through the same lsm_audit.c code path.
> 
> Add net_bind and net_connect audit tests.  The net_bind test verifies
> basic net denial auditing.  The net_connect test binds to an allowed
> port, then connects to a denied port, and verifies that the audit record
> includes laddr/lport from the socket state.
> 
> Fixes: 9f74411a40ce ("landlock: Log TCP bind and connect denials")
> Cc: stable@vger.kernel.org
> Cc: Günther Noack <gnoack@google.com>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> ---
> 
> Changes since v1:
> - New patch.
> ---
>  security/landlock/net.c                       |   1 +
>  tools/testing/selftests/landlock/audit_test.c | 187 ++++++++++++++++++
>  2 files changed, 188 insertions(+)
> 
> diff --git a/security/landlock/net.c b/security/landlock/net.c
> index a2aefc7967a1..d8bc9e0d012a 100644
> --- a/security/landlock/net.c
> +++ b/security/landlock/net.c
> @@ -225,6 +225,7 @@ static int current_check_access_socket(struct socket *const sock,
>  		return 0;
>  
>  	audit_net.family = address->sa_family;
> +	audit_net.sk = sock->sk;
>  	landlock_log_denial(subject,
>  			    &(struct landlock_request){
>  				    .type = LANDLOCK_REQUEST_NET_ACCESS,
> diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
> index da0bfd06391e..65dfb272c825 100644
> --- a/tools/testing/selftests/landlock/audit_test.c
> +++ b/tools/testing/selftests/landlock/audit_test.c
> @@ -6,14 +6,17 @@
>   */
>  
>  #define _GNU_SOURCE
> +#include <arpa/inet.h>
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <limits.h>
>  #include <linux/landlock.h>
> +#include <netinet/in.h>
>  #include <pthread.h>
>  #include <stdlib.h>
>  #include <sys/mount.h>
>  #include <sys/prctl.h>
> +#include <sys/socket.h>
>  #include <sys/types.h>
>  #include <sys/wait.h>
>  #include <unistd.h>
> @@ -160,6 +163,190 @@ TEST_F(audit, layers)
>  	EXPECT_EQ(0, close(ruleset_fd));
>  }
>  
> +static int matches_log_net_bind(struct __test_metadata *const _metadata,
> +				int audit_fd, __u16 port, __u64 *domain_id)
> +{
> +	/*
> +	 * The socket is unbound at bind() time, so laddr/lport/faddr/fport from
> +	 * the socket object are zero and not printed.  Only the sockaddr fields
> +	 * (src) appear.
> +	 */
> +	static const char log_template[] = REGEX_LANDLOCK_PREFIX
> +		" blockers=net\\.bind_tcp src=%u$";
> +	char log_match[sizeof(log_template) + 10];
> +
> +	snprintf(log_match, sizeof(log_match), log_template, port);
> +	return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
> +				  domain_id);
> +}
> +
> +/*
> + * Verifies that network denial audit records include enriched socket
> + * information (laddr/lport/faddr/fport) from the socket object.
> + */
> +TEST_F(audit, net_bind)
> +{
> +	const struct landlock_ruleset_attr ruleset_attr = {
> +		.handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP,
> +	};
> +	struct landlock_net_port_attr net_port = {
> +		.allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP,
> +		.port = 1024,
> +	};
> +	int status, ruleset_fd;
> +	pid_t child;
> +	__u64 denial_dom = 1;
> +
> +	ruleset_fd =
> +		landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
> +	ASSERT_LE(0, ruleset_fd);
> +
> +	/* Allow port 1024 only. */
> +	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
> +				       &net_port, 0));
> +
> +	EXPECT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
> +
> +	child = fork();
> +	ASSERT_LE(0, child);
> +	if (child == 0) {
> +		struct sockaddr_in addr = {
> +			.sin_family = AF_INET,
> +			.sin_port = htons(1025),
> +			.sin_addr.s_addr = htonl(INADDR_ANY),
> +		};
> +		int sock_fd;
> +
> +		EXPECT_EQ(0, landlock_restrict_self(ruleset_fd, 0));
> +		close(ruleset_fd);
> +
> +		/* Bind to port 1025 (not allowed). */
> +		sock_fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +		ASSERT_LE(0, sock_fd);
> +		EXPECT_EQ(-1, bind(sock_fd, (struct sockaddr *)&addr,
> +				   sizeof(addr)));
> +		EXPECT_EQ(EACCES, errno);
> +		close(sock_fd);
> +
> +		/* Verify audit record with enriched socket info. */
> +		EXPECT_EQ(0, matches_log_net_bind(_metadata, self->audit_fd,
> +						  1025, &denial_dom));
> +		EXPECT_NE(denial_dom, 1);
> +		EXPECT_NE(denial_dom, 0);
> +
> +		_exit(_metadata->exit_code);
> +		return;
> +	}
> +
> +	ASSERT_EQ(child, waitpid(child, &status, 0));
> +	if (WIFSIGNALED(status) || !WIFEXITED(status) ||
> +	    WEXITSTATUS(status) != EXIT_SUCCESS)
> +		_metadata->exit_code = KSFT_FAIL;
> +
> +	EXPECT_EQ(0, close(ruleset_fd));
> +}
> +
> +static int matches_log_net_connect(struct __test_metadata *const _metadata,
> +				   int audit_fd, __u16 denied_port,
> +				   __u16 bound_port, __u64 *domain_id)
> +{
> +	/*
> +	 * After bind(), the socket has local address state.  The audit record
> +	 * should include laddr/lport from the socket (via audit_net.sk) and
> +	 * daddr/dest from the connect sockaddr.
> +	 */
> +	static const char log_template[] = REGEX_LANDLOCK_PREFIX
> +		" blockers=net\\.connect_tcp"
> +		" laddr=127\\.0\\.0\\.1 lport=%u"
> +		" daddr=127\\.0\\.0\\.1 dest=%u$";
> +	char log_match[sizeof(log_template) + 20];
> +
> +	snprintf(log_match, sizeof(log_match), log_template, bound_port,
> +		 denied_port);
> +	return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
> +				  domain_id);
> +}
> +
> +/*
> + * Verifies that network denial audit records for connect include enriched
> + * socket information (laddr/lport) from the socket object after a prior bind.
> + * This complements net_bind which tests the unbound case.
> + */
> +TEST_F(audit, net_connect)
> +{
> +	const struct landlock_ruleset_attr ruleset_attr = {
> +		.handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP |
> +				      LANDLOCK_ACCESS_NET_CONNECT_TCP,
> +	};
> +	struct landlock_net_port_attr net_port;
> +	int status, ruleset_fd;
> +	pid_t child;
> +	__u64 denial_dom = 1;
> +
> +	ruleset_fd =
> +		landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
> +	ASSERT_LE(0, ruleset_fd);
> +
> +	/* Allow bind to port 1024 and connect to port 1024. */
> +	net_port.allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP |
> +				  LANDLOCK_ACCESS_NET_CONNECT_TCP;
> +	net_port.port = 1024;
> +	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
> +				       &net_port, 0));
> +
> +	EXPECT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
> +
> +	child = fork();
> +	ASSERT_LE(0, child);
> +	if (child == 0) {
> +		struct sockaddr_in bind_addr = {
> +			.sin_family = AF_INET,
> +			.sin_port = htons(1024),
> +			.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> +		};
> +		struct sockaddr_in conn_addr = {
> +			.sin_family = AF_INET,
> +			.sin_port = htons(1025),
> +			.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> +		};
> +		int sock_fd, optval = 1;
> +
> +		EXPECT_EQ(0, landlock_restrict_self(ruleset_fd, 0));
> +		close(ruleset_fd);
> +
> +		sock_fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +		ASSERT_LE(0, sock_fd);
> +		ASSERT_EQ(0, setsockopt(sock_fd, SOL_SOCKET, SO_REUSEADDR,
> +					&optval, sizeof(optval)));
> +
> +		/* Bind to allowed port 1024 (succeeds). */
> +		ASSERT_EQ(0, bind(sock_fd, (struct sockaddr *)&bind_addr,
> +				  sizeof(bind_addr)));
> +
> +		/* Connect to denied port 1025 (fails). */
> +		EXPECT_EQ(-1, connect(sock_fd, (struct sockaddr *)&conn_addr,
> +				      sizeof(conn_addr)));
> +		EXPECT_EQ(EACCES, errno);
> +		close(sock_fd);
> +
> +		/* Verify audit record with laddr/lport from bound socket. */
> +		EXPECT_EQ(0, matches_log_net_connect(_metadata, self->audit_fd,
> +						     1025, 1024, &denial_dom));
> +		EXPECT_NE(denial_dom, 1);
> +		EXPECT_NE(denial_dom, 0);
> +
> +		_exit(_metadata->exit_code);
> +		return;
> +	}
> +
> +	ASSERT_EQ(child, waitpid(child, &status, 0));
> +	if (WIFSIGNALED(status) || !WIFEXITED(status) ||
> +	    WEXITSTATUS(status) != EXIT_SUCCESS)
> +		_metadata->exit_code = KSFT_FAIL;
> +
> +	EXPECT_EQ(0, close(ruleset_fd));
> +}
> +
>  struct thread_data {
>  	pid_t parent_pid;
>  	int ruleset_fd, pipe_child, pipe_parent;
> -- 
> 2.53.0
> 
> 

^ permalink raw reply

* [PATCH 4/4] rtla/tests: Add runtime tests for restoring continue flag
From: Tomas Glozar @ 2026-05-26 10:25 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260526102523.2662391-1-tglozar@redhat.com>

If an action preceding the continue action fails, the continue flag must
not only be left unset, it must also be cleared if it was set by a
previous run of actions_perform().

Add a runtime test to both osnoise and timerlat tools that checks that
this works properly by creating a temporary file.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---

Depends on "rtla/tests: Extend runtime test coverage" patchset
- https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-1-tglozar@redhat.com/

 tools/tracing/rtla/tests/osnoise.t  | 2 ++
 tools/tracing/rtla/tests/timerlat.t | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/tracing/rtla/tests/osnoise.t b/tools/tracing/rtla/tests/osnoise.t
index 9c2f84a4187d..a7956ab605cd 100644
--- a/tools/tracing/rtla/tests/osnoise.t
+++ b/tools/tracing/rtla/tests/osnoise.t
@@ -65,6 +65,8 @@ check "top stop at failed action" \
 	"osnoise top -S 2 --on-threshold shell,command='echo -n abc; false' --on-threshold shell,command='echo -n defgh'" 2 "^abc" "defgh"
 check_top_q_hist "with continue" \
 	"osnoise TOOL -S 2 -d 5s --on-threshold shell,command='echo TestOutput' --on-threshold continue" 0 "^TestOutput$"
+check_top_q_hist "with conditional continue" \
+	"osnoise TOOL -S 2 --on-threshold shell,command='if [ -f a ]; then echo 2; exit 1; else echo -n 1; touch a; fi' --on-threshold continue" 2 "^12$" "^2$"
 check_top_hist "with trace output at end" \
 	"osnoise TOOL -d 1s --on-end trace" 0 "^  Saving trace to osnoise_trace.txt$"
 
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index f3e5f99e862b..19fd5af26ebb 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -94,6 +94,8 @@ check "top stop at failed action" \
 	"timerlat top -T 2 --on-threshold shell,command='echo -n abc; false' --on-threshold shell,command='echo -n defgh'" 2 "^abc" "defgh"
 check_top_q_hist "with continue" \
 	"timerlat TOOL -T 2 -d 5s --on-threshold shell,command='echo TestOutput' --on-threshold continue" 0 "^TestOutput$"
+check_top_q_hist "with conditional continue" \
+	"timerlat TOOL -T 2 --on-threshold shell,command='if [ -f a ]; then echo 2; exit 1; else echo -n 1; touch a; fi' --on-threshold continue" 2 "^12$" "^2$"
 check_top_hist "with trace output at end" \
 	"timerlat TOOL -d 1s --on-end trace" 0 "^  Saving trace to timerlat_trace.txt$"
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/4] rtla/tests: Run runtime tests in temporary directory
From: Tomas Glozar @ 2026-05-26 10:25 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260526102523.2662391-1-tglozar@redhat.com>

Create a temporary directory before each test case to serve as the working
directory for the duration of the test.

This keeps tests from littering the original working directory and gives
each test a private directory it can use without path conflicts.

In order not to break already existing tests, also add a new "testdir"
variable containing the directory where the test file is located. This
is then used to locate artifacts used during testing like BPF programs
and scripts for checking the tracer threads.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---

Depends on "rtla/tests: Extend runtime test coverage" patchset
- https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-1-tglozar@redhat.com/

 tools/tracing/rtla/tests/engine.sh  | 12 ++++++++++++
 tools/tracing/rtla/tests/osnoise.t  |  8 ++++----
 tools/tracing/rtla/tests/timerlat.t | 16 ++++++++--------
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/tracing/rtla/tests/engine.sh b/tools/tracing/rtla/tests/engine.sh
index 27d92f19a322..5bf8453d354d 100644
--- a/tools/tracing/rtla/tests/engine.sh
+++ b/tools/tracing/rtla/tests/engine.sh
@@ -4,6 +4,9 @@ test_begin() {
 	# Count tests to allow the test harness to double-check if all were
 	# included correctly.
 	ctr=0
+	# Set test directory to the directory of the script
+	scriptfile=$(realpath "$0")
+	testdir=$(dirname "$scriptfile")
 	[ -z "$RTLA" ] && RTLA="./rtla"
 	[ -n "$TEST_COUNT" ] && echo "1..$TEST_COUNT"
 }
@@ -51,6 +54,11 @@ check() {
 	then
 		# Reset osnoise options before running test.
 		[ "$NO_RESET_OSNOISE" == 1 ] || reset_osnoise
+
+		# Create a temporary directory to contain rtla output
+		tmpdir=$(mktemp -d)
+		pushd $tmpdir >/dev/null
+
 		# Run rtla; in case of failure, include its output as comment
 		# in the test results.
 		result=$(eval stdbuf -oL $TIMEOUT "$RTLA" $2 2>&1); exitcode=$?
@@ -82,6 +90,10 @@ check() {
 			echo "$result" | col -b | while read line; do echo "# $line"; done
 			printf "#\n# exit code %s\n" $exitcode
 		fi
+
+		# Remove temporary directory
+		popd >/dev/null
+		rm -r $tmpdir
 	fi
 }
 
diff --git a/tools/tracing/rtla/tests/osnoise.t b/tools/tracing/rtla/tests/osnoise.t
index 06787471d0e8..9c2f84a4187d 100644
--- a/tools/tracing/rtla/tests/osnoise.t
+++ b/tools/tracing/rtla/tests/osnoise.t
@@ -16,15 +16,15 @@ check_top_q_hist "verify the --trace param" \
 
 # Thread tests
 check_top_q_hist "verify the --priority/-P param" \
-	"osnoise TOOL -P F:1 -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=\"tests/scripts/check-priority.sh SCHED_FIFO 1\"" \
+	"osnoise TOOL -P F:1 -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=\"$testdir/scripts/check-priority.sh SCHED_FIFO 1\"" \
 	2 "Priorities are set correctly"
 check_top_q_hist "verify the -C/--cgroup param" \
-	"osnoise TOOL -C -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=\"tests/scripts/check-cgroup-match.sh\"" \
+	"osnoise TOOL -C -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=\"$testdir/scripts/check-cgroup-match.sh\"" \
 	2 "cgroup matches for all workload PIDs"
 check_top_q_hist "verify the -c/--cpus param" \
-	"osnoise TOOL -P F:1 -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=tests/scripts/check-cpus.sh" 2 "^Affinity of threads: 0$"
+	"osnoise TOOL -P F:1 -c 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=$testdir/scripts/check-cpus.sh" 2 "^Affinity of threads: 0$"
 check_top_q_hist "verify the -H/--house-keeping param" \
-	"osnoise TOOL -P F:1 -H 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=tests/scripts/check-housekeeping-cpus.sh" 2 "^Affinity of threads: 0$"
+	"osnoise TOOL -P F:1 -H 0 -r 900000 -d 10s -S 1 --on-threshold shell,command=$testdir/scripts/check-housekeeping-cpus.sh" 2 "^Affinity of threads: 0$"
 
 # Histogram tests
 check "hist with -b/--bucket-size" \
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index 3ebfe316b39e..f3e5f99e862b 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -41,19 +41,19 @@ check_top_hist "disable auto-analysis" \
 
 # Thread tests
 check_top_hist "verify -P/--priority" \
-	"timerlat TOOL -P F:1 -c 0 -d 10s -T 1 --on-threshold shell,command=\"tests/scripts/check-priority.sh SCHED_FIFO 1\"" \
+	"timerlat TOOL -P F:1 -c 0 -d 10s -T 1 --on-threshold shell,command=\"$testdir/scripts/check-priority.sh SCHED_FIFO 1\"" \
 	2 "Priorities are set correctly"
 check_top_hist "verify -C/--cgroup" \
-	"timerlat TOOL -k -C -c 0 -d 10s -T 1 --on-threshold shell,command=\"tests/scripts/check-cgroup-match.sh\"" \
+	"timerlat TOOL -k -C -c 0 -d 10s -T 1 --on-threshold shell,command=\"$testdir/scripts/check-cgroup-match.sh\"" \
 	2 "cgroup matches for all workload PIDs"
 check_top_q_hist "verify -c/--cpus" \
-	"timerlat TOOL -c 0 -d 10s -T 1 --on-threshold shell,command=tests/scripts/check-cpus.sh" 2 "^Affinity of threads: 0$"
+	"timerlat TOOL -c 0 -d 10s -T 1 --on-threshold shell,command=$testdir/scripts/check-cpus.sh" 2 "^Affinity of threads: 0$"
 check_top_q_hist "verify -H/--house-keeping" \
-	"timerlat TOOL -H 0 -d 10s -T 1 --on-threshold shell,command=tests/scripts/check-housekeeping-cpus.sh" 2 "^Affinity of threads: 0$"
+	"timerlat TOOL -H 0 -d 10s -T 1 --on-threshold shell,command=$testdir/scripts/check-housekeeping-cpus.sh" 2 "^Affinity of threads: 0$"
 check_top_q_hist "verify -k/--kernel-threads" \
-	"timerlat TOOL -k -c 0 -d 10s -T 1 --on-threshold shell,command=tests/scripts/check-user-kernel-threads.sh" 2 "1 kernel threads, 0 user threads"
+	"timerlat TOOL -k -c 0 -d 10s -T 1 --on-threshold shell,command=$testdir/scripts/check-user-kernel-threads.sh" 2 "1 kernel threads, 0 user threads"
 check_top_q_hist "verify -u/--user-threads" \
-	"timerlat TOOL -u -c 0 -d 10s -T 1 --on-threshold shell,command=tests/scripts/check-user-kernel-threads.sh" 2 "0 kernel threads, 1 user threads"
+	"timerlat TOOL -u -c 0 -d 10s -T 1 --on-threshold shell,command=$testdir/scripts/check-user-kernel-threads.sh" 2 "0 kernel threads, 1 user threads"
 
 # Histogram tests
 check "hist with -b/--bucket-size" \
@@ -103,12 +103,12 @@ then
 	# Test BPF action program properly in BPF mode
 	[ -z "$BPFTOOL" ] && BPFTOOL=bpftool
 	check_top_q_hist "with BPF action program (BPF mode)" \
-		"timerlat TOOL -T 2 --bpf-action tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
+		"timerlat TOOL -T 2 --bpf-action $testdir/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
 		2 '"value": 42'
 else
 	# Test BPF action program failure in non-BPF mode
 	check_top_q_hist "with BPF action program (non-BPF mode)" \
-		"timerlat TOOL -T 2 --bpf-action tests/bpf/bpf_action_map.o" \
+		"timerlat TOOL -T 2 --bpf-action $testdir/bpf/bpf_action_map.o" \
 		1 "BPF actions are not supported in tracefs-only mode"
 fi
 done
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/4] rtla/tests: Add unit test for restoring continue flag
From: Tomas Glozar @ 2026-05-26 10:25 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260526102523.2662391-1-tglozar@redhat.com>

If an action preceding the continue action fails, the continue flag must
not only be left unset, it must also be cleared if it was set by a
previous run of actions_perform().

Add a unit test to check if this is implemented correctly.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---

Depends on "rtla/tests: Add unit tests for actions module"
- https://lore.kernel.org/linux-trace-kernel/20260424140244.958495-1-tglozar@redhat.com/

 tools/tracing/rtla/tests/unit/actions.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/tools/tracing/rtla/tests/unit/actions.c b/tools/tracing/rtla/tests/unit/actions.c
index a5808ab71a4d..94ad5ad42774 100644
--- a/tools/tracing/rtla/tests/unit/actions.c
+++ b/tools/tracing/rtla/tests/unit/actions.c
@@ -328,6 +328,18 @@ START_TEST(test_actions_perform_continue_after_failed_shell_command)
 }
 END_TEST
 
+START_TEST(test_actions_perform_continue_unset_flag)
+{
+	actions_fixture.continue_flag = true;
+
+	actions_add_shell(&actions_fixture, "exit 1");
+	actions_add_continue(&actions_fixture);
+	ck_assert_int_eq(actions_perform(&actions_fixture), 1 << 8);
+
+	ck_assert(!actions_fixture.continue_flag);
+}
+END_TEST
+
 Suite *actions_suite(void)
 {
 	Suite *s = suite_create("actions");
@@ -374,6 +386,7 @@ Suite *actions_suite(void)
 	tcase_add_test(tc, test_actions_perform_continue);
 	tcase_add_test(tc, test_actions_perform_continue_after_successful_shell_command);
 	tcase_add_test(tc, test_actions_perform_continue_after_failed_shell_command);
+	tcase_add_test(tc, test_actions_perform_continue_unset_flag);
 	suite_add_tcase(s, tc);
 
 	return s;
-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/4] rtla/actions: Restore continue flag in actions_perform()
From: Tomas Glozar @ 2026-05-26 10:25 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

Currently, actions_perform() only ever sets the continue flag (when
performing the continue action), but never resets it. That leads to
RTLA continuing tracing even if the continue action was not performed in
the current iteration.

For example, the following command:

$ rtla timerlat hist -T 100 --on-threshold shell,command='
    echo Spike!
    if [ -f /tmp/a ]
    then
      exit 1
    else
      touch /tmp/a
    fi' --on-threshold continue

should print Spike! at most once, because after hitting the threshold
for the first time, /tmp/a exists, the shell action will fail, and the
continue action is not performed. However, unless /tmp/a exists before
the measurement, it will print Spike! until stopped, as the continue
flag stays set.

Set the continue flag to false at the beginning of actions_perform() to
make RTLA continue only if the continue action was actually performed.
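
The behavior can be modeled outside of RTLA with a small sketch. This is a
simplified stand-in, not the code in src/actions.c: the struct layout, the
model_actions_perform() name and the shell_status field (which replaces
actually running a shell command) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum action_type { ACTION_SHELL, ACTION_CONTINUE };

struct action {
	enum action_type type;
	int shell_status;	/* hypothetical: status the command would return */
};

struct actions {
	struct action list[8];
	size_t len;
	bool continue_flag;
};

/* Simplified model of the fixed logic. */
static int model_actions_perform(struct actions *self)
{
	/* Forget the result of any previous run first. */
	self->continue_flag = false;

	for (size_t i = 0; i < self->len; i++) {
		const struct action *action = &self->list[i];

		switch (action->type) {
		case ACTION_SHELL:
			/* A failing action stops the chain early. */
			if (action->shell_status)
				return action->shell_status;
			break;
		case ACTION_CONTINUE:
			/* Only reached if all earlier actions succeeded. */
			self->continue_flag = true;
			return 0;
		}
	}
	return 0;
}
```

Without the reset at the top, a flag left set by an earlier run would
survive a later run in which the shell action fails before the continue
action is reached.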

Fixes: 8d933d5c89e ("rtla/timerlat: Add continue action")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/actions.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/tracing/rtla/src/actions.c b/tools/tracing/rtla/src/actions.c
index b0d68b5de08d..bf13d9d68f16 100644
--- a/tools/tracing/rtla/src/actions.c
+++ b/tools/tracing/rtla/src/actions.c
@@ -247,6 +247,8 @@ actions_perform(struct actions *self)
 	int pid, retval;
 	const struct action *action;

+	self->continue_flag = false;
+
 	for_each_action(self, action) {
 		switch (action->type) {
 		case ACTION_TRACE_OUTPUT:
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH] tracing: Disable KCOV instrumentation for trace_irqsoff.o
From: Karl Mehltretter @ 2026-05-26 10:22 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Mathieu Desnoyers, Dmitry Vyukov,
	Andrey Konovalov, Marco Elver, kasan-dev, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260526150758.4e0f37745d688f95a1c710d8@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3082 bytes --]

On Tue, May 26, 2026 at 03:07:58PM +0100, Masami Hiramatsu wrote:
> Thanks for reporting. This looks good to me for a mitigation.
> BTW, I could not reproduce the bug with above configs.
> Is this only for arm32?

I was able to reproduce this on arm64 QEMU virt with the attached
config and log.

Test base:
  4cbfe4502e3d ("Merge tag 'v7.1-rc5-ksmbd-server-fixes' ...")

QEMU command:
  qemu-system-aarch64 \
    -machine virt,gic-version=2 -cpu cortex-a57 -m 512M -smp 1 \
    -kernel arch/arm64/boot/Image \
    -append "console=ttyAMA0,115200 earlycon=pl011,0x9000000 rdinit=/init panic_on_warn=0 oops=panic loglevel=8 printk.time=1" \
    -nographic -no-reboot

Relevant config options:
  CONFIG_TRACE_IRQFLAGS=y
  CONFIG_IRQSOFF_TRACER=y
  CONFIG_KCOV=y
  CONFIG_KCOV_INSTRUMENT_ALL=y
  CONFIG_KCOV_SELFTEST=y

The raw arm64 crash first runs into other KCOV-instrumented early
IRQ/stack helpers. To isolate the trace_irqsoff.o part, I used the
following additional changes. This is not intended for merge:

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 74b76bb70452..d69eb3fd0577 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -24,6 +24,9 @@ KASAN_SANITIZE_stacktrace.o := n
 # inhibit KCOV instrumentation, disable it for the entire compilation unit.
 KCOV_INSTRUMENT_entry-common.o := n
 KCOV_INSTRUMENT_idle.o := n
+KCOV_INSTRUMENT_irq.o := n
+KCOV_INSTRUMENT_return_address.o := n
+KCOV_INSTRUMENT_stacktrace.o := n

 # Object file lists.
 obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index eaf290c972f9..2641a44f6339 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -21,6 +21,7 @@ ifeq ($(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST),y)
  obj-$(CONFIG_TICK_ONESHOT)			+= tick-broadcast-hrtimer.o
 endif
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
+KCOV_INSTRUMENT_sched_clock.o := n
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o tick-sched.o
 obj-$(CONFIG_LEGACY_TIMER_TICK)			+= tick-legacy.o
 ifeq ($(CONFIG_SMP),y)

With these changes, but with trace_irqsoff.o still instrumented,
the kernel still crashes during the KCOV selftest:

  kcov: running self test
  pc : __sanitizer_cov_trace_pc+0x64/0x84
  Kernel panic - not syncing: kernel stack overflow
  ...
  tracer_hardirqs_off+0x1c/0x78
  trace_hardirqs_off.part.0+0x70/0x1a0
  trace_hardirqs_off_finish+0x60/0x6c
  arm64_enter_from_kernel_mode.isra.0+0x18/0x38
  el1_interrupt+0x24/0x58
  el1h_64_irq+0x6c/0x70
  kcov_init+0xc8/0x118

Then adding the line from my original ARMv5 mitigation makes the arm64
kernel boot through the KCOV selftest:

  KCOV_INSTRUMENT_trace_irqsoff.o := n

The boot log then shows:

  kcov: running self test
  kcov: done running self test
  tiny-init: reached userspace

So arm64 also confirms that trace_irqsoff.o is reachable from this early
IRQ entry path while KCOV selftest mode is active.

Arm64 appears to have additional KCOV/early-entry paths with this config,
which probably need to be investigated independently.

Regards,
Karl

[-- Attachment #2: arm64-kcov.config.gz --]
[-- Type: application/x-gunzip, Size: 11591 bytes --]

[-- Attachment #3: arm64-kcov-trace-irqsoff-crash.log.gz --]
[-- Type: application/x-gunzip, Size: 4458 bytes --]

^ permalink raw reply related

* Re: [PATCHv3 04/12] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-26 10:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Andrii Nakryiko, Oleg Nesterov, Ingo Molnar,
	Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260526091944.GB4149641@noisy.programming.kicks-ass.net>

On Tue, May 26, 2026 at 11:19:44AM +0200, Peter Zijlstra wrote:
> On Fri, May 22, 2026 at 11:19:17PM +0200, Jiri Olsa wrote:
> > On Fri, May 22, 2026 at 11:50:44AM -0700, Andrii Nakryiko wrote:
> > > On Thu, May 21, 2026 at 5:44 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > > redzone area with call instruction storing return address on stack
> > > > where user code may keep temporary data without adjusting rsp.
> > > >
> > > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > > instruction, so we can squeeze another instruction to escape the
> > > > redzone area before doing the call, like:
> > > >
> > > >   lea -0x80(%rsp), %rsp
> > > >   call tramp
> > > >
> > > > Note the lea instruction is used to adjust the rsp register without
> > > > changing the flags.
> > > >
> > > > We use nop10 and the following transformation to the optimized
> > > > instructions above and back, as suggested by Peterz [2].
> > > >
> > > > Optimize path (int3_update_optimize):
> > > >
> > > >   1) Initial state after set_swbp() installed the uprobe:
> > > >       cc 2e 0f 1f 84 00 00 00 00 00
> > > >
> > > >      From offset 0 this is INT3 followed by the tail of the original
> > > >      10-byte NOP.
> > > >
> > > >   2) Trap the call slot before rewriting the NOP tail:
> > > >       cc 2e 0f 1f 84 [cc] 00 00 00 00
> > > >
> > > >      From offset 0 this traps on the uprobe INT3.  A thread reaching
> > > >      offset 5 traps on the temporary INT3 instead of seeing a partially
> > > >      patched call.
> > > >
> > > >   3) Rewrite the LEA tail and call displacement, keeping both INT3 bytes:
> > > >       cc [8d 64 24 80] cc [d0 d1 d2 d3]
> > > >
> > > >      From offset 0 and offset 5 this still traps.  The bytes between
> > > >      them are not executable entry points while both traps are in place.
> > > >
> > > >   4) Restore the call opcode at offset 5:
> > > >       cc 8d 64 24 80 [e8] d0 d1 d2 d3
> > > >
> > > >      From offset 0 this still traps.  From offset 5 the instruction is
> > > >      the final CALL to the uprobe trampoline.
> > > >
> > > 
> > > I'm sorry if I'm slow, but I don't understand why we need that second
> > > cc at offset 5? Isn't original nop10 processed by CPU as single
> > > instruction? So it will either be at ip of nop10, or at ip+10, no? If
> > > we trap at ip and in int3 handler +10 from there while we are
> > > installing lea+call, why do we need cc on byte 5?
> > > 
> > > I.e., I don't understand how CPU can end up being at ip+5 until we
> > > finalize lea+call sequence? Can it?
> > 
> > hum, so I thought it's for the case when you do unoptimize+optimize,
> > then you can have cpu executing the previous lea and hitting the int3
> > on +5 offset.. but as pointed by Peter (and you) the call instruction
> > never changes, so now I'm not sure why we need it
> 
> So I missed you did the second INT3 in my initial reading.
> 
> That second INT3 is absolutely required *IF* the CALL can ever change.
> However Andrii pointed out that once the CALL is written, it will always
> be the same CALL -- there is but the one trampoline, it doesn't move.
> 
> Therefore, the second INT3 is not strictly required.
> 
> Does this clarify?

yes, will change that in next version

thanks,
jirka

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Mykyta Yatsenko @ 2026-05-26 10:07 UTC (permalink / raw)
  To: bpf, rostedt
  Cc: Mykyta Yatsenko, linux-trace-kernel, Andrii Nakryiko,
	Alexei Starovoitov
In-Reply-To: <20260518-generic_tracepoint-v2-2-b755a5cf67bb@meta.com>

Hi Steven,

Gentle ping on this patch from the series.

Since this part touches tracing, I’d appreciate your thoughts on the
tracing changes whenever you have a chance.

Thanks,
Mykyta

^ permalink raw reply

* [PATCH] tracing: Use kstrdup_const() for constant hist field type
From: Yu Peng @ 2026-05-26  9:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel,
	linux-kernel, Yu Peng

The HIST_FIELD_FL_CONST path duplicates the literal "u64" type with
kstrdup(), then releases it through kfree_const().

Use kstrdup_const() instead, avoiding the allocation for a .rodata string
while keeping the matching free helper.

Signed-off-by: Yu Peng <pengyu@kylinos.cn>
---
 kernel/trace/trace_events_hist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index eb2c2bc8bc3d..6ffe9f4720a0 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1992,7 +1992,7 @@ static struct hist_field *create_hist_field(struct hist_trigger_data *hist_data,
 	if (flags & HIST_FIELD_FL_CONST) {
 		hist_field->fn_num = HIST_FIELD_FN_CONST;
 		hist_field->size = sizeof(u64);
-		hist_field->type = kstrdup("u64", GFP_KERNEL);
+		hist_field->type = kstrdup_const("u64", GFP_KERNEL);
 		if (!hist_field->type)
 			goto free;
 		goto out;
-- 
2.43.0

^ permalink raw reply related
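
The effect of the change above can be modelled in userspace. The sketch
below is only an illustration under stated assumptions: model_is_rodata(),
model_strdup_const() and model_free_const() are hypothetical stand-ins, and
the "read-only" check is reduced to a pointer comparison against one known
constant, which is not how the kernel decides whether a string lives in
.rodata.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant standing in for a string literal in .rodata. */
static const char type_u64[] = "u64";

/* Assumption: "is this pointer read-only data?" is modelled as a
 * comparison against the single constant above. */
static int model_is_rodata(const char *s)
{
	return s == type_u64;
}

/* Return the constant itself when it is read-only, otherwise a heap copy. */
static const char *model_strdup_const(const char *s)
{
	size_t len;
	char *copy;

	if (model_is_rodata(s))
		return s;	/* no allocation needed */

	len = strlen(s) + 1;
	copy = malloc(len);
	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Matching free helper: only heap copies are released. */
static void model_free_const(const char *s)
{
	if (!model_is_rodata(s))
		free((void *)s);
}
```

With this pairing, duplicating the constant costs no allocation and cannot
fail, while the matching free helper stays safe to call on either kind of
pointer.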

* [PATCH] tracing: Fix field_var_str allocation errno
From: Yu Peng @ 2026-05-26  9:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Tom Zanussi,
	linux-trace-kernel, linux-kernel, Yu Peng

hist_trigger_elt_data_alloc() returns -EINVAL when the field_var_str
kcalloc() fails. Return -ENOMEM instead, matching the other allocation
failures in the function.

Fixes: c910db943d35 ("tracing: Dynamically allocate the per-elt hist_elt_data array")
Signed-off-by: Yu Peng <pengyu@kylinos.cn>
---
 kernel/trace/trace_events_hist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index eb2c2bc8bc3d..17fe13e12a4f 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1680,7 +1680,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
 	elt_data->field_var_str = kcalloc(n_str, sizeof(char *), GFP_KERNEL);
 	if (!elt_data->field_var_str) {
 		hist_elt_data_free(elt_data);
-		return -EINVAL;
+		return -ENOMEM;
 	}
 	elt_data->n_field_var_str = n_str;
 
-- 
2.43.0

^ permalink raw reply related
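
As a rough userspace illustration of why the errno matters, the sketch
below returns -ENOMEM when an allocation fails so that a caller can tell
memory pressure apart from bad input. The struct, the function name and the
injectable allocator are hypothetical and exist only to make the failure
path easy to exercise; they are not taken from trace_events_hist.c.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down model of the per-element data. */
struct elt_data_model {
	char **field_var_str;
	unsigned int n_field_var_str;
};

/* The allocator is passed in (an assumption of this sketch) so that a
 * failing one can be substituted. */
static int elt_data_model_alloc(struct elt_data_model *d, size_t n_str,
				void *(*alloc)(size_t, size_t))
{
	d->field_var_str = alloc(n_str, sizeof(char *));
	if (!d->field_var_str)
		return -ENOMEM;	/* allocation failure, not invalid input */

	d->n_field_var_str = (unsigned int)n_str;
	return 0;
}

/* Allocator that always fails, used to reach the error path. */
static void *failing_calloc(size_t n, size_t size)
{
	(void)n;
	(void)size;
	return NULL;
}
```

Passing calloc exercises the success path and passing failing_calloc
exercises the error path.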

* Re: [PATCHv3 04/12] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Peter Zijlstra @ 2026-05-26  9:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, Oleg Nesterov, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <ahDIVTM5WfVqiYE6@krava>

On Fri, May 22, 2026 at 11:19:17PM +0200, Jiri Olsa wrote:
> On Fri, May 22, 2026 at 11:50:44AM -0700, Andrii Nakryiko wrote:
> > On Thu, May 21, 2026 at 5:44 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > redzone area with call instruction storing return address on stack
> > > where user code may keep temporary data without adjusting rsp.
> > >
> > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > instruction, so we can squeeze another instruction to escape the
> > > redzone area before doing the call, like:
> > >
> > >   lea -0x80(%rsp), %rsp
> > >   call tramp
> > >
> > > Note the lea instruction is used to adjust the rsp register without
> > > changing the flags.
> > >
> > > We use nop10 and the following transformation to the optimized
> > > instructions above and back, as suggested by Peterz [2].
> > >
> > > Optimize path (int3_update_optimize):
> > >
> > >   1) Initial state after set_swbp() installed the uprobe:
> > >       cc 2e 0f 1f 84 00 00 00 00 00
> > >
> > >      From offset 0 this is INT3 followed by the tail of the original
> > >      10-byte NOP.
> > >
> > >   2) Trap the call slot before rewriting the NOP tail:
> > >       cc 2e 0f 1f 84 [cc] 00 00 00 00
> > >
> > >      From offset 0 this traps on the uprobe INT3.  A thread reaching
> > >      offset 5 traps on the temporary INT3 instead of seeing a partially
> > >      patched call.
> > >
> > >   3) Rewrite the LEA tail and call displacement, keeping both INT3 bytes:
> > >       cc [8d 64 24 80] cc [d0 d1 d2 d3]
> > >
> > >      From offset 0 and offset 5 this still traps.  The bytes between
> > >      them are not executable entry points while both traps are in place.
> > >
> > >   4) Restore the call opcode at offset 5:
> > >       cc 8d 64 24 80 [e8] d0 d1 d2 d3
> > >
> > >      From offset 0 this still traps.  From offset 5 the instruction is
> > >      the final CALL to the uprobe trampoline.
> > >
> > 
> > I'm sorry if I'm slow, but I don't understand why we need that second
> > cc at offset 5? Isn't original nop10 processed by CPU as single
> > instruction? So it will either be at ip of nop10, or at ip+10, no? If
> > we trap at ip and in int3 handler +10 from there while we are
> > installing lea+call, why do we need cc on byte 5?
> > 
> > I.e., I don't understand how CPU can end up being at ip+5 until we
> > finalize lea+call sequence? Can it?
> 
> hum, so I thought it's for the case when you do unoptimize+optimize,
> then you can have cpu executing the previous lea and hitting the int3
> on +5 offset.. but as pointed out by Peter (and you) the call instruction
> never changes, so now I'm not sure why we need it

So I missed you did the second INT3 in my initial reading.

That second INT3 is absolutely required *IF* the CALL can ever change.
However Andrii pointed out that once the CALL is written, it will always
be the same CALL -- there is but the one trampoline, it doesn't move.

Therefore, the second INT3 is not strictly required.

Does this clarify?

^ permalink raw reply
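
The byte-level order of the four quoted optimize steps can be modelled on a
plain 10-byte buffer. This is only a sketch of the sequence as quoted in
this thread: the helper names are hypothetical, d0..d3 are the placeholder
displacement bytes from the description, and it leaves out whatever
cross-CPU synchronization the real code performs between steps. It also
keeps the temporary INT3 at offset 5, which the thread concludes is not
strictly required.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_LEN	10
#define INT3_BYTE	0xcc
#define CALL_OPCODE	0xe8

/* Step 1: state after the uprobe INT3 replaced the first NOP byte. */
static const uint8_t slot_after_swbp[SLOT_LEN] = {
	0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Step 2: trap the call slot at offset 5 before touching the NOP tail. */
static void trap_call_slot(uint8_t *slot)
{
	slot[5] = INT3_BYTE;
}

/* Step 3: write the LEA tail (offsets 1..4) and the call displacement
 * (offsets 6..9); both INT3 bytes stay in place. */
static void write_tails(uint8_t *slot, const uint8_t *lea_tail,
			const uint8_t *disp)
{
	memcpy(&slot[1], lea_tail, 4);
	memcpy(&slot[6], disp, 4);
}

/* Step 4: replace the temporary INT3 at offset 5 with the CALL opcode. */
static void restore_call_opcode(uint8_t *slot)
{
	slot[5] = CALL_OPCODE;
}
```

Copying slot_after_swbp into a scratch buffer and calling the three helpers
in order reproduces the byte rows shown for steps 2 to 4.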

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lorenzo Stoakes @ 2026-05-26  8:14 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, akpm, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <ahCFIDuyrvEfB9jv@lucifer>

Nico,

While I stand by the below, and we very well might wish to delay this until
the next cycle, I will try to take some time to go through this myself as
soon as I am able.

If David's happy with it for this cycle, and I don't find anything too
crazy, then it's not impossible we could still move forward with it now.

My only aim here is to avoid rushing something in that might have
unexpected changes or issues in it, given how late in the cycle we are :)

Cheers, Lorenzo

On Fri, May 22, 2026 at 06:12:59PM +0100, Lorenzo Stoakes wrote:
> On Fri, May 22, 2026 at 10:31:41AM -0600, Nico Pache wrote:
> > On Fri, May 22, 2026 at 10:20 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > There's some kind of confusion here.
> > >
> > > This series isn't suited for 7.2.
> > >
> > > Sorry but Zi's series, unless it depends on functionality here, will have
> > > to be rebased.
> > >
> > > People have been at conferences, people have been on leave, I've had to
> > > pace myself for health reasons and it seems there's been more than simply
> > > review comment-based changes happening here.
> > >
> > > (Again I strongly encourage, at this stage, to ONLY be making changes based
> > > on review, not adding ANYTHING else or changing ANYTHING else to avoid
> > > delays :)
> >
> > All the changes are based on review points. Very small changes in this
> > version; the largest being the one that you specifically agreed to.
>
> 16->17
>
>  Documentation/admin-guide/mm/transhuge.rst |  24 +++++-------------
>  include/linux/khugepaged.h                 |   7 ++---
>  include/trace/events/huge_memory.h         |   3 ++-
>  mm/huge_memory.c                           |   2 +-
>  mm/khugepaged.c                            | 168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------------------
>  mm/vma.c                                   |   6 ++---
>  tools/testing/vma/include/stubs.h          |   3 ++-
>  7 files changed, 103 insertions(+), 110 deletions(-)
>
> 17->18
>
>  Documentation/admin-guide/mm/transhuge.rst |   5 +++--
>  include/trace/events/huge_memory.h         |   3 +--
>  mm/khugepaged.c                            | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
>  3 files changed, 66 insertions(+), 63 deletions(-)
>
> These are not 'very small changes'.
>
> We're nearly at rc-5, and this is a major, invasive, dangerous change that
> we have to get right.
>
> You've also made changes unrelated to review, repeatedly, throughout this
> process, which as I've told you, is causing delays.
>
> You've also throughout the review of this series done stuff like make MAJOR
> changes to things and _kept review tags_.
>
> You're forcing us to use git range-diff etc. to forensically check that the
> series is what is claimed.
>
> Dude I mean you switched to using // comment style which is not used in mm
> anywhere for instance? Don't do things like that and complain about
> delays. Honestly.
>
> Also, again, LSF happened. Other conferences happened. Bandwidth is
> reduced.
>
> So again, I'm sorry, but you've been hit with some bad luck here.
>
> I really wanted this in for 7.2, and I feel bad that we couldn't make it,
> but you're also doing things that make it difficult for us.
>
> I've spent double-digit hours on your series, and I've also had work
> pushed out because of that, leading me to work evenings and weekends as a
> result.
>
> And I'm not even going to get any credit for it :))
>
> So while I sympathise, really, please have empathy and realise it goes both
> ways, please.
>
> I'm not being mean for the sake of it, I'm pushing back because I feel this
> is not at a stage where I'd feel confident in this being merged at this
> time.
>
> And it's very much a regret, as I _really_ wanted us to have it in this
> time. But life and circumstances and the issues mentioned above have
> intervened, sadly.
>
> >
> > >
> > > Also - shouldn't mm-unstable already have mm-hotfixes-unstable in it?
> > >
> > > I think in mm-next we will have a stable branch that everything is
> > > based on, where things go once review is complete and things are mergeable.
> > >
> > > And a separate hotfixes branch based on Linus's tree.
> > >
> > > That would avoid issues like this :)
> >
> > I'm sorry, I'm new to this, but I really don't think this tiny error, and
> > something that I'd confirmed with Andrew beforehand, deserves NAKing
> > and deferring it. I've worked through my PTO to clean up some of these
> > review nits just to get it in 7.2. I even put this through my
> > rounds of testing today before resending.
>
> The issue wasn't the error (though it wasn't tiny...!), it's the state of
> review. There were fresh review comments from a few days ago, and there are
> big diffs between revisions.
>
> You've also made unrelated changes as you have done throughout the series.
>
> As I said above, I'm sorry that you spent time in your PTO on this, but we
> cannot rush this in when things are not clearly ready yet, and I am not
> confident in this being ready at this stage.
>
> >
> > >
> > > >
> > > > The intent wasn't that this is a hotfix, just that this was the
> > > > closest base before the v17 that is already in the tree.
> > >
> > > The convention is that [PATCH ... <branch>] indicates the target of the
> > > changes. Putting the hotfixes branch there implies it's a hotfix.
> >
> > Sorry I thought the <branch> was what base you used.
>
> I mean, sure there's clearly confusion here as you sent [PATCH 7.2 v16 ...]
> (against an unreleased kernel version) then a branch specifier then the
> hotfixes one...
>
> Anyway sure, it's fine, I've made vastly more dumb mistakes than that
> myself, nobody minds, but it's concerning as by convention [PATCH
> ... <mm->hotfixes<whatever>] generally is taken to mean 'please rush this
> to hotfixes!' :)
>
> So be careful with that please!
>
> >
> > >
> > > So please be careful with that in future :)
> >
> > Yes will do for sure.
>
> Thanks!
>
> >
> > >
> > > >
> > > > Sorry for the confusion, hopefully Andrew can still apply it to the
> > > > correct tree.
> > >
> > > I'm not even sure what's best for that at this stage given we have
> > > conflicts and this has to be delayed until 7.3.
> > >
> > > I wonder if given that we should not have this in mm-unstable at all and
> > > just wait it out until the next cycle begins? Review can happen
> > > concurrently.
> >
> > I still don't see why this has to be deferred, I was working with
> > Andrew to prevent merge headaches.
>
> I've explained the why above, and David and I co-maintain THP so I feel
> that ultimately given the blood, sweat and tears we've put into THP review
> we ought to have some input on this :)
>
> Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-26  6:57 UTC (permalink / raw)
  To: Nico Pache, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260525121041.2f2508a4f627c338cddd837a@linux-foundation.org>

On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
>On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> Can you please append the following fixup that reverts one of the
>> changes requested in V17. The issue with the change is described
>> below.
>
>OK.  fyi, what I received was badly mangled: wordwrapping, tabs messed
>up, etc.
>
>Here's my reconstruction:
>

Hi, Nico

I tried to reply to your mail, but found it has some encoding problem, so I am
replying here.

>
>Author: Nico Pache <npache@redhat.com>
>Subject: fix potential use-after-free of vma in mthp_collapse()
>Date: Mon May 25 07:38:59 2026 -0600
>
>Between V17 and v18, one reviewer (Wei) brought up that we are not doing
>the uffd-armed check until deep in the collapse operation.  While not
>functionally incorrect, it can lead to unnecessary work.

So we decide to tolerate the behavioral change?

>
>We optimized this by passing the vma variable to mthp_collapse() and using
>the collapse_max_ptes_none() function to check the state of uffd-armed
>preventing the wasted work later in the collapse.
>
>mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can
>become stale.  Remove the vma parameter and pass NULL to
>collapse_max_ptes_none() instead.
>
>Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com
>Signed-off-by: Nico Pache <npache@redhat.com>
>...
>
> mm/khugepaged.c |   10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
>--- a/mm/khugepaged.c~mm-khugepaged-introduce-mthp-collapse-support-fix
>+++ a/mm/khugepaged.c
>@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_
>  * If a collapse is permitted, we attempt to collapse the PTE range into a
>  * mTHP.
>  */
>-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>-		unsigned long address, int referenced, int unmapped,
>-		struct collapse_control *cc, unsigned long enabled_orders)
>+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned long enabled_orders)
> {
> 	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> 	int collapsed = 0, stack_size = 0;
>@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struc
> 		if (!test_bit(order, &enabled_orders))
> 			goto next_order;
> 
>-		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> 
> 		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> 							       nr_ptes);
>@@ -1749,7 +1749,7 @@ out_unmap:
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
>+		nr_collapsed = mthp_collapse(mm, start_addr, referenced,
> 					     unmapped, cc, enabled_orders);
> 		/* mmap_lock was released above, set lock_dropped */
> 		*lock_dropped = true;
>_

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH] tracing: Disable KCOV instrumentation for trace_irqsoff.o
From: Masami Hiramatsu @ 2026-05-26  6:07 UTC (permalink / raw)
  To: Karl Mehltretter
  Cc: Steven Rostedt, Mathieu Desnoyers, Dmitry Vyukov,
	Andrey Konovalov, Marco Elver, kasan-dev, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260525170428.67211-1-kmehltretter@gmail.com>

On Mon, 25 May 2026 19:04:28 +0200
Karl Mehltretter <kmehltretter@gmail.com> wrote:

> When KCOV runs its boot selftest with whole-kernel instrumentation
> enabled, it sets current->kcov_mode to KCOV_MODE_TRACE_PC without
> installing a coverage area. Any instrumented code accepted as task-context
> coverage in that window dereferences current->kcov_area and crashes.
> 
> On ARMv5 Versatile PB with CONFIG_KCOV_SELFTEST=y,
> CONFIG_KCOV_INSTRUMENT_ALL=y and CONFIG_IRQSOFF_TRACER=y, boot hits a
> NULL pointer fault during the selftest:
> 
>   kcov: running self test
>   Internal error: Oops: 5 [#1] ARM
>   PC is at __sanitizer_cov_trace_pc+0x4c/0x90
>   Kernel panic - not syncing: Fatal exception
> 
> A diagnostic run showed the unwanted coverage comes from the IRQs-off
> tracer callbacks reached from ARM IRQ entry before hardirq context is
> visible to KCOV:
> 
>   __sanitizer_cov_trace_pc from tracer_hardirqs_off+0x18/0x1cc
>   tracer_hardirqs_off from trace_hardirqs_off+0x34/0x54
>   trace_hardirqs_off from __irq_svc+0x58/0xb0
>   __irq_svc from kcov_init+0x7c/0xdc
> 
> and similarly through tracer_hardirqs_on().
> 
> trace_preemptirq.o is already excluded because this tracing path can run
> from early interrupt code and produce coverage unrelated to syscall
> inputs. Exclude trace_irqsoff.o as well, instead of requiring users to
> turn off CONFIG_KCOV_INSTRUMENT_ALL=y, which is the default whole-kernel
> KCOV mode.
> 
> With the exclusion in place, the same ARMv5 Versatile PB QEMU test boots
> through the KCOV selftest and reaches userspace.
> 
> Tested on ARMv5 Versatile PB QEMU with CONFIG_KCOV_SELFTEST=y,
> CONFIG_KCOV_INSTRUMENT_ALL=y and CONFIG_IRQSOFF_TRACER=y.


Thanks for reporting. This looks good to me as a mitigation.
BTW, I could not reproduce the bug with the above configs.
Is this only for arm32?


> 
> Assisted-by: Codex:gpt-5
> Signed-off-by: Karl Mehltretter <kmehltretter@gmail.com>
> ---
>  kernel/trace/Makefile | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 8d3d96e847d8..f934ff586bd4 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -48,9 +48,10 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
>  GCOV_PROFILE := y
>  endif
>  
> -# Functions in this file could be invoked from early interrupt
> -# code and produce random code coverage.
> +# Functions in these files can run from IRQ entry before hardirq context
> +# is visible to KCOV, and produce coverage unrelated to syscall inputs.
>  KCOV_INSTRUMENT_trace_preemptirq.o := n
> +KCOV_INSTRUMENT_trace_irqsoff.o := n
>  
>  CFLAGS_bpf_trace.o := -I$(src)
>  
> -- 
> 2.39.5 (Apple Git-154)
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v21 9/9] ring-buffer: Show persistent buffer dropped events in trace_pipe file
From: Masami Hiramatsu @ 2026-05-26  6:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Ian Rogers
In-Reply-To: <20260522171052.156419479@kernel.org>

On Fri, 22 May 2026 13:09:06 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer is validated on boot up, if a subbuffer is
> deemed invalid, it resets the buffer and continues. Have the code preserve
> the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header
> and pass that back so that the trace_pipe file can show the missed events
> like the trace file does.
> 
> For example:
> 
>    <...>-1242    [005] d....  4429.120116: page_fault_user: address=0x7ffaebb6e728 ip=0x7ffaeb9d4960 error_code=0x7
>    <...>-1242    [005] .....  4429.120124: mm_page_alloc: page=00000000055254f3 pfn=0x1373bd order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
>    <...>-1242    [005] d..2.  4429.120132: tlb_flush: pages:1 reason:local MM shootdown (3)
> CPU:5 [LOST EVENTS]
>    <...>-1242    [005] d....  4429.120661: page_fault_user: address=0x55ba7c2d0944 ip=0x55ba7c20cd02 error_code=0x7
>    <...>-1242    [005] .....  4429.120669: mm_page_alloc: page=0000000005a02500 pfn=0x12b6e4 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
>    <...>-1242    [005] d..2.  4429.120680: tlb_flush: pages:1 reason:local MM shootdown (3)
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> Changes since v20: https://patch.msgid.link/20260520185018.470465795@kernel.org
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> - Removed left over printk() (Masami Hiramatsu)
> 
>  kernel/trace/ring_buffer.c | 56 +++++++++++++++++++++++---------------
>  1 file changed, 34 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 988915f035c7..910f6b3adf74 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5801,6 +5801,7 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	struct buffer_page *reader = NULL;
>  	unsigned long overwrite;
>  	unsigned long flags;
> +	int missed_events = 0;
>  	int nr_loops = 0;
>  	bool ret;
>  
> @@ -5901,6 +5902,9 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	if (!ret)
>  		goto spin;
>  
> +	if (rb_page_commit(reader) & RB_MISSED_EVENTS)
> +		missed_events = -1;
> +
>  	if (cpu_buffer->ring_meta)
>  		rb_update_meta_reader(cpu_buffer, reader);
>  
> @@ -5965,6 +5969,8 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 */
>  	smp_rmb();
>  
> +	if (!cpu_buffer->lost_events)
> +		cpu_buffer->lost_events = missed_events;
>  
>  	return reader;
>  }
> @@ -7066,6 +7072,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	struct buffer_page *reader;
>  	long missed_events;
>  	unsigned int commit;
> +	unsigned int size;
>  	unsigned int read;
>  	u64 save_timestamp;
>  	bool force_memcpy;
> @@ -7101,7 +7108,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	event = rb_reader_event(cpu_buffer);
>  
>  	read = reader->read;
> -	commit = rb_page_size(reader);
> +	commit = rb_page_commit(reader);
> +	size = rb_page_size(reader);
>  
>  	/* Check if any events were dropped */
>  	missed_events = cpu_buffer->lost_events;
> @@ -7115,13 +7123,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	 * we must copy the data from the page to the buffer.
>  	 * Otherwise, we can simply swap the page with the one passed in.
>  	 */
> -	if (read || (len < (commit - read)) ||
> +	if (read || (len < (size - read)) ||
>  	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
>  	    force_memcpy) {
>  		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
>  		unsigned int rpos = read;
>  		unsigned int pos = 0;
> -		unsigned int size;
> +		unsigned int event_size;
> +		unsigned int flags = 0;
>  
>  		/*
>  		 * If a full page is expected, this can still be returned
> @@ -7130,19 +7139,22 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		 * the reader page.
>  		 */
>  		if (full &&
> -		    (!read || (len < (commit - read)) ||
> +		    (!read || (len < (size - read)) ||
>  		     cpu_buffer->reader_page == cpu_buffer->commit_page))
>  			return -1;
>  
> -		if (len > (commit - read))
> -			len = (commit - read);
> +		if (len > (size - read))
> +			len = (size - read);
>  
>  		/* Always keep the time extend and data together */
> -		size = rb_event_ts_length(event);
> +		event_size = rb_event_ts_length(event);
>  
> -		if (len < size)
> +		if (len < event_size)
>  			return -1;
>  
> +		if (commit & RB_MISSED_EVENTS)
> +			flags = RB_MISSED_EVENTS;
> +
>  		/* save the current timestamp, since the user will need it */
>  		save_timestamp = cpu_buffer->read_stamp;
>  
> @@ -7154,25 +7166,25 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  			 * one or two events.
>  			 * We have already ensured there's enough space if this
>  			 * is a time extend. */
> -			size = rb_event_length(event);
> -			memcpy(dpage->data + pos, rpage->data + rpos, size);
> +			event_size = rb_event_length(event);
> +			memcpy(dpage->data + pos, rpage->data + rpos, event_size);
>  
> -			len -= size;
> +			len -= event_size;
>  
>  			rb_advance_reader(cpu_buffer);
>  			rpos = reader->read;
> -			pos += size;
> +			pos += event_size;
>  
> -			if (rpos >= commit)
> +			if (rpos >= event_size)
>  				break;
>  
>  			event = rb_reader_event(cpu_buffer);
>  			/* Always keep the time extend and data together */
> -			size = rb_event_ts_length(event);
> -		} while (len >= size);
> +			event_size = rb_event_ts_length(event);
> +		} while (len >= event_size);
>  
>  		/* update dpage */
> -		local_set(&dpage->commit, pos);
> +		local_set(&dpage->commit, pos | flags);
>  		dpage->time_stamp = save_timestamp;
>  
>  		/* we copied everything to the beginning */
> @@ -7204,7 +7216,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  
>  	cpu_buffer->lost_events = 0;
>  
> -	commit = rb_data_page_commit(dpage);
> +	size = rb_data_page_size(dpage);
>  	/*
>  	 * Set a flag in the commit field if we lost events
>  	 */
> @@ -7214,11 +7226,11 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		 * missed events, then record it there.
>  		 */
>  		if (missed_events > 0 &&
> -		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
> -			memcpy(&dpage->data[commit], &missed_events,
> +		    buffer->subbuf_size - size >= sizeof(missed_events)) {
> +			memcpy(&dpage->data[size], &missed_events,
>  			       sizeof(missed_events));
>  			local_add(RB_MISSED_STORED, &dpage->commit);
> -			commit += sizeof(missed_events);
> +			size += sizeof(missed_events);
>  		}
>  		local_add(RB_MISSED_EVENTS, &dpage->commit);
>  	}
> @@ -7226,8 +7238,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	/*
>  	 * This page may be off to user land. Zero it out here.
>  	 */
> -	if (commit < buffer->subbuf_size)
> -		memset(&dpage->data[commit], 0, buffer->subbuf_size - commit);
> +	if (size < buffer->subbuf_size)
> +		memset(&dpage->data[size], 0, buffer->subbuf_size - size);
>  
>  	return read;
>  }
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply
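
The split between the raw commit value and the data size in the patch above
can be illustrated with a small model of a commit word that carries flag
bits alongside a length. This is a sketch under assumptions: the bit
positions below are my assumption of the layout used in
kernel/trace/ring_buffer.c, the model_* helpers are hypothetical, and a
plain 32-bit integer stands in for the kernel's commit field.

```c
#include <stdint.h>

/* Assumed flag layout: two top bits of the commit word are flags. */
#define RB_MISSED_EVENTS	(1u << 31)
#define RB_MISSED_STORED	(1u << 30)
#define RB_MISSED_MASK		(3u << 30)

/* Raw commit word, flags included (what the patch reads as "commit"). */
static uint32_t model_page_commit(uint32_t raw)
{
	return raw;
}

/* Data length with the flag bits masked off (what the patch calls "size"). */
static uint32_t model_page_size(uint32_t raw)
{
	return raw & ~RB_MISSED_MASK;
}

/* True if the subbuffer was marked as having missed events. */
static int model_missed(uint32_t raw)
{
	return (model_page_commit(raw) & RB_MISSED_EVENTS) != 0;
}

/* When copying "pos" bytes out of a source page, carry the missed-events
 * flag over to the destination commit word. */
static uint32_t model_copy_commit(uint32_t pos, uint32_t src_commit)
{
	return pos | (src_commit & RB_MISSED_EVENTS);
}
```

Length comparisons should use the masked size, while the flag test needs
the raw value; mixing the two would make a flagged page look far larger
than it is.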
