Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v10 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>

Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.

Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):

  rodata  - a PG_reserved page in the kernel rodata range, sourced
            from the "Kernel rodata" sub-resource of "System RAM" in
            /proc/iomem.  That entry is reported on every major
            architecture and guarantees the chosen PFN is backed by
            struct page (an online System RAM range, not a firmware
            hole), is PG_reserved, and is read-only -- so even if
            the panic fails to fire for some reason, the resulting
            PG_hwpoison marker on rodata does not corrupt writable
            kernel state.

  slab    - a slab page found by walking /proc/kpageflags for the
            first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
            / KPF_COMPOUND_TAIL clear).  Exercises the get_any_page()
            path on a non PG_reserved kernel-owned page and so
            catches regressions where get_any_page() collapses
            kernel-owned pages into a transient -EIO instead of
            -ENOTRECOVERABLE.

  pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.

PageLargeKmalloc, the fourth page type matched by
is_kernel_owned_page(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible.  The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.

The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page.  A
successful run crashes the kernel with

  Memory failure: <pfn>: unrecoverable page

A return from the inject means no panic fired.  Before reporting, the
script restores the sysctl and best-effort unpoisons the target PFN
through the hwpoison debugfs interface (hard_offline_page() injects
with MF_SW_SIMULATED, so the page stays unpoisonable), then re-reads
/proc/kpageflags: a PFN that is still the kernel-owned type it selected
is a genuine failure, while one that raced to a different type before
the inject is skipped as inconclusive.  Test outcome is therefore
observed externally (serial console, kdump) rather than from the
script's own exit code.

The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes.  It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic.  The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
 2 files changed, 253 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971c..ed321ae709dac 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
 TEST_PROGS += ksft_vma_merge.sh
 TEST_PROGS += ksft_vmalloc.sh
 
+# Destructive: every successful run panics the kernel.  Installed and
+# kept executable, but not run from a default kselftest invocation.
+TEST_PROGS_EXTENDED += hwpoison-panic.sh
+
 TEST_FILES := test_vmalloc.sh
 TEST_FILES += test_hmm.sh
 TEST_FILES += va_high_addr_switch.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 0000000000000..aafc06e895d01
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,249 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+#   rodata  - a PG_reserved page in the kernel rodata range
+#             (sourced from /proc/iomem "Kernel rodata").  Exercises
+#             memory_failure() -> get_any_page() on a PageReserved page.
+#
+#   slab    - a slab page found via /proc/kpageflags (KPF_SLAB).
+#             Exercises memory_failure() -> get_any_page() on a non
+#             PG_reserved kernel-owned page.  This path is what catches
+#             regressions where get_any_page() collapses kernel-owned
+#             pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+#   pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+#             Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel.  It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness.  It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+#   "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means no panic fired: that is a failure,
+# unless the target PFN raced to a different page type before injection,
+# in which case the run is inconclusive and is skipped.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+ksft_skip=4
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+unpoison_path=/sys/kernel/debug/hwpoison/unpoison-pfn
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+KPF_RESERVED=32
+
+pagesize=$(getconf PAGE_SIZE)
+
+kind=${1:-rodata}
+
+ksft_print() { echo "# $*"; }
+ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
+ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
+
+if [ "$(id -u)" -ne 0 ]; then
+	ksft_exit_skip "must run as root"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+	ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
+fi
+
+if [ ! -w "$inject_path" ]; then
+	ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+	ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+#   - the PFN is backed by struct page (within an online memory range);
+#   - PG_reserved is set on the page (kernel image area);
+#   - the memory is read-only, so setting PG_hwpoison on it does not
+#     corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+#     "  02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+	awk -v pagesize="$(getconf PAGE_SIZE)" '
+	# Convert a hex string to a number without relying on the gawk-only
+	# strtonum().  mawk lacks it and would otherwise spuriously skip
+	# this test on distros that ship mawk as /usr/bin/awk.
+	function hex2num(s,   n, i, c, v) {
+		n = 0
+		for (i = 1; i <= length(s); i++) {
+			c = tolower(substr(s, i, 1))
+			v = index("0123456789abcdef", c) - 1
+			if (v < 0)
+				return -1
+			n = n * 16 + v
+		}
+		return n
+	}
+	/: Kernel rodata[[:space:]]*$/ {
+		sub(/^[[:space:]]+/, "")
+		n = split($0, a, /[- ]/)
+		start = hex2num(a[1])
+		end   = hex2num(a[2])
+		if (end <= start)
+			next
+		# Page-align upward and emit the first byte of that page.
+		pfn = int((start + pagesize - 1) / pagesize)
+		printf "0x%x\n", pfn * pagesize
+		exit 0
+	}
+	' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+	local want_bit=$1
+	local pagesize skip_pfn
+
+	[ -r "$kpageflags_path" ] || return
+
+	pagesize=$(getconf PAGE_SIZE)
+	skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+	od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+	awk -v want_bit="$want_bit" \
+	    -v hwp_bit="$KPF_HWPOISON" \
+	    -v nopage_bit="$KPF_NOPAGE" \
+	    -v tail_bit="$KPF_COMPOUND_TAIL" \
+	    -v base_pfn="$skip_pfn" \
+	    -v pagesize="$pagesize" '
+	# Test whether bit "b" is set in the 16-hex-digit value "hex".
+	# Done with substring + per-digit lookup so we never rely on awk
+	# bitwise operators (mawk lacks them), 64-bit FP precision or the
+	# gawk-only strtonum().
+	function bit_set(hex, b,    di, bi, c, v) {
+		di = int(b / 4)
+		bi = b - di * 4
+		c = substr(hex, length(hex) - di, 1)
+		v = index("0123456789abcdef", tolower(c)) - 1
+		if (bi == 0) return (v % 2) == 1
+		if (bi == 1) return int(v / 2) % 2 == 1
+		if (bi == 2) return int(v / 4) % 2 == 1
+		return int(v / 8) % 2 == 1
+	}
+	{
+		gsub(/^[[:space:]]+/, "")
+		h = $1
+		if (bit_set(h, want_bit) &&
+		    !bit_set(h, hwp_bit) &&
+		    !bit_set(h, nopage_bit) &&
+		    !bit_set(h, tail_bit)) {
+			pfn = base_pfn + NR - 1
+			printf "0x%x\n", pfn * pagesize
+			exit 0
+		}
+	}
+	'
+}
+
+# Return 0 if /proc/kpageflags bit $2 is set for PFN $1, 1 if it is
+# clear, or 2 if the word cannot be read.  Used to re-confirm the target
+# page type after a non-panicking inject.
+kpageflags_bit_set() {
+	local word
+
+	word=$(od -An -tx8 -v -j "$(($1 * 8))" -N 8 "$kpageflags_path" 2>/dev/null | tr -d '[:space:]')
+	[ -n "$word" ] || return 2
+	(( (16#$word >> $2) & 1 ))
+}
+
+# Best-effort: drop the PG_hwpoison marker set by the inject so a failed
+# run does not leave a poisoned page behind.  hard_offline_page() injects
+# with MF_SW_SIMULATED, so the page stays unpoisonable through the
+# hwpoison debugfs interface (needs CONFIG_HWPOISON_INJECT + debugfs).
+try_unpoison() {
+	[ -w "$unpoison_path" ] || return 0
+	echo "$1" > "$unpoison_path" 2>/dev/null || true
+}
+
+case "$kind" in
+rodata)
+	phys_addr=$(pick_rodata_phys_addr)
+	recheck_bit=$KPF_RESERVED
+	missing_msg='no "Kernel rodata" entry in /proc/iomem'
+	;;
+slab)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+	recheck_bit=$KPF_SLAB
+	missing_msg="no usable slab PFN found in $kpageflags_path"
+	;;
+pgtable)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+	recheck_bit=$KPF_PGTABLE
+	missing_msg="no usable page-table PFN found in $kpageflags_path"
+	;;
+*)
+	ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+	;;
+esac
+
+if [ -z "$phys_addr" ]; then
+	ksft_exit_skip "$missing_msg"
+fi
+
+ksft_print "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
+
+pfn=$((phys_addr / pagesize))
+ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (pfn 0x$(printf '%x' "$pfn"), kind=$kind)"
+ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# A successful run never returns from the inject -- it panics the kernel.
+# Reaching the code below therefore means no panic fired.  Note whether
+# the write itself succeeded, then put the machine back: restore the
+# sysctl and best-effort unpoison the page we just marked.
+if echo "$phys_addr" > "$inject_path"; then
+	verdict="inject returned without panic; sysctl ineffective"
+else
+	verdict="inject failed before reaching the panic path"
+fi
+
+echo "$prior" > "$sysctl_path"
+try_unpoison "$pfn"
+
+# The page type can change between selection and injection (e.g. a slab
+# or page-table page is freed and reused).  Only treat a missing panic as
+# a failure if the target PFN is still the kernel-owned type we aimed at;
+# if it raced to another type the run is inconclusive, so skip instead.
+kpageflags_bit_set "$pfn" "$recheck_bit"
+case $? in
+0)	ksft_exit_fail "$verdict (page still $kind)" ;;
+1)	ksft_exit_skip "target PFN no longer $kind; raced before inject, inconclusive" ;;
+*)	ksft_exit_fail "$verdict (could not reconfirm page type via $kpageflags_path)" ;;
+esac

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

There is a selftest that test different cases, and I tested it using
the following variants:

  ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
  │ Variant │   PFN    │                          Result                           │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
  └─────────┴──────────┴───────────────────────────────────────────────────────────┘

Each one shows the same call trace, exactly the path the series builds:

  hard_offline_page_store
    → memory_failure
      → action_result
        → panic("Memory failure: %#lx: unrecoverable page")

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v10:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v9: https://lore.kernel.org/r/20260609-ecc_panic-v9-0-432a74002e74@debian.org

Changes in v9:
- HWPoisonKernelOwned(): wrap the head-page checks in a
  compound_head() recheck loop so a concurrent split or compound free
  cannot leave us trusting a stale view (Miaohe, Lance, David).
- selftest: drop the gawk-only strtonum() in hwpoison-panic.sh; do the
  hex parsing with a small index()-based helper so the test no longer
  spuriously skips itself on mawk-based distros (Sashiko).
- selftest: move hwpoison-panic.sh from TEST_FILES to
  TEST_PROGS_EXTENDED so the script is installed executable rather
  than as a non-executable data file (Sashiko).
- Link to v8: https://patch.msgid.link/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org

Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org

Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl
      selftests/mm: add hwpoison-panic destructive test

 Documentation/admin-guide/sysctl/vm.rst      |  80 +++++++++
 mm/memory-failure.c                          | 104 +++++++++--
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
 4 files changed, 419 insertions(+), 18 deletions(-)
---
base-commit: 30ffa8de54e5cc80d93fd211ca134d1764a7011f
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
-- 
Breno Leitao <leitao@debian.org>


^ permalink raw reply

* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-26 15:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Kaiser, Masami Hiramatsu (Google), linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260626064223.60ffeeaa@fedora>

On Fri, 26 Jun 2026 06:42:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 26 Jun 2026 12:20:36 +0200
> Martin Kaiser <martin@kaiser.cx> wrote:
> 
> > > That is, to have +u0() say "this is going to be dereferencing user space".  
> > 
> > > I'll add Martin's patch and see if it makes the above work.  
> > 
> > I've just tried your command with my patch. It works for me, filenames are
> > logged correctly.
> 
> Yep, this definitely looks like a fix. We have;
> 
> 	addr = rec + field->offset;
> 
> Where addr points to the location of the field on the ring buffer, thus
> your change to make it:
> 
> 	val = *(unsigned long *)addr;
> 
> Reads the full "long size" of the event on the ring buffer, instead of
> reading just one byte. It is "val" that gets dereferenced later by the
> probe logic (the "+0u()"), which has all the protections we need.
> 
> I'll queue this up.

I've already queued this on my probes/core branch.
(which will be probes/fixes)

Thanks,

> 
> Thanks!
> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Andrew Morton @ 2026-06-26 16:27 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>

On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <leitao@debian.org> wrote:

> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running.  The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash.  In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
> 
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context.  The default is disabled so existing workloads see no
> change.

Cool, thanks.  I added this to mm.git's mm-new branch.  Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.

Sashiko identified a few possible things, some pre-existing:

	https://sashiko.dev/#/patchset/20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org

^ permalink raw reply

* Re: [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Nathan Chancellor @ 2026-06-26 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
	Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
	linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
	linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260626045119.659d1e6b@fedora>

On Fri, Jun 26, 2026 at 04:51:19AM -0400, Steven Rostedt wrote:
> On Thu, 25 Jun 2026 16:41:58 -0700
> Nathan Chancellor <nathan@kernel.org> wrote:
> 
> 
> > The following diff resolves it for me, should I send it as a separate
> > patch or do you want to just fold it in with a note?
> > 
> > diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> > index 621566345406..2301a701ffbb 100644
> > --- a/include/linux/lockdep.h
> > +++ b/include/linux/lockdep.h
> > @@ -10,6 +10,7 @@
> >  #ifndef __LINUX_LOCKDEP_H
> >  #define __LINUX_LOCKDEP_H
> >  
> > +#include <linux/instruction_pointer.h>
> 
> Ah, so the reason for this breakage is because lockdep was relying on
> instruction_pointer.h, that just happened to be included in kernel.h
> via trace_printk.h.

Correct.

> This is a separate issue, so it should be a separate patch. I'll add it
> as patch 1 of this series.

Sounds good, thanks!

> Can you send me the config you used. This didn't trigger in my tests.

It is a plain allmodconfig, for example on arm:

  $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- allmodconfig lib/test_context-analysis.o
  In file included from include/linux/local_lock_internal.h:8,
                   from include/linux/local_lock.h:5,
                   from lib/test_context-analysis.c:9:
  include/linux/local_lock_internal.h: In function 'local_lock_acquire':
  include/linux/lockdep.h:541:87: error: '_THIS_IP_' undeclared (first use in this function)
    541 | #define lock_map_acquire(l)                     lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
        |                                                                                       ^~~~~~~~~
  include/linux/lockdep.h:509:88: note: in definition of macro 'lock_acquire_exclusive'
    509 | #define lock_acquire_exclusive(l, s, t, n, i)           lock_acquire(l, s, t, 0, 1, n, i)
        |                                                                                        ^
  include/linux/local_lock_internal.h:46:9: note: in expansion of macro 'lock_map_acquire'
     46 |         lock_map_acquire(&l->dep_map);
        |         ^~~~~~~~~~~~~~~~
  include/linux/lockdep.h:541:87: note: each undeclared identifier is reported only once for each function it appears in
  ...

I also reproduced it on top of allnoconfig:

  $ cat allno.config
  CONFIG_CONTEXT_ANALYSIS_TEST=y
  CONFIG_DEBUG_KERNEL=y
  CONFIG_DEBUG_LOCK_ALLOC=y
  CONFIG_EXPERT=y
  CONFIG_MMU=y
  CONFIG_RUNTIME_TESTING_MENU=y

  $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- KCONFIG_ALLCONFIG=1 clean allnoconfig lib/test_context-analysis.o
  <same error as above>

-- 
Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Sean Christopherson @ 2026-06-26 19:06 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aj3H2sxymOYTWTnE@yzhao56-desk.sh.intel.com>

On Fri, Jun 26, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 25, 2026, Yan Zhao wrote:
> > And I'm not remotely convinced that prepending allow_ to the param will help
> > end users diagnose "unexpected" memory consumption, in quotes because anyone that
> > is deploying a stack that utilizes out-of-place conversion absolutely needs to
> > understand and plan for the additional memory consumption.  I.e. if the memory
> > consumption is "unexpected" to the end user, they likely have far bigger problems.
> My first impression of gmem_in_place_conversion=true was that it enforces gmem
> in-place conversion. However, it actually only enforces per-gmem private/shared
> attribute.
> My worry was that people might think it's a kernel bug if userspace can still
> have shared memory from other sources after they configured
> gmem_in_place_conversion=true.

Ah, I see where you're coming from.  FWIW, truly enforcing in-place conversion
is flat out impossible.  E.g. userspace can simply replace the memslot, at which
point the memory effectively reverts to shared.

> However, I have no strong opinion if you think gmem_in_place_conversion is good,
> and with the above documentation. :)

Ya, I think this largely a documentation problem.  I agree that a param name
like gmem_private_memory_attributes would be more precise, but I think it'd be
far less informative for the vast majority of users that only care whether or
not KVM can do in-place conversion, and don't care about how that is done.

^ permalink raw reply

* [PATCH rfc 0/2] Improvements to ftrace comm[] handling
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Michal Koutný
  Cc: David Laight

RFC because they are untested and I want to send them before going
on holiday for 2 weeks (should still have email access).

The first patch avoids a lot of 'potentially unsized' string
functions by embedding the char[] use to hold task->comm[] in a
structure.

The second adjsuts the data structure used to cache the task
names in the thread switch code.

David Laight (2):
  tracing: Embed 'char comm[16]' in a structure
  tracing: Keep pid and comm[] in the same structure

 kernel/trace/blktrace.c              |  28 +++----
 kernel/trace/trace.c                 |   3 +-
 kernel/trace/trace.h                 |   9 ++-
 kernel/trace/trace_events_filter.c   |   2 +-
 kernel/trace/trace_events_hist.c     |  26 +++---
 kernel/trace/trace_functions_graph.c |  10 +--
 kernel/trace/trace_output.c          |  24 +++---
 kernel/trace/trace_sched_switch.c    | 113 ++++++++++++---------------
 8 files changed, 101 insertions(+), 114 deletions(-)

-- 
2.39.5


^ permalink raw reply

* [PATCH 1/2] tracing: Embed 'char comm[16]' in a structure
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Michal Koutný
  Cc: David Laight
In-Reply-To: <20260626212356.64150-1-david.laight.linux@gmail.com>

Embedding the array in a stucture makes the size explicit and lets
structure copies be used.

Limit the size to 16 charatacters even if task_struct.comm is
made larger (there are plans to increase it).

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 kernel/trace/blktrace.c              | 28 +++++++++----------
 kernel/trace/trace.c                 |  3 +-
 kernel/trace/trace.h                 |  9 ++++--
 kernel/trace/trace_events_filter.c   |  2 +-
 kernel/trace/trace_events_hist.c     | 26 +++++++----------
 kernel/trace/trace_functions_graph.c | 10 +++----
 kernel/trace/trace_output.c          | 24 ++++++++--------
 kernel/trace/trace_sched_switch.c    | 42 ++++++++++++++++------------
 8 files changed, 75 insertions(+), 69 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 8cd2520b4c99..68ffc95548b7 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1590,20 +1590,20 @@ static void blk_log_dump_pdu(struct trace_seq *s,
 
 static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
-	char cmd[TASK_COMM_LEN];
+	struct trace_comm cmd;
 
-	trace_find_cmdline(ent->pid, cmd);
+	trace_find_cmdline(ent->pid, &cmd);
 
 	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
 		trace_seq_printf(s, "%u ", t_bytes(ent));
 		blk_log_dump_pdu(s, ent, has_cg);
-		trace_seq_printf(s, "[%s]\n", cmd);
+		trace_seq_printf(s, "[%s]\n", cmd.comm);
 	} else {
 		if (t_sec(ent))
 			trace_seq_printf(s, "%llu + %u [%s]\n",
-						t_sector(ent), t_sec(ent), cmd);
+						t_sector(ent), t_sec(ent), cmd.comm);
 		else
-			trace_seq_printf(s, "[%s]\n", cmd);
+			trace_seq_printf(s, "[%s]\n", cmd.comm);
 	}
 }
 
@@ -1637,30 +1637,30 @@ static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bo
 
 static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
-	char cmd[TASK_COMM_LEN];
+	struct trace_comm cmd;
 
-	trace_find_cmdline(ent->pid, cmd);
+	trace_find_cmdline(ent->pid, &cmd);
 
-	trace_seq_printf(s, "[%s]\n", cmd);
+	trace_seq_printf(s, "[%s]\n", cmd.comm);
 }
 
 static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
-	char cmd[TASK_COMM_LEN];
+	struct trace_comm cmd;
 
-	trace_find_cmdline(ent->pid, cmd);
+	trace_find_cmdline(ent->pid, &cmd);
 
-	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
+	trace_seq_printf(s, "[%s] %llu\n", cmd.comm, get_pdu_int(ent, has_cg));
 }
 
 static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
-	char cmd[TASK_COMM_LEN];
+	struct trace_comm cmd;
 
-	trace_find_cmdline(ent->pid, cmd);
+	trace_find_cmdline(ent->pid, &cmd);
 
 	trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
-			 get_pdu_int(ent, has_cg), cmd);
+			 get_pdu_int(ent, has_cg), cmd.comm);
 }
 
 static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..7de658b8ee0d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -52,6 +52,7 @@
 #include <linux/sort.h>
 #include <linux/io.h> /* vmap_page_range() */
 #include <linux/fs_context.h>
+#include <linux/trace_printk.h>
 
 #include <asm/setup.h> /* COMMAND_LINE_SIZE */
 
@@ -2972,7 +2973,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
 	seq_puts(m, "#    -----------------\n");
 	seq_printf(m, "#    | task: %.16s-%d "
 		   "(uid:%d nice:%ld policy:%ld rt_prio:%ld)\n",
-		   data->comm, data->pid,
+		   data->comm.comm, data->pid,
 		   from_kuid_munged(seq_user_ns(m), data->uid), data->nice,
 		   data->policy, data->rt_priority);
 	seq_puts(m, "#    -----------------\n");
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..afd59d79e1fe 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -183,6 +183,11 @@ struct fexit_trace_entry_head {
 
 struct trace_array;
 
+/* task_struct->comm[] may be truncated to save memory/width */
+struct trace_comm {
+	char comm[16];
+};
+
 /*
  * The CPU trace array - it consists of thousands of trace entries
  * plus some other descriptor data: (for example which task started
@@ -203,7 +208,7 @@ struct trace_array_cpu {
 	u64			preempt_timestamp;
 	pid_t			pid;
 	kuid_t			uid;
-	char			comm[TASK_COMM_LEN];
+	struct trace_comm	comm;
 
 #ifdef CONFIG_FUNCTION_TRACER
 	int			ftrace_ignore_pid;
@@ -906,7 +911,7 @@ void trace_last_func_repeats(struct trace_array *tr,
 
 extern u64 ftrace_now(int cpu);
 
-extern void trace_find_cmdline(int pid, char comm[]);
+extern void trace_find_cmdline(int pid, struct trace_comm *comm);
 extern int trace_find_tgid(int pid);
 extern void trace_event_follow_fork(struct trace_array *tr, bool enable);
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 609325f57942..749887aff315 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -994,7 +994,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
 	int cmp;
 
 	cmp = pred->regex->match(current->comm, pred->regex,
-				TASK_COMM_LEN);
+				sizeof(current->comm));
 	return cmp ^ pred->not;
 }
 
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..1b51491b2a41 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -680,7 +680,7 @@ struct track_data {
 };
 
 struct hist_elt_data {
-	char *comm;
+	struct trace_comm *comm;
 	u64 *var_ref_vals;
 	char **field_var_str;
 	int n_field_var_str;
@@ -756,7 +756,7 @@ static struct track_data *track_data_alloc(unsigned int key_len,
 
 	data->elt.private_data = elt_data;
 
-	elt_data->comm = kzalloc(TASK_COMM_LEN, GFP_KERNEL);
+	elt_data->comm = kzalloc_obj(*elt_data->comm);
 	if (!elt_data->comm) {
 		track_data_free(data);
 		return ERR_PTR(-ENOMEM);
@@ -1608,19 +1608,19 @@ parse_hist_trigger_attrs(struct trace_array *tr, char *trigger_str)
 	return ERR_PTR(ret);
 }
 
-static inline void save_comm(char *comm, struct task_struct *task)
+static inline void save_comm(struct trace_comm *comm, struct task_struct *task)
 {
 	if (!task->pid) {
-		strcpy(comm, "<idle>");
+		strcpy(comm->comm, "<idle>");
 		return;
 	}
 
 	if (WARN_ON_ONCE(task->pid < 0)) {
-		strcpy(comm, "<XXX>");
+		strcpy(comm->comm, "<XXX>");
 		return;
 	}
 
-	strscpy(comm, task->comm, TASK_COMM_LEN);
+	strscpy(comm->comm, task->comm);
 }
 
 static void hist_elt_data_free(struct hist_elt_data *elt_data)
@@ -1646,7 +1646,6 @@ static void hist_trigger_elt_data_free(struct tracing_map_elt *elt)
 static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
 {
 	struct hist_trigger_data *hist_data = elt->map->private_data;
-	unsigned int size = TASK_COMM_LEN;
 	struct hist_elt_data *elt_data;
 	struct hist_field *hist_field;
 	unsigned int i, n_str;
@@ -1659,7 +1658,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
 		hist_field = hist_data->fields[i];
 
 		if (hist_field->flags & HIST_FIELD_FL_EXECNAME) {
-			elt_data->comm = kzalloc(size, GFP_KERNEL);
+			elt_data->comm = kzalloc_obj(*elt_data->comm);
 			if (!elt_data->comm) {
 				kfree(elt_data);
 				return -ENOMEM;
@@ -1677,8 +1676,6 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
 
 	BUILD_BUG_ON(STR_VAR_LEN_MAX & (sizeof(u64) - 1));
 
-	size = STR_VAR_LEN_MAX;
-
 	elt_data->field_var_str = kcalloc(n_str, sizeof(char *), GFP_KERNEL);
 	if (!elt_data->field_var_str) {
 		hist_elt_data_free(elt_data);
@@ -1687,7 +1684,7 @@ static int hist_trigger_elt_data_alloc(struct tracing_map_elt *elt)
 	elt_data->n_field_var_str = n_str;
 
 	for (i = 0; i < n_str; i++) {
-		elt_data->field_var_str[i] = kzalloc(size, GFP_KERNEL);
+		elt_data->field_var_str[i] = kzalloc(STR_VAR_LEN_MAX, GFP_KERNEL);
 		if (!elt_data->field_var_str[i]) {
 			hist_elt_data_free(elt_data);
 			return -ENOMEM;
@@ -3449,7 +3446,7 @@ static bool cond_snapshot_update(struct trace_array *tr, void *cond_data)
 	elt_data = context->elt->private_data;
 	track_elt_data = track_data->elt.private_data;
 	if (elt_data->comm)
-		strscpy(track_elt_data->comm, elt_data->comm, TASK_COMM_LEN);
+		track_elt_data->comm = elt_data->comm;
 
 	track_data->updated = true;
 
@@ -5505,16 +5502,13 @@ static void hist_trigger_print_key(struct seq_file *m,
 				   uval, (void *)(uintptr_t)uval);
 		} else if (key_field->flags & HIST_FIELD_FL_EXECNAME) {
 			struct hist_elt_data *elt_data = elt->private_data;
-			char *comm;
 
 			if (WARN_ON_ONCE(!elt_data))
 				return;
 
-			comm = elt_data->comm;
-
 			uval = *(u64 *)(key + key_field->offset);
 			seq_printf(m, "%s: %-16s[%10llu]", field_name,
-				   comm, uval);
+				   elt_data->comm->comm, uval);
 		} else if (key_field->flags & HIST_FIELD_FL_SYSCALL) {
 			const char *syscall_name;
 
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 0d2d3a2ea7dd..e46c5aa0d4e4 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -561,19 +561,19 @@ static void print_graph_cpu(struct trace_seq *s, int cpu)
 
 static void print_graph_proc(struct trace_seq *s, pid_t pid)
 {
-	char comm[TASK_COMM_LEN];
+	struct trace_comm comm;
 	/* sign + log10(MAX_INT) + '\0' */
 	char pid_str[12];
 	int spaces = 0;
 	int len;
 	int i;
 
-	trace_find_cmdline(pid, comm);
-	comm[7] = '\0';
+	trace_find_cmdline(pid, &comm);
+	comm.comm[7] = '\0';
 	sprintf(pid_str, "%d", pid);
 
 	/* 1 stands for the "-" character */
-	len = strlen(comm) + strlen(pid_str) + 1;
+	len = strlen(comm.comm) + strlen(pid_str) + 1;
 
 	if (len < TRACE_GRAPH_PROCINFO_LENGTH)
 		spaces = TRACE_GRAPH_PROCINFO_LENGTH - len;
@@ -582,7 +582,7 @@ static void print_graph_proc(struct trace_seq *s, pid_t pid)
 	for (i = 0; i < spaces / 2; i++)
 		trace_seq_putc(s, ' ');
 
-	trace_seq_printf(s, "%s-%s", comm, pid_str);
+	trace_seq_printf(s, "%s-%s", comm.comm, pid_str);
 
 	/* Last spaces to align center */
 	for (i = 0; i < spaces - (spaces / 2); i++)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..58405291b44f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -554,12 +554,12 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
 static int
 lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu)
 {
-	char comm[TASK_COMM_LEN];
+	struct trace_comm comm;
 
-	trace_find_cmdline(entry->pid, comm);
+	trace_find_cmdline(entry->pid, &comm);
 
 	trace_seq_printf(s, "%8.8s-%-7d %3d",
-			 comm, entry->pid, cpu);
+			 comm.comm, entry->pid, cpu);
 
 	return trace_print_lat_fmt(s, entry);
 }
@@ -658,11 +658,11 @@ int trace_print_context(struct trace_iterator *iter)
 	struct trace_array *tr = iter->tr;
 	struct trace_seq *s = &iter->seq;
 	struct trace_entry *entry = iter->ent;
-	char comm[TASK_COMM_LEN];
+	struct trace_comm comm;
 
-	trace_find_cmdline(entry->pid, comm);
+	trace_find_cmdline(entry->pid, &comm);
 
-	trace_seq_printf(s, "%16s-%-7d ", comm, entry->pid);
+	trace_seq_printf(s, "%16s-%-7d ", comm.comm, entry->pid);
 
 	if (tr->trace_flags & TRACE_ITER(RECORD_TGID)) {
 		unsigned int tgid = trace_find_tgid(entry->pid);
@@ -700,13 +700,13 @@ int trace_print_lat_context(struct trace_iterator *iter)
 	entry = iter->ent;
 
 	if (verbose) {
-		char comm[TASK_COMM_LEN];
+		struct trace_comm comm;
 
-		trace_find_cmdline(entry->pid, comm);
+		trace_find_cmdline(entry->pid, &comm);
 
 		trace_seq_printf(
 			s, "%16s %7d %3d %d %08x %08lx ",
-			comm, entry->pid, iter->cpu, entry->flags,
+			comm.comm, entry->pid, iter->cpu, entry->flags,
 			entry->preempt_count & 0xf, iter->idx);
 	} else {
 		lat_print_generic(s, entry, iter->cpu);
@@ -1276,7 +1276,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
 					     char *delim)
 {
 	struct ctx_switch_entry *field;
-	char comm[TASK_COMM_LEN];
+	struct trace_comm comm;
 	int S, T;
 
 
@@ -1284,7 +1284,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
 
 	T = task_index_to_char(field->next_state);
 	S = task_index_to_char(field->prev_state);
-	trace_find_cmdline(field->next_pid, comm);
+	trace_find_cmdline(field->next_pid, &comm);
 	trace_seq_printf(&iter->seq,
 			 " %7d:%3d:%c %s [%03d] %7d:%3d:%c %s\n",
 			 field->prev_pid,
@@ -1293,7 +1293,7 @@ static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter,
 			 field->next_cpu,
 			 field->next_pid,
 			 field->next_prio,
-			 T, comm);
+			 T, comm.comm);
 
 	return trace_handle_return(&iter->seq);
 }
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e9f0ff962660..972883643097 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -172,27 +172,33 @@ struct saved_cmdlines_buffer {
 	unsigned *map_cmdline_to_pid;
 	unsigned cmdline_num;
 	int cmdline_idx;
-	char saved_cmdlines[];
+	struct trace_comm saved_cmdlines[];
 };
 static struct saved_cmdlines_buffer *savedcmd;
 
 /* Holds the size of a cmdline and pid element */
 #define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s)			\
-	(TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0]))
+	(sizeof(struct trace_comm) + sizeof((s)->map_cmdline_to_pid[0]))
 
-static inline char *get_saved_cmdlines(int idx)
+static inline struct trace_comm *get_saved_cmdlines(int idx)
 {
-	return &savedcmd->saved_cmdlines[idx * TASK_COMM_LEN];
+	return &savedcmd->saved_cmdlines[idx];
 }
 
-static inline void set_cmdline(int idx, const char *cmdline)
+static inline void set_cmdline(int idx, const struct task_struct *tsk)
 {
-	strscpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
+	struct trace_comm *comm = get_saved_cmdlines(idx);
+
+	BUILD_BUG_ON(sizeof(comm->comm) > sizeof(tsk->comm));
+
+	memcpy(comm->comm, tsk->comm, sizeof comm->comm);
+	if (sizeof(comm->comm) != sizeof(tsk->comm))
+		comm->comm[ARRAY_SIZE(comm->comm) - 1] = 0;
 }
 
 static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
 {
-	int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
+	int order = get_order(sizeof(*s) + s->cmdline_num * sizeof(struct trace_comm));
 
 	kmemleak_free(s);
 	free_pages((unsigned long)s, order);
@@ -222,7 +228,7 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
 	s->cmdline_num = val;
 
 	/* Place map_cmdline_to_pid array right after saved_cmdlines */
-	s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN];
+	s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val];
 
 	memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
 	       sizeof(s->map_pid_to_cmdline));
@@ -273,25 +279,25 @@ int trace_save_cmdline(struct task_struct *tsk)
 	}
 
 	savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
-	set_cmdline(idx, tsk->comm);
+	set_cmdline(idx, tsk);
 
 	arch_spin_unlock(&trace_cmdline_lock);
 
 	return 1;
 }
 
-static void __trace_find_cmdline(int pid, char comm[])
+static void __trace_find_cmdline(int pid, struct trace_comm *comm)
 {
 	unsigned map;
 	int tpid;
 
 	if (!pid) {
-		strcpy(comm, "<idle>");
+		strcpy(comm->comm, "<idle>");
 		return;
 	}
 
 	if (WARN_ON_ONCE(pid < 0)) {
-		strcpy(comm, "<XXX>");
+		strcpy(comm->comm, "<XXX>");
 		return;
 	}
 
@@ -300,14 +306,14 @@ static void __trace_find_cmdline(int pid, char comm[])
 	if (map != NO_CMDLINE_MAP) {
 		tpid = savedcmd->map_cmdline_to_pid[map];
 		if (tpid == pid) {
-			strscpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN);
+			*comm = *get_saved_cmdlines(map);
 			return;
 		}
 	}
-	strcpy(comm, "<...>");
+	strcpy(comm->comm, "<...>");
 }
 
-void trace_find_cmdline(int pid, char comm[])
+void trace_find_cmdline(int pid, struct trace_comm *comm)
 {
 	preempt_disable();
 	arch_spin_lock(&trace_cmdline_lock);
@@ -561,11 +567,11 @@ static void saved_cmdlines_stop(struct seq_file *m, void *v)
 
 static int saved_cmdlines_show(struct seq_file *m, void *v)
 {
-	char buf[TASK_COMM_LEN];
+	struct trace_comm buf;
 	unsigned int *pid = v;
 
-	__trace_find_cmdline(*pid, buf);
-	seq_printf(m, "%d %s\n", *pid, buf);
+	__trace_find_cmdline(*pid, &buf);
+	seq_printf(m, "%d %s\n", *pid, buf.comm);
 	return 0;
 }
 
-- 
2.39.5


^ permalink raw reply related

* [PATCH 2/2] tracing: Keep pid and comm[] in the same structure
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Michal Koutný
  Cc: David Laight
In-Reply-To: <20260626212356.64150-1-david.laight.linux@gmail.com>

Rather than have two separate dynamic arrays on the end of struct
saved_commandlines_buffer have a single dynamic array where each
entry contains the pid and associated task->comm[].
This simplifies the initialisation and lookup.

Don't bother trying to initialise the pid field no a non-zero value,
it only matters in the tracing_saved_cmdlines_seq_ops code.
Allocate entry [0] first so that the tracing_saved_cmdlines_seq_ops
code can just index the array with the file offset.

The code now uses the correct size when determining the page 'order'
to free the structure. The smaller size will always give the same
'order'.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Is there any reason why this code uses alloc_pages() rather
than vmalloc()?
map_pid_to_cmdline[] is 64k*sizeof(int) so the whole structure
expands to 512k with about 64k/20 (about 3200) pid entries even
though the default is 128.
AFAICT there is only one copy of the data - so it could be static.
Perhaps with pointers to map_pid_cmdline[] and (after this patch)
pid_comm[], both of which could be separately resized.

I also noticed that map_pid_to_cmdline[] contains indexes into
pid_comm[], restricting these to 16bits would half the data area.

 kernel/trace/trace_sched_switch.c | 97 +++++++++++++------------------
 1 file changed, 39 insertions(+), 58 deletions(-)

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 972883643097..5e7c8cf444b8 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -167,27 +167,21 @@ static size_t tgid_map_max;
  * where interrupt is disabled.
  */
 static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
+struct pid_comm {
+	pid_t pid;
+	struct trace_comm comm;
+};
 struct saved_cmdlines_buffer {
 	unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
-	unsigned *map_cmdline_to_pid;
 	unsigned cmdline_num;
 	int cmdline_idx;
-	struct trace_comm saved_cmdlines[];
+	struct pid_comm pid_comm[];
 };
 static struct saved_cmdlines_buffer *savedcmd;
 
-/* Holds the size of a cmdline and pid element */
-#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s)			\
-	(sizeof(struct trace_comm) + sizeof((s)->map_cmdline_to_pid[0]))
-
-static inline struct trace_comm *get_saved_cmdlines(int idx)
-{
-	return &savedcmd->saved_cmdlines[idx];
-}
-
-static inline void set_cmdline(int idx, const struct task_struct *tsk)
+static inline void set_cmdline(struct pid_comm *pid_comm, const struct task_struct *tsk)
 {
-	struct trace_comm *comm = get_saved_cmdlines(idx);
+	struct trace_comm *comm = &pid_comm->comm;
 
 	BUILD_BUG_ON(sizeof(comm->comm) > sizeof(tsk->comm));
 
@@ -212,7 +206,7 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
 	int order;
 
 	/* Figure out how much is needed to hold the given number of cmdlines */
-	orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
+	orig_size = sizeof(*s) + val * sizeof(s->pid_comm[0]);
 	order = get_order(orig_size);
 	size = 1 << (order + PAGE_SHIFT);
 	page = alloc_pages(GFP_KERNEL, order);
@@ -224,16 +218,11 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
 	memset(s, 0, sizeof(*s));
 
 	/* Round up to actual allocation */
-	val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
+	val = (size - sizeof(*s)) / sizeof(s->pid_comm[0]);
 	s->cmdline_num = val;
 
-	/* Place map_cmdline_to_pid array right after saved_cmdlines */
-	s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val];
-
 	memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
 	       sizeof(s->map_pid_to_cmdline));
-	memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
-	       val * sizeof(*s->map_cmdline_to_pid));
 
 	return s;
 }
@@ -247,6 +236,7 @@ int trace_create_savedcmd(void)
 
 int trace_save_cmdline(struct task_struct *tsk)
 {
+	struct pid_comm *pid_comm;
 	unsigned tpid, idx;
 
 	/* treat recording of idle task as a success */
@@ -272,14 +262,16 @@ int trace_save_cmdline(struct task_struct *tsk)
 
 	idx = savedcmd->map_pid_to_cmdline[tpid];
 	if (idx == NO_CMDLINE_MAP) {
-		idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;
-
+		idx = savedcmd->cmdline_idx;
 		savedcmd->map_pid_to_cmdline[tpid] = idx;
+		if (++idx >= savedcmd->cmdline_num)
+			idx = 0;
 		savedcmd->cmdline_idx = idx;
 	}
 
-	savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
-	set_cmdline(idx, tsk);
+	pid_comm = savedcmd->pid_comm + idx;
+	pid_comm->pid = tsk->pid;
+	set_cmdline(pid_comm, tsk);
 
 	arch_spin_unlock(&trace_cmdline_lock);
 
@@ -288,8 +280,8 @@ int trace_save_cmdline(struct task_struct *tsk)
 
 static void __trace_find_cmdline(int pid, struct trace_comm *comm)
 {
+	struct pid_comm *pid_comm;
 	unsigned map;
-	int tpid;
 
 	if (!pid) {
 		strcpy(comm->comm, "<idle>");
@@ -301,12 +293,11 @@ static void __trace_find_cmdline(int pid, struct trace_comm *comm)
 		return;
 	}
 
-	tpid = pid & (PID_MAX_DEFAULT - 1);
-	map = savedcmd->map_pid_to_cmdline[tpid];
+	map = savedcmd->map_pid_to_cmdline[pid & (PID_MAX_DEFAULT - 1)];
 	if (map != NO_CMDLINE_MAP) {
-		tpid = savedcmd->map_cmdline_to_pid[map];
-		if (tpid == pid) {
-			*comm = *get_saved_cmdlines(map);
+		pid_comm = savedcmd->pid_comm + map;
+		if (pid_comm->pid == pid) {
+			*comm = pid_comm->comm;;
 			return;
 		}
 	}
@@ -521,42 +512,34 @@ const struct file_operations tracing_saved_tgids_fops = {
 	.release	= seq_release,
 };
 
-static void *saved_cmdlines_next(struct seq_file *m, void *v, loff_t *pos)
+static struct pid_comm *saved_cmdlines_entry(loff_t off)
 {
-	unsigned int *ptr = v;
+	struct pid_comm *pid_comm;
 
-	if (*pos || m->count)
-		ptr++;
-
-	(*pos)++;
+	if (off >= savedcmd->cmdline_num)
+		return NULL;
 
-	for (; ptr < &savedcmd->map_cmdline_to_pid[savedcmd->cmdline_num];
-	     ptr++) {
-		if (*ptr == -1 || *ptr == NO_CMDLINE_MAP)
-			continue;
+	/* Entries are used in sequence and never freed */
+	pid_comm = &savedcmd->pid_comm[off];
+	return pid_comm->pid ? pid_comm : NULL;
+}
 
-		return ptr;
-	}
+static void *saved_cmdlines_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	loff_t off = *pos;
 
-	return NULL;
+	v = saved_cmdlines_entry(off);
+	if (v)
+		*pos = off + 1;
+	return v;
 }
 
 static void *saved_cmdlines_start(struct seq_file *m, loff_t *pos)
 {
-	void *v;
-	loff_t l = 0;
-
 	preempt_disable();
 	arch_spin_lock(&trace_cmdline_lock);
 
-	v = &savedcmd->map_cmdline_to_pid[0];
-	while (l <= *pos) {
-		v = saved_cmdlines_next(m, v, &l);
-		if (!v)
-			return NULL;
-	}
-
-	return v;
+	return saved_cmdlines_entry(*pos);
 }
 
 static void saved_cmdlines_stop(struct seq_file *m, void *v)
@@ -567,11 +550,9 @@ static void saved_cmdlines_stop(struct seq_file *m, void *v)
 
 static int saved_cmdlines_show(struct seq_file *m, void *v)
 {
-	struct trace_comm buf;
-	unsigned int *pid = v;
+	struct pid_comm *ptr = v;
 
-	__trace_find_cmdline(*pid, &buf);
-	seq_printf(m, "%d %s\n", *pid, buf.comm);
+	seq_printf(m, "%d %s\n", ptr->pid, ptr->comm.comm);
 	return 0;
 }
 
-- 
2.39.5


^ permalink raw reply related

page:              | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox