Linux Documentation

* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-06-09 11:01 UTC
  To: Nico Pache
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcD7WAiA1b9GTLAuNZ+kHaFx0SzZwpBkqAZ=s+RHsTUaow@mail.gmail.com>



On 2026/6/9 18:50, Nico Pache wrote:
> On Tue, Jun 9, 2026 at 4:37 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/6/9 17:32, Nico Pache wrote:
>>> On Tue, Jun 9, 2026 at 3:26 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>>
>>>> On 6/9/26 11:06, Nico Pache wrote:
>>>>> On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>>>>
>>>>>> On 6/6/26 12:28, Lance Yang wrote:
>>>>>>>
>>>>>>>
>>>>>>> Looks broken for swap PTEs in PMD collapse ...
>>>>>>>
>>>>>>> collapse_scan_pmd() allows them up to max_ptes_swap and records them in
>>>>>>> unmapped, but they don't get a bit in mthp_present_ptes. And then
>>>>>>> mthp_collapse() does the check above:
>>>>>>
>>>>>> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>>>>>>
>>>>>>           if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>>>>>>                   max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>>>>>>
>>>>>> But we perform the check a second time.
>>>>>>
>>>>>>>
>>>>>>> nr_occupied_ptes >= nr_ptes - max_ptes_none
>>>>>>>
>>>>>>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
>>>>>>> call collapse_huge_page() for PMD order.
>>>>>>>
>>>>>>> Shouldn't we account for them in the PMD-order check? Something like:
>>>>>>>
>>>>>>> if (is_pmd_order(order))
>>>>>>>         nr_occupied_ptes += unmapped;
>>>>>
>>>>> This solution seems good for a temporary fixup, but long-term we may
>>>>> want something else. I'm still not sure how we plan on supporting
>>>>> swapin without causing creep. So I'd be ok with adding a fix for
>>>>> legacy PMD behavior until we know how to handle mTHP creep correctly.
>>>>>
>>>>>> As an alternative, we could either 1) skip the check there for
>>>>>> pmd order (as the check was already done); or 2) introduce+maintain
>>>>>> a bitmap that tracks non-present PTEs.
>>>>>>
>>>>>> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>>>>>>                   nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>>>>>                                                         offset + nr_ptes);
>>>>>>
>>>>>> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>>>>> +               /* Check was already done in the caller. */
>>>>>> +               if (is_pmd_order(order) ||
>>>>>> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>>>>>                           enum scan_result ret;
>>>>>>
>>>>>>                           collapse_address = address + offset * PAGE_SIZE;
>>>>>>
>>>>>> 2) would probably be cleanest long-term.
>>>>>
>>>>> That would be best for future swapin support in mTHP, but I still
>>>>> don't think it solves the creep issue.
>>>>
>>>> It wouldn't, we'd simply maintain the state we collect + rely on in separate
>>>> bitmaps. On swapin, we'd have to update/refresh bitmaps I guess.
>>>
>>> Yeah, I'm saying for the future, it obviously solves this issue here
>>> as well, but if we have positional tracking of the swapout, shared,
>>> and none PTEs, I think we can use this to determine whether the
>>> collapse would lead to creep. If we detect creep would happen it may
>>> be best to automatically collapse to the N+1 (or greater) candidate.
>>> Just thinking out loud here.
>>>
>>>>
>>>>> Perhaps we could combine the
>>>>> two bitmaps to determine if it would make the future collapse eligible
>>>>> again? Not sure, but I'll start thinking about it.
>>>>>
>>>>> Should I send a fixup for this using Lance's solution? Or does Lance
>>>>> want to send a patch out with the fixes tag?
>>>>
>>>> If Lance could send a fixup, explaining the situation, that would be nice.
>>
>> Sure, happy to send a fixup :P
>>
>> Should I send it as a fixup to be folded into this patch, or as a
>> separate patch with a Fixes: tag?
> 
> I'd assume a separate patch so you can keep credit for the discovery :)

Okay :D

> Thank you for all the review you provided on this series, it's been
> really helpful!

Appreciate it!

Nice work getting it this far. Nice one, Nico :P
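
To make the arithmetic in the quoted discussion concrete, here is a
minimal shell sketch of the eligibility check
"nr_occupied_ptes >= nr_ptes - max_ptes_none".  It is an illustration
only, not kernel code: the helper name pmd_check and its argument order
are made up, and 512 PTEs per PMD is an assumption (4 KiB pages, 2 MiB
PMD).  Only the comparison and the proposed
"nr_occupied_ptes += unmapped" adjustment come from the thread.

```bash
#!/bin/bash
# Illustrative model of the check discussed above; not kernel code.
# pmd_check is a hypothetical helper.  It prints "collapse" when
# nr_occupied_ptes >= nr_ptes - max_ptes_none, otherwise "skip".
pmd_check() {
	local present=$1 unmapped=$2 nr_ptes=$3 max_ptes_none=$4
	local count_unmapped=$5
	local nr_occupied_ptes=$present

	# Proposed fixup from the thread: count swap PTEs for PMD order.
	if [ "$count_unmapped" = "1" ]; then
		nr_occupied_ptes=$((nr_occupied_ptes + unmapped))
	fi

	if [ "$nr_occupied_ptes" -ge $((nr_ptes - max_ptes_none)) ]; then
		echo collapse
	else
		echo skip
	fi
}

# 511 present PTEs + 1 swap PTE, max_ptes_none=0, 512 PTEs (assumed).
pmd_check 511 1 512 0 0   # swap PTE not counted: 511 >= 512 fails
pmd_check 511 1 512 0 1   # swap PTE counted:     512 >= 512 holds
```

The first call models the reported behaviour (the PMD-order collapse is
never attempted), the second models the suggested fixup.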


* [PATCH v9 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-06-09 10:57 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.

Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):

  rodata  - a PG_reserved page in the kernel rodata range, sourced
            from the "Kernel rodata" sub-resource of "System RAM" in
            /proc/iomem.  That entry is reported on every major
            architecture and guarantees the chosen PFN is backed by
            struct page (an online System RAM range, not a firmware
            hole), is PG_reserved, and is read-only -- so even if
            the panic fails to fire for some reason, the resulting
            PG_hwpoison marker on rodata does not corrupt writable
            kernel state.

  slab    - a slab page found by walking /proc/kpageflags for the
            first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
            / KPF_COMPOUND_TAIL clear).  Exercises the get_any_page()
            path on a non PG_reserved kernel-owned page and so
            catches regressions where get_any_page() collapses
            kernel-owned pages into a transient -EIO instead of
            -ENOTRECOVERABLE.

  pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.

PageLargeKmalloc, the fourth page type matched by
HWPoisonKernelOwned(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible.  The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.

The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page.  A
successful run crashes the kernel with

  Memory failure: <pfn>: unrecoverable page

A return from the inject means the panic did not fire and the test
fails.  Test outcome is therefore observed externally (serial
console, kdump) rather than from the script's own exit code.

The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes.  It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic.  The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 208 +++++++++++++++++++++++++++
 2 files changed, 212 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971..ed321ae709da 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
 TEST_PROGS += ksft_vma_merge.sh
 TEST_PROGS += ksft_vmalloc.sh
 
+# Destructive: every successful run panics the kernel.  Installed and
+# kept executable, but not run from a default kselftest invocation.
+TEST_PROGS_EXTENDED += hwpoison-panic.sh
+
 TEST_FILES := test_vmalloc.sh
 TEST_FILES += test_hmm.sh
 TEST_FILES += va_high_addr_switch.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 000000000000..fe58e7638a8b
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,208 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+#   rodata  - a PG_reserved page in the kernel rodata range
+#             (sourced from /proc/iomem "Kernel rodata").  Exercises
+#             memory_failure() -> get_any_page() on a PageReserved page.
+#
+#   slab    - a slab page found via /proc/kpageflags (KPF_SLAB).
+#             Exercises memory_failure() -> get_any_page() on a non
+#             PG_reserved kernel-owned page.  This path is what catches
+#             regressions where get_any_page() collapses kernel-owned
+#             pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+#   pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+#             Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel.  It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness.  It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+#   "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means the panic did not fire and the test
+# fails.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+ksft_skip=4
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+
+kind=${1:-rodata}
+
+ksft_print() { echo "# $*"; }
+ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
+ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
+
+if [ "$(id -u)" -ne 0 ]; then
+	ksft_exit_skip "must run as root"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+	ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
+fi
+
+if [ ! -w "$inject_path" ]; then
+	ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+	ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+#   - the PFN is backed by struct page (within an online memory range);
+#   - PG_reserved is set on the page (kernel image area);
+#   - the memory is read-only, so setting PG_hwpoison on it does not
+#     corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+#     "  02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+	awk -v pagesize="$(getconf PAGE_SIZE)" '
+	# Convert a hex string to a number without relying on the gawk-only
+	# strtonum().  mawk lacks it and would otherwise spuriously skip
+	# this test on distros that ship mawk as /usr/bin/awk.
+	function hex2num(s,   n, i, c, v) {
+		n = 0
+		for (i = 1; i <= length(s); i++) {
+			c = tolower(substr(s, i, 1))
+			v = index("0123456789abcdef", c) - 1
+			if (v < 0)
+				return -1
+			n = n * 16 + v
+		}
+		return n
+	}
+	/: Kernel rodata[[:space:]]*$/ {
+		sub(/^[[:space:]]+/, "")
+		n = split($0, a, /[- ]/)
+		start = hex2num(a[1])
+		end   = hex2num(a[2])
+		if (end <= start)
+			next
+		# Page-align upward and emit the first byte of that page.
+		pfn = int((start + pagesize - 1) / pagesize)
+		printf "0x%x\n", pfn * pagesize
+		exit 0
+	}
+	' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+	local want_bit=$1
+	local pagesize skip_pfn
+
+	[ -r "$kpageflags_path" ] || return
+
+	pagesize=$(getconf PAGE_SIZE)
+	skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+	od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+	awk -v want_bit="$want_bit" \
+	    -v hwp_bit="$KPF_HWPOISON" \
+	    -v nopage_bit="$KPF_NOPAGE" \
+	    -v tail_bit="$KPF_COMPOUND_TAIL" \
+	    -v base_pfn="$skip_pfn" \
+	    -v pagesize="$pagesize" '
+	# Test whether bit "b" is set in the 16-hex-digit value "hex".
+	# Done with substring + per-digit lookup so we never rely on awk
+	# bitwise operators (mawk lacks them), 64-bit FP precision or the
+	# gawk-only strtonum().
+	function bit_set(hex, b,    di, bi, c, v) {
+		di = int(b / 4)
+		bi = b - di * 4
+		c = substr(hex, length(hex) - di, 1)
+		v = index("0123456789abcdef", tolower(c)) - 1
+		if (bi == 0) return (v % 2) == 1
+		if (bi == 1) return int(v / 2) % 2 == 1
+		if (bi == 2) return int(v / 4) % 2 == 1
+		return int(v / 8) % 2 == 1
+	}
+	{
+		gsub(/^[[:space:]]+/, "")
+		h = $1
+		if (bit_set(h, want_bit) &&
+		    !bit_set(h, hwp_bit) &&
+		    !bit_set(h, nopage_bit) &&
+		    !bit_set(h, tail_bit)) {
+			pfn = base_pfn + NR - 1
+			printf "0x%x\n", pfn * pagesize
+			exit 0
+		}
+	}
+	'
+}
+
+case "$kind" in
+rodata)
+	phys_addr=$(pick_rodata_phys_addr)
+	missing_msg='no "Kernel rodata" entry in /proc/iomem'
+	;;
+slab)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+	missing_msg="no usable slab PFN found in $kpageflags_path"
+	;;
+pgtable)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+	missing_msg="no usable page-table PFN found in $kpageflags_path"
+	;;
+*)
+	ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+	;;
+esac
+
+if [ -z "$phys_addr" ]; then
+	ksft_exit_skip "$missing_msg"
+fi
+
+ksft_print "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
+
+ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (kind=$kind)"
+ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# If this returns, the kernel did not panic → test failed.  Restore the
+# sysctl before reporting so the system is left as we found it.
+if echo "$phys_addr" > "$inject_path"; then
+	echo "$prior" > "$sysctl_path"
+	ksft_exit_fail "inject returned without panic; sysctl ineffective"
+fi
+
+# Write failed (e.g. -EINVAL on offlining a non-online region): also a
+# failure for this test, since we expected the panic path.
+echo "$prior" > "$sysctl_path"
+ksft_exit_fail "inject failed before reaching the panic path"

-- 
2.53.0-Meta
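
The PFN selection rule used for the slab and pgtable variants (wanted
bit set; KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL clear) can be
tried by hand on individual flag words.  The sketch below is a
simplified stand-in, not the script's implementation: the helper names
bit_set and is_candidate are hypothetical, and it uses bash 64-bit
arithmetic where the patch deliberately uses a per-digit awk lookup for
mawk portability.  The bit numbers are the ones listed in the patch.

```bash
#!/bin/bash
# Hypothetical helpers mirroring the selection rule described above.
# Bit positions are copied from the patch.
KPF_SLAB=7
KPF_COMPOUND_TAIL=16
KPF_HWPOISON=19
KPF_NOPAGE=20
KPF_PGTABLE=26

# bit_set <hex flags word, no 0x prefix> <bit number>
bit_set() {
	(( (16#$1 >> $2) & 1 ))
}

# is_candidate <hex flags word, no 0x prefix> <wanted bit>
is_candidate() {
	bit_set "$1" "$2" &&
	! bit_set "$1" "$KPF_HWPOISON" &&
	! bit_set "$1" "$KPF_NOPAGE" &&
	! bit_set "$1" "$KPF_COMPOUND_TAIL"
}

# 0x80 has only bit 7 (KPF_SLAB) set; 0x80080 adds bit 19 (KPF_HWPOISON).
is_candidate 0000000000000080 "$KPF_SLAB" && echo "0x80: usable slab page"
is_candidate 0000000000080080 "$KPF_SLAB" || echo "0x80080: rejected"
```

The input format matches what "od -An -tx8" prints for one
/proc/kpageflags entry, so a word copied from that output can be passed
in directly.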



* [PATCH v9 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-06-09 10:56 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 85 +++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..f71d87039904 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,90 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting kernel-owned pages that the handler cannot recover:
+``PageReserved`` (firmware reservations, kernel image, vDSO, zero
+page, and similar memblock-reserved regions), ``PageSlab``,
+``PageTable``, and ``PageLargeKmalloc``.  These are owned by the
+kernel and the memory failure handler cannot reliably evict their
+contents.
+
+For soft offline (``madvise(MADV_SOFT_OFFLINE)``,
+``/sys/devices/system/memory/soft_offline_page``), pages owned by
+``movable_ops`` are exempted, since soft offline is allowed to
+migrate them even though they are not on the LRU.
+
+Other unrecoverable kernel-owned populations (vmalloc allocations,
+kernel stack pages, ...) are not currently covered because the
+handler has no page-type signal that distinguishes them from a
+userspace folio temporarily off the LRU during migration or
+compaction.  Such pages still go through the standard
+MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them and a
+delayed crash on the next access remains possible.  Coverage may
+grow as the handler gains stronger kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies.  Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+  regularly and post-mortem analysis of an unrelated downstream crash
+  (often seconds to minutes after the original error) consumes
+  significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+  hardware error produces a vmcore that still contains the faulting
+  address, the affected page state, and the originating MCE/GHES
+  record — context that is typically lost by the time a delayed crash
+  occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+  failure for failover, and prefer an immediate panic over silent data
+  corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+  tools such as ``mce-inject`` or error-injection debugfs interfaces,
+  where panicking on the unrecoverable path makes regressions
+  immediately visible instead of surfacing as later, unrelated
+  failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.53.0-Meta
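
As a usage sketch for the documented sysctl: a provisioning script may
want to enable it only when the running kernel exposes the file.  The
helper name enable_panic_on_umf and the SYSCTL_FILE override are
assumptions for illustration; the default path is the one from the
patch.

```bash
#!/bin/bash
# Sketch: enable the sysctl only when the running kernel exposes it.
# SYSCTL_FILE may be overridden (e.g. for testing); the default is the
# path documented in the patch.
SYSCTL_FILE=${SYSCTL_FILE:-/proc/sys/vm/panic_on_unrecoverable_memory_failure}

# enable_panic_on_umf <path>: write 1, or fail if the file is absent
# or not writable (older kernel, or not running as root).
enable_panic_on_umf() {
	local file=$1

	if [ ! -w "$file" ]; then
		echo "not available: $file" >&2
		return 1
	fi
	echo 1 > "$file"
}

if enable_panic_on_umf "$SYSCTL_FILE"; then
	echo "enabled, current value: $(cat "$SYSCTL_FILE")"
else
	echo "left unchanged"
fi
```

Whether the machine reboots after the panic still depends on the
separate "panic" sysctl, as the table in the patch notes.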



* [PATCH v9 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-09 10:56 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
default) that triggers a kernel panic when memory_failure()
encounters pages that cannot be recovered.  This provides a clean
crash with useful debug information rather than allowing silent
data corruption or a delayed crash at an unrelated code path.

Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
large-kmalloc).

All other action types are excluded:

- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
  transient refcount races with the page allocator (an in-flight buddy
  allocation has refcount 0 and is no longer on the buddy free list,
  briefly), and panicking on them would risk killing the box for what
  is actually a recoverable userspace page.

- MF_MSG_UNKNOWN means identify_page_state() could not classify the
  page; that is precisely the wrong basis for a panic decision.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 35f2b5d89fbe..a8b466a48b02 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf)
+		return false;
+
+	return type == MF_MSG_KERNEL && result == MF_IGNORED;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1272,6 +1292,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 

-- 
2.53.0-Meta
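
The panic eligibility rule in this patch is small enough to tabulate.
The following shell sketch is a userspace model only: would_panic is a
hypothetical helper, and the strings stand in for the kernel enum
values of the same names.  It encodes the condition from
panic_on_unrecoverable_mf(): sysctl enabled, type MF_MSG_KERNEL, result
MF_IGNORED.

```bash
#!/bin/bash
# Userspace model of the decision in panic_on_unrecoverable_mf().
# would_panic <sysctl value> <type> <result>; succeeds when the kernel
# would panic according to the rule in the patch.
would_panic() {
	local enabled=$1 type=$2 result=$3

	[ "$enabled" != "0" ] &&
	[ "$type" = "MF_MSG_KERNEL" ] &&
	[ "$result" = "MF_IGNORED" ]
}

# Walk the action types named in the commit message with the sysctl on.
for type in MF_MSG_KERNEL MF_MSG_GET_HWPOISON \
	    MF_MSG_KERNEL_HIGH_ORDER MF_MSG_UNKNOWN; do
	if would_panic 1 "$type" MF_IGNORED; then
		echo "$type: panic"
	else
		echo "$type: continue"
	fi
done
```

Only the MF_MSG_KERNEL row reports "panic"; the other three types are
the ones the commit message excludes.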



* [PATCH v9 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-06-09 10:56 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, page tables,
large-kmalloc).  memory_failure() still folds every negative return
into MF_MSG_GET_HWPOISON, so callers that want to react to the
unrecoverable cases (a panic option, smarter logging) cannot tell
them apart from transient page-allocator races.

Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.

The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, page tables and large-kmalloc
pages.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eed9de387694..35f2b5d89fbe 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2444,7 +2444,8 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	res = get_hwpoison_page(p, flags);
-	if (!res) {
+	switch (res) {
+	case 0:
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -2463,7 +2464,19 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
-	} else if (res < 0) {
+	case 1:
+		/* Got a refcount on a handlable page. */
+		break;
+	case -ENOTRECOVERABLE:
+		/*
+		 * Stable unhandlable kernel-owned page (PG_reserved,
+		 * slab, page tables, large-kmalloc).
+		 * No recovery possible.
+		 */
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	default:
+		/* Transient lifecycle race with the page allocator. */
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;
 	}

-- 
2.53.0-Meta
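
The return-code mapping introduced by the switch can be summarised as a
lookup.  The sketch below is a userspace model under stated
assumptions: classify_get_hwpoison_ret is a hypothetical helper, errno
values are kept symbolic, and the labels for cases 0 and 1 are
shorthand for the existing handling, which the patch leaves unchanged.

```bash
#!/bin/bash
# Userspace model of the switch over the get_hwpoison_page() return
# code in memory_failure().  Errno values are kept symbolic because
# their numeric values do not matter for the mapping.
classify_get_hwpoison_ret() {
	case "$1" in
	0)                echo "free-buddy or MF_MSG_KERNEL_HIGH_ORDER" ;;
	1)                echo "continue" ;;
	-ENOTRECOVERABLE) echo "MF_MSG_KERNEL" ;;
	*)                echo "MF_MSG_GET_HWPOISON" ;;
	esac
}

for ret in 0 1 -ENOTRECOVERABLE -EIO -EBUSY; do
	printf '%-18s -> %s\n' "$ret" "$(classify_get_hwpoison_ret "$ret")"
done
```

The catch-all branch mirrors the "default:" label in the patch: every
return other than 0, 1 and -ENOTRECOVERABLE is still reported as
MF_MSG_GET_HWPOISON.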


* [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-09 10:56 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

get_any_page() collapses every HWPoisonHandlable() rejection into a
single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
-> retry path.  That is correct for the transient case (a userspace
folio briefly off LRU during migration or compaction, which a later
shake can drag back), but wrong for stable kernel-owned pages: slab,
page-table, large-kmalloc and PG_reserved pages will never become
HWPoisonHandlable(), so the retry loop is wasted work and the final
-EIO loses the "this is structurally unrecoverable" information.
memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
panic-on-unrecoverable sysctl deliberately does not act on.

Introduce HWPoisonKernelOwned(), a small predicate that positively
identifies pages the hwpoison handler cannot recover from:

  HWPoisonKernelOwned(p, flags) :=
      !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
      (PageReserved(p) ||
       PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))

  where head = compound_head(p).

PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
page directly.  The slab, page-table and large-kmalloc page-type bits
are only stored on the head page, so those tests resolve the compound
head first, then re-read compound_head(page) afterwards: a concurrent
split or compound free that moves head invalidates the just-read flags
and the loop retries.  The lookup still takes no refcount, mirroring
the rest of get_any_page(); the recheck closes the common split race,
and a residual free->alloc->free in the same window can only mis-tag
a genuinely poisoned page, never reclassify a handlable one.

The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
same exception in HWPoisonHandlable(): soft-offline is allowed to
migrate movable_ops pages even though they are not on the LRU, and
we must not pre-empt that with an unrecoverable verdict.

The list is intentionally not exhaustive.  vmalloc and kernel-stack
pages, for example, do not carry a page_type bit and would need a
different oracle; they keep going through the existing retry path
unchanged.  This is the smallest set we can identify with certainty
by page type.

Wire the helper into the top of get_any_page() to short-circuit
those pages before the retry loop runs.  On a hit, drop the caller's
MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
straight away.  Pages outside the helper's positive list still take
the existing retry path and return -EIO, leaving operator-visible
behaviour for those cases unchanged.

Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc to document the new return.

memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch on its own only changes the errno that soft_offline_page()
can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
through memory_failure() and reports MF_MSG_KERNEL for the
unrecoverable cases, which is what the
panic_on_unrecoverable_memory_failure sysctl observes.
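
To make the before/after classification concrete, the following
userspace sketch contrasts the two behaviours.  The enum and function
names are hypothetical and only echo the MF_MSG_* labels used above;
the real switch in memory_failure() is not reproduced here and may
differ.

```c
#include <errno.h>

/* Hypothetical labels; they only echo the MF_MSG_* names used above. */
enum model_msg { MODEL_MSG_NONE, MODEL_MSG_GET_HWPOISON, MODEL_MSG_KERNEL };

/* This patch on its own: every negative return folds into one label. */
static enum model_msg model_classify_now(int res)
{
	return res < 0 ? MODEL_MSG_GET_HWPOISON : MODEL_MSG_NONE;
}

/* What the follow-up is described as adding: a distinct label. */
static enum model_msg model_classify_followup(int res)
{
	switch (res) {
	case -ENOTRECOVERABLE:
		return MODEL_MSG_KERNEL;
	default:
		return res < 0 ? MODEL_MSG_GET_HWPOISON : MODEL_MSG_NONE;
	}
}
```

With this split, -EIO and -EBUSY keep their current label and only
-ENOTRECOVERABLE is reported differently.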

Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f4d3e6e20e13..eed9de387694 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 	return PageLRU(page) || is_free_buddy_page(page);
 }
 
+/*
+ * Positive identification of pages the hwpoison handler cannot recover.
+ * These page types are owned by kernel internals (no userspace mapping
+ * to unmap, no file mapping to invalidate, no migration target), so the
+ * shake_page() / retry loop in get_any_page() can never turn them into
+ * something HWPoisonHandlable() will accept.  Short-circuit them to
+ * -ENOTRECOVERABLE so callers can panic on operator request instead of
+ * spinning through retries that exit as a transient-looking -EIO.
+ *
+ * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
+ * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
+ * pages even though they are not on the LRU.
+ */
+static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
+{
+	struct page *head;
+
+	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
+		return false;
+
+	/* PG_reserved is a per-page flag, never set on a compound page. */
+	if (PageReserved(page))
+		return true;
+
+	/*
+	 * Page-type bits live only on the head page, so resolve any tail
+	 * first.  The check takes no refcount; recheck the head afterwards
+	 * so a concurrent split or compound free cannot leave us trusting
+	 * a stale view.  A free->alloc->free in the same window is still
+	 * possible but closing it would require taking a reference here.
+	 */
+retry:
+	head = compound_head(page);
+	if (!(PageSlab(head) || PageTable(head) || PageLargeKmalloc(head)))
+		return false;
+	if (head != compound_head(page))
+		goto retry;
+	return true;
+}
+
 static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct folio *folio = page_folio(page);
@@ -1371,6 +1411,19 @@ static int get_any_page(struct page *p, unsigned long flags)
 	if (flags & MF_COUNT_INCREASED)
 		count_increased = true;
 
+	/*
+	 * Page types we know are kernel-owned and cannot be recovered.
+	 * Short-circuit before the shake_page() / retry loop, which
+	 * cannot turn any of these into something HWPoisonHandlable().
+	 * Drop the caller's reference if MF_COUNT_INCREASED took one.
+	 */
+	if (HWPoisonKernelOwned(p, flags)) {
+		if (count_increased)
+			put_page(p);
+		ret = -ENOTRECOVERABLE;
+		goto out;
+	}
+
 try_again:
 	if (!count_increased) {
 		ret = __get_hwpoison_page(p, flags);
@@ -1418,7 +1471,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 		ret = -EIO;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
 
 	return ret;
@@ -1475,7 +1528,10 @@ static int __get_unpoison_page(struct page *page)
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
  *         operations like allocation and free,
- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ *         -ENOTRECOVERABLE for kernel-owned pages identified by
+ *         HWPoisonKernelOwned() (PG_reserved, slab,
+ *         page-table, large-kmalloc) that the handler cannot recover.
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {

-- 
2.53.0-Meta



* [PATCH v9 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-09 10:56 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

There is a selftest that covers different cases, and I tested it with
the following variants:

  ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
  │ Variant │   PFN    │                          Result                           │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
  └─────────┴──────────┴───────────────────────────────────────────────────────────┘

Each one shows the same call trace, exactly the path the series builds:

  hard_offline_page_store
    → memory_failure
      → action_result
        → panic("Memory failure: %#lx: unrecoverable page")

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v9:
- HWPoisonKernelOwned(): wrap the head-page checks in a
  compound_head() recheck loop so a concurrent split or compound free
  cannot leave us trusting a stale view (Miaohe, Lance, David).
- selftest: drop the gawk-only strtonum() in hwpoison-panic.sh; do the
  hex parsing with a small index()-based helper so the test no longer
  spuriously skips itself on mawk-based distros (Sashiko).
- selftest: move hwpoison-panic.sh from TEST_FILES to
  TEST_PROGS_EXTENDED so the script is installed executable rather
  than as a non-executable data file (Sashiko).
- Link to v8: https://patch.msgid.link/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org

Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org

Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest because its value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl
      selftests/mm: add hwpoison-panic destructive test

 Documentation/admin-guide/sysctl/vm.rst      |  85 +++++++++++
 mm/memory-failure.c                          | 114 ++++++++++++---
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 208 +++++++++++++++++++++++++++
 4 files changed, 393 insertions(+), 18 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* [PATCH v9 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-06-09 10:56 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

The first entry of error_states[],

	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },

is unreachable.  identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():

  * memory_failure() reaches identify_page_state() only after
    get_hwpoison_page() returned 1.  get_any_page() reaches that
    return only via __get_hwpoison_page(), which only takes a
    refcount when the page is HWPoisonHandlable().
    HWPoisonHandlable() is an allowlist for LRU, free-buddy, and
    (for soft-offline) movable_ops pages -- PG_reserved pages do
    not satisfy any of these, so they fail with -EBUSY/-EIO long
    before identify_page_state() runs.

  * try_memory_failure_hugetlb() reaches identify_page_state() only
    via the MF_HUGETLB_IN_USED branch, where the page is necessarily
    a hugetlb folio.  hugetlb folios don't carry PG_reserved at that
    point: hugetlb_folio_init_vmemmap() calls __folio_clear_reserved()
    during init, so the reserved entry would not match even if it
    were still present.

So me_kernel() never executes, and the entry exists only to be matched
by callers that can never hand it a PG_reserved page.
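
For readers unfamiliar with the table, the first-match lookup can be
modelled in userspace roughly as follows.  This is a sketch, not the
kernel code: the MODEL_* bits, struct model_state and the entries are
hypothetical, and only the (flags & mask) == res matching rule is
meant to mirror identify_page_state().

```c
#include <stddef.h>

#define MODEL_RESERVED	(1UL << 0)
#define MODEL_LRU	(1UL << 1)
#define MODEL_DIRTY	(1UL << 2)

struct model_state {
	unsigned long mask;	/* which flag bits to look at */
	unsigned long res;	/* required value of those bits */
	const char *name;
};

static const struct model_state model_states[] = {
	{ MODEL_RESERVED, MODEL_RESERVED, "kernel" },
	{ MODEL_LRU | MODEL_DIRTY, MODEL_LRU | MODEL_DIRTY, "dirty LRU" },
	{ MODEL_LRU | MODEL_DIRTY, MODEL_LRU, "clean LRU" },
	{ 0, 0, "unknown" },	/* catch-all: always matches */
};

/* Return the name of the first entry whose masked flags match. */
static const char *model_identify(unsigned long page_flags)
{
	const struct model_state *ps;

	for (ps = model_states; ; ps++)
		if ((page_flags & ps->mask) == ps->res)
			return ps->name;
}
```

The "kernel" entry would win for any page with the reserved bit set,
but it is dead if no caller can pass such a page, which is the
situation the two bullets above describe.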

Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.

No functional change.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..f4d3e6e20e13 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -980,17 +980,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
 	return false;
 }
 
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
-	unlock_page(p);
-	return MF_IGNORED;
-}
-
 /*
  * Page in unknown state. Do nothing.
  * This is a catch-all in case we fail to make sense of the page state.
@@ -1199,10 +1188,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 #define mlock		(1UL << PG_mlocked)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
-#define reserved	(1UL << PG_reserved)
 
 static struct page_state error_states[] = {
-	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -1234,7 +1221,6 @@ static struct page_state error_states[] = {
 #undef mlock
 #undef lru
 #undef head
-#undef reserved
 
 static void update_per_node_mf_stats(unsigned long pfn,
 				     enum mf_result result)

-- 
2.53.0-Meta



* Re: [PATCH v6 07/12] PCI: Refactor matching logic for pci_dev_acs_ops
From: Pranjal Shrivastava @ 2026-06-09 10:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
	Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
	Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
	Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <aic46OtIKfLhdoKy@google.com>

On Mon, Jun 08, 2026 at 09:49:28PM +0000, David Matlack wrote:
> On 2026-06-07 07:01 PM, Pranjal Shrivastava wrote:
> > On Fri, May 22, 2026 at 08:24:05PM +0000, David Matlack wrote:
> > > Refactor the logic to match devices to pci_dev_acs_ops by factoring out
> > > the loop and device matching into its own routine. This eliminates some
> > > duplicate code between pci_dev_specific_enable_acs() and
> > > pci_dev_specific_disable_acs_redir(), and will also be used in a
> > > subsequent commit to check if a device requires device-specific
> > > enable_acs() during a Live Update.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  drivers/pci/quirks.c | 50 ++++++++++++++++++--------------------------
> > >  1 file changed, 20 insertions(+), 30 deletions(-)
> > > 
> > 
> > [...]
> > 
> > >  } pci_dev_acs_ops[] = {
> > >  	{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
> > > +	    .match = pci_quirk_intel_pch_acs_match,
> > >  	    .enable_acs = pci_quirk_enable_intel_pch_acs,
> > >  	},
> > >  	{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
> > > +	    .match = pci_quirk_intel_spt_pch_acs_match,
> > >  	    .enable_acs = pci_quirk_enable_intel_spt_pch_acs,
> > >  	    .disable_acs_redir = pci_quirk_disable_intel_spt_pch_acs_redir,
> > >  	},
> > >  };
> > >  
> > > -int pci_dev_specific_enable_acs(struct pci_dev *dev)
> > > +static const struct pci_dev_acs_ops *pci_dev_acs_ops_get(struct pci_dev *dev)
> > >  {
> > >  	const struct pci_dev_acs_ops *p;
> > > -	int i, ret;
> > > +	int i;
> > >  
> > >  	for (i = 0; i < ARRAY_SIZE(pci_dev_acs_ops); i++) {
> > >  		p = &pci_dev_acs_ops[i];
> > > @@ -5481,33 +5475,29 @@ int pci_dev_specific_enable_acs(struct pci_dev *dev)
> > >  		     p->vendor == (u16)PCI_ANY_ID) &&
> > >  		    (p->device == dev->device ||
> > >  		     p->device == (u16)PCI_ANY_ID) &&
> > > -		    p->enable_acs) {
> > > -			ret = p->enable_acs(dev);
> > > -			if (ret >= 0)
> > > -				return ret;
> > > -		}
> > > +		    p->match(dev))
> > > +			return p;
> > 
> > Nit:
> > Should we check p->match != NULL, like we do for p->enable_acs and
> > p->disable_acs_redir?
> >
> > Otherwise, it seems like we're mandating the existence of a match op in
> > the pci_dev_acs_ops here. Today, we just have two Intel entries in that
> > array, both of which need the match op. However, AFAICT, it shouldn't be
> > mandatory for future SoCs that might only need a simple vid + devid match.
> 
> *shrug*
> 
> I would usually say those future SoCs should be the ones to make it
> optional if and when they need to.

Well, that's fair, I guess.

> 
> But making p->match optional now isn't so bad:
> 
>         for (i = 0; i < ARRAY_SIZE(pci_dev_acs_ops); i++) {
>                 p = &pci_dev_acs_ops[i];
>                 if ((p->vendor == dev->vendor ||
>                      p->vendor == (u16)PCI_ANY_ID) &&
>                     (p->device == dev->device ||
> -                    p->device == (u16)PCI_ANY_ID) &&
> -                   p->enable_acs) {
> -                       ret = p->enable_acs(dev);
> -                       if (ret >= 0)
> -                               return ret;
> +                    p->device == (u16)PCI_ANY_ID)) {
> +                       if (!p->match || p->match(dev))
> +                               return p;
>                 }
>         }
> 
> I can include this in v7 if you would like.

I don't have a strong opinion here; this should be fine.
It's just that we already have NULL checks for p->enable_acs and
p->disable_acs_redir, so it'd be nice to keep the same pattern.
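
For reference, the optional-match pattern can be tried out in userspace
with something like the sketch below.  All model_* names and the
0x1234:0x0001 entry are hypothetical; only the vendor/device wildcard
test and the "!p->match || p->match(dev)" condition follow the snippet
quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ANY_ID 0xffffu

struct model_dev {
	uint16_t vendor;
	uint16_t device;
	bool is_pch;		/* hypothetical property a match op might test */
};

struct model_acs_ops {
	uint16_t vendor;
	uint16_t device;
	bool (*match)(const struct model_dev *dev);	/* may be NULL */
	const char *name;
};

static bool model_match_pch(const struct model_dev *dev)
{
	return dev->is_pch;
}

static const struct model_acs_ops model_ops[] = {
	{ 0x8086, MODEL_ANY_ID, model_match_pch, "needs-match" },
	{ 0x1234, 0x0001, NULL, "id-only" },	/* no match op required */
};

/* Return the first entry whose IDs match and whose match op, if any, agrees. */
static const struct model_acs_ops *model_ops_get(const struct model_dev *dev)
{
	size_t i;

	for (i = 0; i < sizeof(model_ops) / sizeof(model_ops[0]); i++) {
		const struct model_acs_ops *p = &model_ops[i];

		if ((p->vendor == dev->vendor || p->vendor == MODEL_ANY_ID) &&
		    (p->device == dev->device || p->device == MODEL_ANY_ID) &&
		    (!p->match || p->match(dev)))
			return p;
	}
	return NULL;
}
```

An entry with a NULL match op is selected on vendor and device IDs
alone, which is the case a future ID-only quirk would rely on.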

Thanks,
Praan


* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-09 10:50 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand (Arm), linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <7e36f7f0-b4d5-41c9-b399-9e0079907d33@linux.dev>

On Tue, Jun 9, 2026 at 4:37 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/6/9 17:32, Nico Pache wrote:
> > On Tue, Jun 9, 2026 at 3:26 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>
> >> On 6/9/26 11:06, Nico Pache wrote:
> >>> On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>>>
> >>>> On 6/6/26 12:28, Lance Yang wrote:
> >>>>>
> >>>>>
> >>>>> Looks broken for swap PTEs in PMD collapse ...
> >>>>>
> >>>>> collapse_scan_pmd() allows them up to max_ptes_swap and records them in
> >>>>> unmapped, but they don't get a bit in mthp_present_ptes. And then
> >>>>> mthp_collapse() does the check above:
> >>>>
> >>>> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
> >>>>
> >>>>          if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> >>>>                  max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >>>>
> >>>> But we perform the check a second time.
> >>>>
> >>>>>
> >>>>> nr_occupied_ptes >= nr_ptes - max_ptes_none
> >>>>>
> >>>>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
> >>>>> call collapse_huge_page() for PMD order.
> >>>>>
> >>>>> Shouldn't we account for them in the PMD-order check? Something like:
> >>>>>
> >>>>> if (is_pmd_order(order))
> >>>>>        nr_occupied_ptes += unmapped;
> >>>
> >>> This solution seems good for a temporary fixup, but long term we may
> >>> want something else. I'm still not sure how we plan on supporting
> >>> swapin without causing creep. So I'd be ok with adding a fix for
> >>> legacy PMD behavior until we know how to handle mTHP creep correctly.
> >>>
> >>>> As an alternative, we could either 1) skip the check there for
> >>>> pmd order (as the check was already done); or 2) introduce+maintain
> >>>> a bitmap that tracks non-present PTEs.
> >>>>
> >>>> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> >>>>                  nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> >>>>                                                        offset + nr_ptes);
> >>>>
> >>>> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >>>> +               /* Check was already done in the caller. */
> >>>> +               if (is_pmd_order(order) ||
> >>>> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >>>>                          enum scan_result ret;
> >>>>
> >>>>                          collapse_address = address + offset * PAGE_SIZE;
> >>>>
> >>>> 2) would probably be cleanest long-term.
> >>>
> >>> That would be best for future swapin support in mTHP, but I still
> >>> don't think it solves the creep issue.
> >>
> >> It wouldn't, we'd simply maintain the state we collect + rely on in separate
> >> bitmaps. On swapin, we'd have to update/refresh bitmaps I guess.
> >
> > Yeah, I'm saying for the future, it obviously solves this issue here
> > as well, but if we have positional tracking of the swapout, shared,
> > and none PTEs, I think we can use this to determine whether the
> > collapse would lead to creep. If we detect creep would happen it may
> > be best to automatically collapse to the N+1 (or greater) candidate.
> > Just thinking out loud here.
> >
> >>
> >>> Perhaps we could combine the
> >>> two bitmaps to determine if it would make the future collapse eligible
> >>> again? Not sure, but I'll start thinking about it.
> >>>
> >>> Should I send a fixup for this using Lance's solution? Or does Lance
> >>> want to send a patch out with the fixes tag?
> >>
> >> If Lance could send a fixup, explaining the situation, that would be nice.
>
> Sure, happy to send a fixup :P
>
> Should I send it as a fixup to be folded into this patch, or as a
> separate patch with a Fixes: tag?

I'd assume a separate patch, so you can keep credit for the discovery :)

Thank you for all the review you provided on this series; it's been
really helpful!

-- Nico
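
For anyone following along, the PMD-order case from the quoted
discussion can be checked with a small userspace sketch.  The function
and parameter names are made up for illustration, 512 PTEs per PMD
assumes 4 KiB pages with a 2 MiB PMD, and max_ptes_none is assumed not
to exceed nr_ptes; count_swap models the proposed
"nr_occupied_ptes += unmapped" fixup.

```c
#include <stdbool.h>

#define MODEL_PMD_NR_PTES 512	/* assumes 4 KiB pages and a 2 MiB PMD */

/*
 * Model of the occupancy check: present PTEs are counted from the
 * bitmap, swap PTEs are only counted when count_swap is set and the
 * candidate is PMD-sized.
 */
static bool model_should_collapse(unsigned int nr_present,
				  unsigned int nr_swap,
				  unsigned int nr_ptes,
				  unsigned int max_ptes_none,
				  bool count_swap)
{
	unsigned int nr_occupied = nr_present;

	if (count_swap && nr_ptes == MODEL_PMD_NR_PTES)
		nr_occupied += nr_swap;

	return nr_occupied >= nr_ptes - max_ptes_none;
}
```

With max_ptes_none=0, 511 present PTEs and one swap PTE, the check
fails without the fixup (511 < 512) and passes with it (512 >= 512).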

>
> Will get one out soon :)
>
> > OK, I'd appreciate that :)
>
> Cheers!
>



* Re: [PATCH v6 03/12] PCI: liveupdate: Track incoming preserved PCI devices
From: Pranjal Shrivastava @ 2026-06-09 10:48 UTC (permalink / raw)
  To: David Matlack
  Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
	Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
	Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
	Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <aicsyesGrqcWj7vu@google.com>

On Mon, Jun 08, 2026 at 08:57:45PM +0000, David Matlack wrote:
> On 2026-06-06 10:08 AM, Pranjal Shrivastava wrote:
> > On Fri, May 22, 2026 at 08:24:01PM +0000, David Matlack wrote:
> > > During PCI enumeration, the previous kernel might have passed state about
> > > devices that were preserved across kexec. The PCI core needs to fetch
> > > this state to identify which devices are "incoming" and require special
> > > handling.
> > > 
> > > Add pci_liveupdate_setup_device() which is called during device setup
> > > to fetch the serialized state (struct pci_ser) from the Live Update
> > > Orchestrator. The first time this happens, pci_flb_retrieve() will run
> > > and convert the array of pci_dev_ser structs into an xarray so that it
> > > can be looked up efficiently.
> > > 
> > > If a device is found in the xarray, the PCI core stores a pointer to its
> > > state in dev->liveupdate_incoming and holds a reference to the incoming
> > > FLB until pci_liveupdate_finish() is called by the driver.
> > > 
> > > This ensures proper lifecycle management for incoming preserved devices
> > > and allows the PCI core and drivers to apply specific Live Update
> > > logic to them in subsequent commits.
> > > 
> > > Drivers can check if a device is an incoming preserved device (e.g.
> > > during probe) by calling pci_liveupdate_is_incoming().
> > > 
> > > CONFIG_64BIT is now required to enable CONFIG_PCI_LIVEUPDATE so that the
> > > domain and bdf can be guaranteed to fit in an unsigned long and be used
> > > as the xarray key.
> > > 
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  MAINTAINERS                    |   1 +
> > >  drivers/pci/Kconfig            |   2 +-
> > >  drivers/pci/liveupdate.c       | 230 ++++++++++++++++++++++++++++++++-
> > >  drivers/pci/liveupdate.h       |   5 +
> > >  drivers/pci/probe.c            |   3 +
> > >  include/linux/pci_liveupdate.h |  13 ++
> > >  6 files changed, 251 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 6c618830cf61..0e262c0ceb43 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -20537,6 +20537,7 @@ L:	linux-pci@vger.kernel.org
> > >  S:	Maintained
> > >  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
> > >  F:	drivers/pci/liveupdate.c
> > > +F:	drivers/pci/liveupdate.h
> > >  F:	include/linux/kho/abi/pci.h
> > >  F:	include/linux/pci_liveupdate.h
> > >  
> > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > > index 10c9b65aa242..e68ae5c172d4 100644
> > > --- a/drivers/pci/Kconfig
> > > +++ b/drivers/pci/Kconfig
> > > @@ -330,7 +330,7 @@ config VGA_ARB_MAX_GPUS
> > >  
> > >  config PCI_LIVEUPDATE
> > >  	bool "PCI Live Update Support"
> > > -	depends on PCI && LIVEUPDATE
> > > +	depends on PCI && LIVEUPDATE && 64BIT
> > 
> > I see that the static assertions in Patch 1 work because of the 64BIT
> > enforcement here. In that case, should we have the assertions check u64?
> 
> The static asserts have nothing to do with the 64BIT enforcement here.
> The static asserts just verify that the array elements in struct pci_ser
> are naturally aligned (unsigned long) so they can be accessed
> efficiently. The requirement here for CONFIG_64BIT is for the xarray
> key.
> 
> Theoretically if we got the xarray to work with 32-bit architectures
> then we could drop the CONFIG_64BIT requirement here.
> 

Ack. I see.
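
To spell out the key-width point for myself, here is a userspace
sketch.  The key layout (domain above a 16-bit BDF) and the
model_pci_key() name are my assumptions, not taken from the patch;
only the bus/device/function packing follows the usual PCI encoding.

```c
#include <stdint.h>

/*
 * Hypothetical key layout: PCI domain in the upper bits, 16-bit BDF
 * (8-bit bus, 5-bit device, 3-bit function) in the lower bits.
 */
static uint64_t model_pci_key(uint32_t domain, uint8_t bus,
			      uint8_t dev, uint8_t fn)
{
	uint16_t bdf = (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));

	return ((uint64_t)domain << 16) | bdf;
}
```

Under this layout any domain number wider than 16 bits produces a key
that no longer fits in 32 bits, so a 32-bit unsigned long could not
hold it as an xarray index.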

[...]
> > > +	kho_restore_free(ser);
> > 
> > I tend to partly agree with Sashiko[1] here.. it raises a policy-hole.
> > We may need a policy here, the options I have in mind are:
> > 
> > 1. Retrieve shall ONLY be tried once, if it fails (like -ENOMEM in the
> >    xArray alloc), it's a liveupdate failure. We can't retry liveupdate.
> > 
> > 2. Retrying retrieve is allowed.
> > 
> > The only downside with option 1 is, the user may want flexibility due to
> > certain subsystems OR may choose NOT to use the proposed LUOd and instead
> > have its own user-space component which might try funny things or have a
> > different use-case.
> > 
> > In such a situation, the system may have transiently run out of memory
> > during the kexec transition (e.g. a subsystem uses GFP_ATOMIC to
> > allocate memory and temporarily runs out of the atomic pool). [Note we
> > removed it in IOMMU v1 [2] but subsystems may have a use-case for it]
> > 
> > If the kernel frees the KHO page on the first failure, it removes any
> > chance of recovery. :/
> > 
> > Thus, it might make sense to let the user decide if it wants to fail the
> > liveupdate or retry again based on the failure type / source?
> 
> The plan is to have LUO enforce that retrieve() is only called once:
> 
>   https://lore.kernel.org/kexec/20260528174140.1921129-3-dmatlack@google.com/
> 
> Supporting retry gets complicated since there are many different places
> where retrieve() could have failed.

Ack. Thanks for pointing me to the thread.
In that case, no problem.

Thanks,
Praan


* Re: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core
From: Pranjal Shrivastava @ 2026-06-09 10:45 UTC (permalink / raw)
  To: David Matlack
  Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
	Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
	Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
	Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <aicrNNVrMBtJD2iZ@google.com>

On Mon, Jun 08, 2026 at 08:51:00PM +0000, David Matlack wrote:
> On 2026-06-05 05:41 AM, Pranjal Shrivastava wrote:
> > On Fri, May 22, 2026 at 08:23:59PM +0000, David Matlack wrote:
> > > Set up a File-Lifecycle-Bound (FLB) handler for the PCI core to enable
> > > it to participate in the preservation of PCI devices across Live Update.
> > > Essentially, this commit enables the PCI core to allocate a struct
> > > (struct pci_ser) and preserve it across a Live Update whenever at least
> > > one device is preserved.
> > > 
> > > Preserving PCI devices across Live Update is built on top of the Live
> > > Update Orchestrator's (LUO) support for file preservation. Drivers are
> > > expected to expose a file to userspace to represent a single PCI device
> > > and support preservation of that file. This is intended primarily to
> > > support preservation of PCI devices bound to VFIO drivers.
> > > 

[...]

> > > + * struct pci_ser - PCI Subsystem Live Update State
> > > + *
> > > + * This struct tracks state about all devices that are being preserved across
> > > + * a Live Update for the next kernel.
> > > + *
> > > + * @max_nr_devices: The length of the devices[] flexible array.
> > > + * @nr_devices: The number of devices that were preserved.
> > > + * @devices: Flexible array of pci_dev_ser structs for each device.
> > > + */
> > > +struct pci_ser {
> > > +	u32 max_nr_devices;
> > > +	u32 nr_devices;
> > > +	struct pci_dev_ser devices[];
> > > +} __packed;
> > > +
> > > +/* Ensure all elements of devices[] are naturally aligned. */
> > > +static_assert(offsetof(struct pci_ser, devices) % sizeof(unsigned long) == 0);
> > > +static_assert(sizeof(struct pci_dev_ser) % sizeof(unsigned long) == 0);
> > 
> > Minor Nit: Shall we consider using specific bitwidth types here?
> > I'm wondering if down the line another u32 field is added to 
> > struct pci_dev_ser.. in that case on a 32-bit machine 12 % 4 == 0 but on
> > a 64-bit machine 12 % 8 != 0..
> 
> I think natural alignment is what matters for efficient access of the
> array elements. So failing the assert only on 64-bit architectures seems
> like the correct behavior.
> 

Ack. I guess what I was trying to say is that we'd need to keep both
architectures in mind anyway: any new member has to keep the struct
64-bit aligned (which implicitly also covers 32-bit), since the current
assert must hold on both 32-bit and 64-bit builds. I guess we can keep
this as is.
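
To make the 12 % 4 vs 12 % 8 case concrete, here is a minimal userspace
sketch. The demo_* structs and demo_size_ok() are hypothetical and do
not reflect the real struct pci_dev_ser layout; they only model an
element that grew by one u32:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element: three u32 fields, so 12 bytes when packed. */
struct demo_dev_ser {
	uint32_t a;
	uint32_t b;
	uint32_t c;
} __attribute__((packed));

/* Same element with explicit padding up to a multiple of 8 bytes. */
struct demo_dev_ser_padded {
	uint32_t a;
	uint32_t b;
	uint32_t c;
	uint32_t reserved;
} __attribute__((packed));

/* Holds on both 32-bit and 64-bit builds. */
static_assert(sizeof(struct demo_dev_ser_padded) % 8 == 0,
	      "element size is not a multiple of 8");

/* Models the existing assert for a given sizeof(unsigned long). */
static int demo_size_ok(size_t elem_size, size_t long_size)
{
	return elem_size % long_size == 0;
}
```

With these definitions the 12-byte element passes for a 4-byte long and
fails for an 8-byte long, while the padded one passes for both.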

Thanks,
Praan

* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-06-09 10:36 UTC (permalink / raw)
  To: Nico Pache, David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcAhw8V+_dYcrqmtZ9ht4Pqz5PPB8EOcDrVCp4DA4y7pLg@mail.gmail.com>



On 2026/6/9 17:32, Nico Pache wrote:
> On Tue, Jun 9, 2026 at 3:26 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 6/9/26 11:06, Nico Pache wrote:
>>> On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>>
>>>> On 6/6/26 12:28, Lance Yang wrote:
>>>>>
>>>>>
>>>>> Looks broken for swap PTEs in PMD collapse ...
>>>>>
>>>>> collapse_scan_pmd() allows them up to max_ptes_swap and record them in
>>>>> unmapped, but they don't get a bit in mthp_present_ptes. And then
>>>>> mthp_collapse() does the check above:
>>>>
>>>> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>>>>
>>>>          if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>>>>                  max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>>>>
>>>> But we perform the check a second time.
>>>>
>>>>>
>>>>> nr_occupied_ptes >= nr_ptes - max_ptes_none
>>>>>
>>>>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
>>>>> call collapse_huge_page() for PMD order.
>>>>>
>>>>> Shouldn't we account for them in the PMD-order check? Something like:
>>>>>
>>>>> if (is_pmd_order(order))
>>>>>        nr_occupied_ptes += unmapped;
>>>
>>> This solution seems good for a temporary fixup. but longterm we may
>>> want something else. I'm still not sure how we plan on supporting
>>> swapin without causing creep. So I'd be ok with adding a fix for
>>> legacy PMD behavior until we know how to handle mTHP creep correctly.
>>>
>>>> As an alternative, we could either 1) skip the check there for
>>>> pmd order (as the check was already done); or 2) introduce+maintain
>>>> a bitmap that tracks non-present PTEs.
>>>>
>>>> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>>>>                  nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>>>                                                        offset + nr_ptes);
>>>>
>>>> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>>> +               /* Check was already done in the caller. */
>>>> +               if (is_pmd_order(order) ||
>>>> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>>>                          enum scan_result ret;
>>>>
>>>>                          collapse_address = address + offset * PAGE_SIZE;
>>>>
>>>> 2) would probably be cleanest long-term.
>>>
>>> That would be best for future swapin support in mTHP, but I still
>>> don't think it solves the creep issue.
>>
>> It wouldn't, we'd simply maintain the state we collect + rely on in separate
>> bitmaps. On swapin, we'd have to update/refresh bitmaps I guess.
> 
> Yeah, I'm saying for the future, it obviously solves this issue here
> as well, but if we have positional tracking of the swapout, shared,
> and none PTEs, I think we can use this to determine whether the
> collapse would lead to creep. If we detect creep would happen it may
> be best to automatically collapse to the N+1 (or greater) candidate.
> Just thinking out loud here.
> 
>>
>>> Perhaps we could combine the
>>> two bitmaps to determine if it would make the future collapse eligible
>>> again? Not sure but ill start thinking about it.
>>>
>>> Should I send a fixup for this using Lance's solution? Or does Lance
>>> want to send a patch out with the fixes tag?
>>
>> If Lance could send a fixup, explaining the situation, that would be nice.

Sure, happy to send a fixup :P

Should I send it as a fixup to be folded into this patch, or as a
separate patch with a Fixes: tag?

Will get one out soon :)
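
For reference, the failing case can be modelled in userspace roughly as
below. This is only a sketch: the demo_* helpers are hypothetical, and
512 PTEs per PMD is an assumption (4K pages, 2M PMD).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PMD_NR_PTES 512	/* assumption: 4K pages, 2M PMD */

/* Check as written today: only present PTEs count as occupied. */
static bool demo_check_current(unsigned int present, unsigned int nr_ptes,
			       unsigned int max_ptes_none)
{
	assert(max_ptes_none <= nr_ptes);
	return present >= nr_ptes - max_ptes_none;
}

/* Check with swap PTEs accounted for at PMD order only. */
static bool demo_check_fixed(unsigned int present, unsigned int unmapped,
			     unsigned int nr_ptes, unsigned int max_ptes_none,
			     bool pmd_order)
{
	unsigned int nr_occupied = present;

	assert(max_ptes_none <= nr_ptes);
	if (pmd_order)
		nr_occupied += unmapped;
	return nr_occupied >= nr_ptes - max_ptes_none;
}
```

With max_ptes_none=0, 511 present PTEs and one swap PTE, the first
helper rejects the range and the second accepts it for PMD order only.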

> OK, I'd appreciate that :)

Cheers!


* Re: [PATCH v5 0/4] Enable sysfs module symlink for more built-in drivers
From: Danilo Krummrich @ 2026-06-09 10:29 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: Shashank Balaji, James Clark, Alexander Shishkin,
	Greg Kroah-Hartman, Rafael J . Wysocki, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, Jonathan Corbet, Shuah Khan,
	Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Mike Leach, Leo Yan, Thierry Reding,
	Jonathan Hunter, Rahul Bukte, linux-kernel, coresight,
	linux-arm-kernel, driver-core, rust-for-linux, linux-doc,
	Daniel Palmer, Tim Bird, linux-modules, linux-tegra, Sumit Gupta
In-Reply-To: <1c8e441a-6b33-465a-88f9-9552f346ae18@arm.com>

On Tue Jun 9, 2026 at 11:08 AM CEST, Suzuki K Poulose wrote:
> On 08/06/2026 23:24, Danilo Krummrich wrote:
>> On Mon, 18 May 2026 19:19:56 +0900, Shashank Balaji wrote:
>>> [PATCH v5 0/4] Enable sysfs module symlink for more built-in drivers
>> 
>> Applied, thanks!
>> 
>>    Branch: driver-core-testing
>>    Tree:   git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git
>> 
>> [1/4] soc/tegra: cbb: Move driver registration from pure_initcall to core_initcall
>>        commit: cd6e95e7ab29
>> [2/4] kernel: param: initialize module_kset in a pure_initcall
>>        commit: c82dfce47833
>> [3/4] coresight: pass THIS_MODULE implicitly through a macro
>>        commit: efc22b3f89a3
>> [4/4] driver core: platform: set mod_name in driver registration
>>        commit: a7a7dc5c46a0
>> 
>> The patches will appear in the next linux-next integration (typically within 24
>> hours on weekdays).
>> 
>> The patches are in the driver-core-testing branch and will be promoted to
>> driver-core-next after validation.
>
> Apologies, I missed your emails. I am fine with those, happy to fixup 
> anything if the linux-next screams.

Thanks for confirming! I did a test merge with linux-next and an allmodconfig
arm64 build before picking it up, so it should be fine.

Thanks,
Danilo

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-09 10:21 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand (Arm), Miaohe Lin, linux-mm, linux-kernel,
	linux-doc, linux-kselftest, linux-trace-kernel, kernel-team,
	Andrew Morton, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <f21d7c12-e6c7-49b0-8d83-c26946d0d4ee@linux.dev>

On Tue, Jun 09, 2026 at 05:08:14PM +0800, Lance Yang wrote:
> 
> 
> On 2026/6/9 15:09, David Hildenbrand (Arm) wrote:
> > On 6/9/26 04:39, Miaohe Lin wrote:
> > > On 2026/6/8 22:15, Breno Leitao wrote:
> > > > On Fri, Jun 05, 2026 at 11:42:53AM +0200, David Hildenbrand (Arm) wrote:
> > > > > 
> > > > > I mean, any such races can currently already happen one way or the other?
> > > > > 
> > > > > Really, the only way to not get races is to tryget the (compound)page,
> > > > > revalidate that the page is still part of the compound page.
> > > > > 
> > > > > I'm not sure if that's really a good idea.
> > > > > 
> > > > > But my memory is a bit vague in which scenarios we already hold a page reference
> > > > > here to prevent any concurrent freeing?
> > > > 
> > > > No, we don't hold one here in the case that matters.
> > > > 
> > > > HWPoisonKernelOwned() runs at the very top of get_any_page(), before
> > > > try_again: and before __get_hwpoison_page(). The first refcount taken in
> > > > the whole path is the folio_try_get() inside __get_hwpoison_page(), which
> > > > runs *after* the short-circuit.
> > > > 
> > > > So get_any_page() itself never holds a reference at the check -- the only way
> > > > one exists is if the caller passed MF_COUNT_INCREASED (count_increased ==
> > > > true).
> > > > 
> > > > So on the MCE/GHES path -- the one this panic option exists for -- no
> > > > reference is held when HWPoisonKernelOwned() does its compound_head() +
> > > > PageSlab()/PageTable()/PageLargeKmalloc() checks.
> > > > 
> > > > Given that, I'd rather keep it racy and take no refcount than add a
> > > > tryget + revalidate purely for this check. As I've said earlier, an operator
> > > 
> > > Would it be acceptable to add a simple recheck? Something like below:
> > > 
> > > retry:
> > > head = compound_head(page);
> > > PageSlab()/PageTable()/PageLargeKmalloc() checks
> > > if (head != compound_head(page))
> > > 	goto retry
> > 
> > Sure. I guess it could still be racy in some weird scenarios where we
> > free+allocate+free in-between.
> 
> +1, sounds reasonable to me. Still racy, but acceptable here I guess :D

Ack. I will post v9 shortly with this plus a couple of selftest fixes
Sashiko flagged.
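
Roughly, the recheck has the shape below. This is a userspace model
with C11 atomics, not the kernel code: struct demo_page, its head
pointer and DEMO_PG_KERNEL_OWNED are hypothetical stand-ins for
compound_head() and the PageSlab()/PageTable()/PageLargeKmalloc() checks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PG_KERNEL_OWNED 0x1u

struct demo_page {
	_Atomic(struct demo_page *) head;	/* stand-in for compound_head() */
	atomic_uint flags;			/* stand-in for the page type checks */
};

/* Read the head, test its flags, and retry if the head changed meanwhile. */
static bool demo_kernel_owned(struct demo_page *page)
{
	struct demo_page *head;
	bool owned;

	assert(page != NULL);
	do {
		head = atomic_load(&page->head);
		owned = atomic_load(&head->flags) & DEMO_PG_KERNEL_OWNED;
	} while (head != atomic_load(&page->head));

	return owned;
}
```

As noted above, this only narrows the window: a free+allocate+free
between the two head reads can still go unnoticed.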

* Re: [PATCH 2/2] module: restrict autoload to CAP_SYS_ADMIN if CONFIG_MODULE_RESTRICT_AUTOLOAD
From: Michal Gorlas @ 2026-06-09 10:19 UTC (permalink / raw)
  To: Sami Tolvanen
  Cc: Jonathan Corbet, Shuah Khan, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Aaron Tomlin, linux-doc, linux-kernel,
	linux-modules
In-Reply-To: <20260605183002.GB2939956@google.com>

On Fri Jun 5, 2026 at 8:30 PM CEST, Sami Tolvanen wrote:
> On Fri, May 15, 2026 at 07:20:20PM +0200, Michal Gorlas wrote:
>> Restrict module auto-loading to CAP_SYS_ADMIN if
>> CONFIG_MODULE_RESTRICT_AUTOLOAD is enabled, cmdline parameter
>> modrestrict=true, or kernel.modrestrict=1 is set with sysctl.
>> 
>> Signed-off-by: Michal Gorlas <michal.gorlas@9elements.com>
>> ---
>>  kernel/module/internal.h |  1 +
>>  kernel/module/kmod.c     |  5 +++++
>>  kernel/module/main.c     | 11 +++++++++++
>>  3 files changed, 17 insertions(+)
>> 
>> diff --git a/kernel/module/internal.h b/kernel/module/internal.h
>> index 061161cc79d9..496d8703f0c6 100644
>> --- a/kernel/module/internal.h
>> +++ b/kernel/module/internal.h
>> @@ -46,6 +46,7 @@ struct kernel_symbol {
>>  
>>  extern struct mutex module_mutex;
>>  extern struct list_head modules;
>> +extern bool module_autoload_restrict;
>>  
>>  extern const struct module_attribute *const modinfo_attrs[];
>>  extern const size_t modinfo_attrs_count;
>> diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
>> index a25dccdf7aa7..58b28c23f571 100644
>> --- a/kernel/module/kmod.c
>> +++ b/kernel/module/kmod.c
>> @@ -156,6 +156,11 @@ int __request_module(bool wait, const char *fmt, ...)
>>  	if (ret)
>>  		return ret;
>>  
>> +	if (module_autoload_restrict && !capable(CAP_SYS_ADMIN)) {
>> +		pr_alert("denied attempt to auto-load module %s\n", module_name);
>
> Is pr_alert appropriate here or can this be a warning? Also, use the _ratelimited
> variant like the pre-existing warning in this function.

pr_alert was used here in the grsec version (so I assumed it made sense
here), but I agree, pr_warn_ratelimited makes more sense.

Best,
Michal

* Re: [PATCH 1/2] module: add CONFIG_MODULE_RESTRICT_AUTOLOAD
From: Michal Gorlas @ 2026-06-09 10:07 UTC (permalink / raw)
  To: Sami Tolvanen
  Cc: Jonathan Corbet, Shuah Khan, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Aaron Tomlin, linux-doc, linux-kernel,
	linux-modules
In-Reply-To: <20260605182517.GA2939956@google.com>

On Fri Jun 5, 2026 at 8:25 PM CEST, Sami Tolvanen wrote:
> On Fri, May 15, 2026 at 07:20:19PM +0200, Michal Gorlas wrote:
>> Add CONFIG_MODULE_RESTRICT_AUTOLOAD and modrestrict parameter
>> documentation.
>> 
>> Signed-off-by: Michal Gorlas <michal.gorlas@9elements.com>
>> ---
>>  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
>>  kernel/module/Kconfig                           | 15 +++++++++++++++
>>  2 files changed, 20 insertions(+)
>> 
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 03a550630644..1013104f0943 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -4185,6 +4185,11 @@ Kernel parameters
>>  			For details see:
>>  			Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
>>  
>> +	modrestrict=<bool>
>> +			Control the restriction of module auto-loading to
>> +			CAP_SYS_ADMIN. If no <bool> value is specified, this
>> +			is set to the value of CONFIG_MODULE_RESTRICT_AUTOLOAD.
>
> Doesn't this default to true if no bool value is specified? It only uses
> the config if modrestrict is not passed to the kernel at all.

Right. Will adjust the description here.
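
To spell out the semantics the description should capture, a small
userspace sketch follows. demo_modrestrict() is hypothetical, the
accepted spellings are a simplified stand-in for the kernel's bool
parsing, and treating unknown values as false is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * present:        "modrestrict" appeared on the command line
 * val:            text after '=', or NULL for a bare "modrestrict"
 * config_default: value of CONFIG_MODULE_RESTRICT_AUTOLOAD
 */
static bool demo_modrestrict(bool present, const char *val, bool config_default)
{
	assert(present || val == NULL);	/* a value without the parameter is meaningless */

	if (!present)
		return config_default;	/* only case where the config decides */
	if (val == NULL)
		return true;		/* bare parameter enables the restriction */

	/* Simplified parsing; anything else is treated as false (assumption). */
	return !strcmp(val, "1") || !strcmp(val, "y") || !strcmp(val, "true");
}
```

So the config value only matters when the parameter is absent, and a
bare modrestrict behaves like modrestrict=1.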

>
>>  	<module>.async_probe[=<bool>] [KNL]
>>  			If no <bool> value is specified or if the value
>>  			specified is not a valid <bool>, enable asynchronous
>> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
>> index 43b1bb01fd27..c9e01bb848c0 100644
>> --- a/kernel/module/Kconfig
>> +++ b/kernel/module/Kconfig
>> @@ -337,6 +337,21 @@ config MODULE_SIG_HASH
>>  
>>  endif # MODULE_SIG || IMA_APPRAISE_MODSIG
>>  
>> +config MODULE_RESTRICT_AUTOLOAD
>> +	bool "Restrict module auto-loading to privileged users"
>> +	default n
>
> You don't need to specify default n here.
>
> Also, I think you can just squash the two patches. There's no benefit
> in splitting the config/documentation into a separate patch.

Alright, will squash them in v2.

Best,
Michal


* Re: [PATCH v2] hwmon: add a driver for the temp/voltage sensor on PolarFire SoC
From: Conor Dooley @ 2026-06-09  9:43 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: linux-hwmon, Lars Randers, Conor Dooley, Jonathan Corbet,
	Shuah Khan, Daire McNamara, linux-doc, linux-kernel, linux-riscv,
	Valentina.FernandezAlanis
In-Reply-To: <fd92d7c9-9594-47b9-bd84-a6bd5ebae66d@roeck-us.net>

On Mon, Jun 08, 2026 at 10:03:48AM -0700, Guenter Roeck wrote:
> On 6/3/26 06:19, Conor Dooley wrote:
> > From: Lars Randers <lranders@mail.dk>
> > 
> > Add a driver for the temperature and voltage sensors on PolarFire SoC.
> > The temperature reports how hot the die is, and the voltages are the
> > SoC's 1.05, 1.8 and 2.5 volt rails respectively.
> > 
> > The hardware supports alarms in theory, but there is an erratum that
> > prevents clearing them once triggered, so no support is added for them.
> > 
> > The hardware measures voltage with 16 bits, of which 1 is a sign bit and
> > the remainder holds the voltage as a fixed point integer value. It's
> > improbable that the hardware will work if the voltages are negative, so
> > the driver ignores the sign bits.
> > 
> > There's no dt support etc here because this is the child of a simple-mfd
> > syscon.
> > 
> > Signed-off-by: Lars Randers <lranders@mail.dk>
> > Co-developed-by: Conor Dooley <conor.dooley@microchip.com>
> > Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
> 
> Comments inline.

Cheers.

> > v2:
> > - Fix some minor things pointed out by Sashiko including inaccurate
> >    comments, bounds checking of values read from sysfs and Kconfig
> >    dependencies.
> > - Make update_interval use milliseconds instead of microseconds
> >    (I'll add update_interval_us support when that lands, there's a
> >    proposed workaround for the erratum circulating internally, so it'll
> >    probably come alongside alarm support).
> > 
> > CC: Guenter Roeck <linux@roeck-us.net>
> > CC: Jonathan Corbet <corbet@lwn.net>
> > CC: Shuah Khan <skhan@linuxfoundation.org>
> > CC: Conor Dooley <conor.dooley@microchip.com>
> > CC: Daire McNamara <daire.mcnamara@microchip.com>
> > CC: linux-hwmon@vger.kernel.org
> > CC: linux-doc@vger.kernel.org
> > CC: linux-kernel@vger.kernel.org
> > CC: linux-riscv@lists.infradead.org
> > CC: Valentina.FernandezAlanis@microchip.com

> > +Usage Notes
> > +-----------
> > +
> > +update_interval has a permitted range of 0 to 8.
> > +
> > +
> 
> It might make sense to document what "0" means.

Sure. The interval governs how much of a delay there is between the end
of one measurement and the start of the next one. Zero means no delay,
both here and in the register. Think that answers your question below
too?

> > +static int mpfs_tvs_temp_read(struct mpfs_tvs *data, u32 attr, long *val)
> > +{
> > +	u32 tmp, control;
> > +
> > +	if (attr != hwmon_temp_input && attr != hwmon_temp_enable)
> > +		return -EOPNOTSUPP;
> > +
> > +	regmap_read(data->regmap, MPFS_TVS_CTRL, &control);
> > +
> > +	if (attr == hwmon_temp_enable) {
> > +		*val = FIELD_GET(MPFS_TVS_CTRL_TEMP_ENABLE, control);
> > +		return 0;
> > +	}
> > +
> > +	if (!(control & MPFS_TVS_CTRL_TEMP_VALID))
> > +		return -EINVAL;
> > +
> "Invalid argument" can not be correct for data read from the chip.
> I don't know what this means. It should be either -ENODATA (no data available)
> if this is transient or -EIO (I/O error) if it is a permanent problem.
> The same applies to other validation checks.

-ENODATA then. Realistically it's only possible to hit this when the
channel is disabled, although you can also hit it in the gap
between the channel being enabled and the first measurement becoming
available.


> > +	regmap_read(data->regmap, MPFS_TVS_OUTPUT1, &tmp);
> > +	*val = FIELD_GET(MPFS_OUTPUT1_TEMP_MASK, tmp);
> > +	*val -= MPFS_TVS_K_TO_C;
> > +	*val = (1000 * *val) >> 4; /* fixed point (11.4) to millidegrees */
> > +
> > +	return 0;
> > +}

> > +static int mpfs_tvs_interval_write(struct mpfs_tvs *data, u32 attr, long val)
> > +{
> > +	unsigned long temp = val;
> > +
> > +	if (attr != hwmon_chip_update_interval)
> > +		return -EOPNOTSUPP;
> > +
> > +	temp *= 1000;
> 
> This is likely to result in overflow issues (for example if val == LONG_MAX).
> 
> > +	temp /= MPFS_TVS_INTERVAL_SCALE;
> > +
> > +	/*
> > +	 * The value is 8 bits wide, but 255 is described as
> > +	 * "255= Do single set of transfers when scoverride set"
> > +	 * but there's no scoverride bit in the tvs register region.
> > +	 * Ban using 255 since its behaviour is suspect.
> > +	 */
> > +	if (temp > 254)
> > +		return -EINVAL;
> 
> Hardware monitoring drivers should use clamp() and not return -EINVAL
> for ranges such as this. Since the valid range (in ms) is 0..8, I would
> suggest to clamp val to (0, 8) before any calculations to also avoid

Sure, I'll do that.

> the overflow issue mentioned above. That makes me wonder: What does "0"
> stand for ? 32 us or 0 us ? It does not make a difference here, but it
> may be relevant when microsecond intervals are implemented.

I think I answered this above, but 0 means 0 us between the end of a
measurement/conversion and the start of the next one. 
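
A rough userspace sketch of the clamp-first approach, so the
multiplication can no longer overflow. The demo_* names are
hypothetical, and the 32 us register unit is an assumption for
illustration, not a value taken from the patch:

```c
#include <assert.h>

#define DEMO_INTERVAL_MAX_MS	8	/* documented range is 0 to 8 */
#define DEMO_INTERVAL_SCALE_US	32	/* assumption, not taken from the patch */
#define DEMO_INTERVAL_REG_MAX	254	/* 255 is banned in the driver */

static long demo_clamp(long val, long lo, long hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Clamp the millisecond value first, then convert to register units. */
static unsigned long demo_interval_to_reg(long val_ms)
{
	unsigned long reg;

	val_ms = demo_clamp(val_ms, 0, DEMO_INTERVAL_MAX_MS);
	reg = (unsigned long)val_ms * 1000 / DEMO_INTERVAL_SCALE_US;
	assert(reg <= DEMO_INTERVAL_REG_MAX);
	return reg;
}
```

Because the input is bounded to 0..8 before the multiply, out-of-range
writes saturate rather than being rejected with -EINVAL.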

> > +
> > +	temp <<= MPFS_TVS_INTERVAL_OFFSET;
> > +	regmap_update_bits(data->regmap, MPFS_TVS_CTRL,
> > +			   MPFS_TVS_INTERVAL_MASK, temp);
> 
> If regmap never returns errors this needs to be documented in the driver.

It's an mmio regmap via a syscon, it evaluates to readl()/writel() so
there's nothing that can fail /and/ return an error.
I can add if (ret) return ret, but I don't think there's a clean place
to put a comment about it.


* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-09  9:32 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <b7fb4184-7a99-42c7-8ee2-4c7fa20827c4@kernel.org>

On Tue, Jun 9, 2026 at 3:26 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 6/9/26 11:06, Nico Pache wrote:
> > On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> >>
> >> On 6/6/26 12:28, Lance Yang wrote:
> >>>
> >>>
> >>> Looks broken for swap PTEs in PMD collapse ...
> >>>
> >>> collapse_scan_pmd() allows them up to max_ptes_swap and record them in
> >>> unmapped, but they don't get a bit in mthp_present_ptes. And then
> >>> mthp_collapse() does the check above:
> >>
> >> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
> >>
> >>         if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> >>                 max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >>
> >> But we perform the check a second time.
> >>
> >>>
> >>> nr_occupied_ptes >= nr_ptes - max_ptes_none
> >>>
> >>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
> >>> call collapse_huge_page() for PMD order.
> >>>
> >>> Shouldn't we account for them in the PMD-order check? Something like:
> >>>
> >>> if (is_pmd_order(order))
> >>>       nr_occupied_ptes += unmapped;
> >
> > This solution seems good for a temporary fixup. but longterm we may
> > want something else. I'm still not sure how we plan on supporting
> > swapin without causing creep. So I'd be ok with adding a fix for
> > legacy PMD behavior until we know how to handle mTHP creep correctly.
> >
> >> As an alternative, we could either 1) skip the check there for
> >> pmd order (as the check was already done); or 2) introduce+maintain
> >> a bitmap that tracks non-present PTEs.
> >>
> >> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> >>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> >>                                                       offset + nr_ptes);
> >>
> >> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >> +               /* Check was already done in the caller. */
> >> +               if (is_pmd_order(order) ||
> >> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >>                         enum scan_result ret;
> >>
> >>                         collapse_address = address + offset * PAGE_SIZE;
> >>
> >> 2) would probably be cleanest long-term.
> >
> > That would be best for future swapin support in mTHP, but I still
> > don't think it solves the creep issue.
>
> It wouldn't, we'd simply maintain the state we collect + rely on in separate
> bitmaps. On swapin, we'd have to update/refresh bitmaps I guess.

Yeah, I'm saying for the future, it obviously solves this issue here
as well, but if we have positional tracking of the swapout, shared,
and none PTEs, I think we can use this to determine whether the
collapse would lead to creep. If we detect creep would happen it may
be best to automatically collapse to the N+1 (or greater) candidate.
Just thinking out loud here.

>
> > Perhaps we could combine the
> > two bitmaps to determine if it would make the future collapse eligible
> > again? Not sure but ill start thinking about it.
> >
> > Should I send a fixup for this using Lance's solution? Or does Lance
> > want to send a patch out with the fixes tag?
>
> If Lance could send a fixup, explaining the situation, that would be nice.

OK, I'd appreciate that :)

Cheers,
-- Nico

>
> --
> Cheers,
>
> David
>


* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: David Hildenbrand (Arm) @ 2026-06-09  9:25 UTC (permalink / raw)
  To: Nico Pache, Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcBY_2372eJru8VoCq90rUMxn7w23hHou68MmXRv48NRXg@mail.gmail.com>

On 6/9/26 11:06, Nico Pache wrote:
> On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 6/6/26 12:28, Lance Yang wrote:
>>>
>>>
>>> Looks broken for swap PTEs in PMD collapse ...
>>>
>>> collapse_scan_pmd() allows them up to max_ptes_swap and record them in
>>> unmapped, but they don't get a bit in mthp_present_ptes. And then
>>> mthp_collapse() does the check above:
>>
>> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>>
>>         if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>>                 max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>>
>> But we perform the check a second time.
>>
>>>
>>> nr_occupied_ptes >= nr_ptes - max_ptes_none
>>>
>>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
>>> call collapse_huge_page() for PMD order.
>>>
>>> Shouldn't we account for them in the PMD-order check? Something like:
>>>
>>> if (is_pmd_order(order))
>>>       nr_occupied_ptes += unmapped;
> 
> This solution seems good for a temporary fixup. but longterm we may
> want something else. I'm still not sure how we plan on supporting
> swapin without causing creep. So I'd be ok with adding a fix for
> legacy PMD behavior until we know how to handle mTHP creep correctly.
> 
>> As an alternative, we could either 1) skip the check there for
>> pmd order (as the check was already done); or 2) introduce+maintain
>> a bitmap that tracks non-present PTEs.
>>
>> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>                                                       offset + nr_ptes);
>>
>> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> +               /* Check was already done in the caller. */
>> +               if (is_pmd_order(order) ||
>> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>                         enum scan_result ret;
>>
>>                         collapse_address = address + offset * PAGE_SIZE;
>>
>> 2) would probably be cleanest long-term.
> 
> That would be best for future swapin support in mTHP, but I still
> don't think it solves the creep issue. 

It wouldn't, we'd simply maintain the state we collect + rely on in separate
bitmaps. On swapin, we'd have to update/refresh bitmaps I guess.
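
As a very rough userspace sketch of what I mean by separate bitmaps
(all demo_* names are hypothetical, 512 PTEs per PMD is assumed, and
this is not the mthp_present_ptes code):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_PTES	512			/* assumption: PTEs per PMD */
#define DEMO_WORDS	(DEMO_NR_PTES / 64)

struct demo_scan_state {
	uint64_t present[DEMO_WORDS];		/* present PTEs */
	uint64_t nonpresent[DEMO_WORDS];	/* non-present PTEs, e.g. swap entries */
};

static void demo_set(uint64_t *map, unsigned int bit)
{
	assert(bit < DEMO_NR_PTES);
	map[bit / 64] |= (uint64_t)1 << (bit % 64);
}

/* Number of set bits in [start, end). */
static unsigned int demo_weight(const uint64_t *map, unsigned int start,
				unsigned int end)
{
	unsigned int i, n = 0;

	assert(start <= end && end <= DEMO_NR_PTES);
	for (i = start; i < end; i++)
		n += (map[i / 64] >> (i % 64)) & 1;
	return n;
}
```

Each order's check could then weigh both maps over its own range; on
swapin the corresponding bits would have to move from one map to the
other.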

> Perhaps we could combine the
> two bitmaps to determine if it would make the future collapse eligible
> again? Not sure but ill start thinking about it.
> 
> Should I send a fixup for this using Lance's solution? Or does Lance
> want to send a patch out with the fixes tag?

If Lance could send a fixup, explaining the situation, that would be nice.

-- 
Cheers,

David

* Re: [PATCH v2] platform/x86: thinkpad_acpi: Add USB-C Security (USCS) support
From: Ilpo Järvinen @ 2026-06-09  9:12 UTC (permalink / raw)
  To: Vishnu Sankar
  Cc: Mark Pearson, hmh, Hans de Goede, corbet, derekjohn.clark, skhan,
	LKML, ibm-acpi-devel, linux-doc, platform-driver-x86, vsankar
In-Reply-To: <20260609041402.328509-1-vishnuocv@gmail.com>

On Tue, 9 Jun 2026, Vishnu Sankar wrote:

> Newer ThinkPad systems expose a USB-C Security (Restricted Mode) feature.
> When active, USB-C data connections are disabled while power delivery is
> preserved. This is useful for kiosk and physically-secured deployments.
> 
> Hardware interface:
> 
> The HKEY device exposes a read-only ACPI method USCS():
> 
>   Return value bit layout:
>     Bit 16 : Capability flag (1 = feature present on this SKU)
>     Bit  0 : Current state  (0 = security OFF, 1 = security ON)
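
For reference, decoding that layout could look roughly like the
userspace sketch below. The DEMO_* masks and demo_uscs_decode() are
hypothetical names, not the ones used in the patch; only the bit
positions and the -ENODEV convention come from the description and
changelog:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_USCS_CAPABLE	(UINT32_C(1) << 16)	/* feature present on this SKU */
#define DEMO_USCS_ENABLED	(UINT32_C(1) << 0)	/* security currently ON */

/*
 * Returns 0 and fills *enabled when the capability bit is set.
 * Returns -ENODEV and leaves *enabled untouched otherwise.
 */
static int demo_uscs_decode(uint32_t uscs, bool *enabled)
{
	assert(enabled != NULL);

	if (!(uscs & DEMO_USCS_CAPABLE))
		return -ENODEV;

	*enabled = uscs & DEMO_USCS_ENABLED;
	return 0;
}
```

Leaving *enabled untouched on failure means callers must check the
return value before using it.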
> 
> The sysfs attribute is read-only.
> 
> The Fn+U followed by Fn+S hotkey chord is the only way to toggle the
> hardware state.
> 
> Hotkey:
> 
> Fn+U followed by Fn+S generates HKEY event 0x131e.
> 
> sysfs interface:
> 
>   /sys/devices/platform/thinkpad_acpi/usb_c_security  (read-only)
>   "enabled\n"  -- data connections are currently blocked
>   "disabled\n" -- data connections are currently allowed
> 
>   The attribute is hidden on SKUs where the USCS capability bit (bit 16)
>   is not set, so there is no ABI impact on unsupported hardware.
> 
> Suggested-by: Mark Pearson <mpearson-lenovo@squebb.ca>
> Signed-off-by: Vishnu Sankar <vishnuocv@gmail.com>
> ---
> Changes since v1:
> - Use guard(mutex) from cleanup.h instead of manual mutex_lock/unlock
> - Revert usbc_security_query() to return int (-EIO/-ENODEV/0) instead
>   of bool to avoid uninitialized *enabled bug on unsupported platforms
> - Remove !! when assigning to bool in usbc_security_query()
> - Remove dead tp_features.usbc_security_supported check in show()
>   since is_visible() already gates the attribute on unsupported SKUs
> - Use str_enabled_disabled() from string_choices.h in show()
> - Fix uninitialized *enabled bug in tpacpi_usbc_security_init() by
>   only assigning usbc_security_enabled after a successful query
> ---
>  .../admin-guide/laptops/thinkpad-acpi.rst     |  24 ++++
>  drivers/platform/x86/lenovo/thinkpad_acpi.c   | 118 ++++++++++++++++++
>  2 files changed, 142 insertions(+)
> 
> diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
> index f874db31801d..db4588af0278 100644
> --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst
> +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
> @@ -1543,6 +1543,30 @@ Values:
>  
>  	This setting can also be toggled via the Fn+doubletap hotkey.
>  
> +USB-C Security
> +--------------
> +
> +sysfs: usb_c_security
> +
> +Reports the current state of the USB-C Security (Restricted Mode) feature
> +on supported ThinkPad systems. When enabled, USB-C data connections are
> +disabled while power delivery is preserved.
> +
> +The available command is::
> +
> +        cat /sys/devices/platform/thinkpad_acpi/usb_c_security
> +
> +Values:
> +
> +	* ``enabled``  - USB-C data connections are currently blocked
> +	* ``disabled`` - USB-C data connections are currently allowed
> +
> +The attribute is read-only. The USB-C Security state can only be toggled
> +via the Fn+U followed by Fn+S hotkey chord.
> +
> +The sysfs attribute is not created on platforms that do not support this
> +feature.
> +
>  Auxmac
>  ------
>  
> diff --git a/drivers/platform/x86/lenovo/thinkpad_acpi.c b/drivers/platform/x86/lenovo/thinkpad_acpi.c
> index e1cee42a1683..379769b62c80 100644
> --- a/drivers/platform/x86/lenovo/thinkpad_acpi.c
> +++ b/drivers/platform/x86/lenovo/thinkpad_acpi.c
> @@ -38,6 +38,7 @@
>  #include <linux/backlight.h>
>  #include <linux/bitfield.h>
>  #include <linux/bitops.h>
> +#include <linux/cleanup.h>
>  #include <linux/delay.h>
>  #include <linux/dmi.h>
>  #include <linux/freezer.h>
> @@ -66,6 +67,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/slab.h>
>  #include <linux/string.h>
> +#include <linux/string_choices.h>
>  #include <linux/string_helpers.h>
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
> @@ -185,6 +187,7 @@ enum tpacpi_hkey_event_t {
>  	TP_HKEY_EV_AMT_TOGGLE		= 0x131a, /* Toggle AMT on/off */
>  	TP_HKEY_EV_CAMERASHUTTER_TOGGLE = 0x131b, /* Toggle Camera Shutter */
>  	TP_HKEY_EV_DOUBLETAP_TOGGLE	= 0x131c, /* Toggle trackpoint doubletap on/off */
> +	TP_HKEY_EV_USB_C_SECURITY	= 0x131e, /* Toggle USB C Security ON/OFF */
>  	TP_HKEY_EV_PROFILE_TOGGLE	= 0x131f, /* Toggle platform profile in 2024 systems */
>  	TP_HKEY_EV_PROFILE_TOGGLE2	= 0x1401, /* Toggle platform profile in 2025 + systems */
>  
> @@ -373,6 +376,8 @@ static struct {
>  	u32 has_adaptive_kbd:1;
>  	u32 kbd_lang:1;
>  	u32 trackpoint_doubletap_enable:1;
> +	u32 usbc_security_supported:1;
> +	u32 usbc_security_enabled:1;

Sashiko (sashiko.dev) warned that there may be concurrent, unprotected
updates to these bitfields (changing a bitfield requires an unsafe RMW). It
looks like a pre-existing problem, at least for trackpoint_doubletap_enable
(maybe others).

To avoid adding to the problems, usbc_security_enabled should be added
outside the bitfield so that no locking is needed for this bitfield.

And trackpoint_doubletap_enable (and possibly others), which are touched in
the notify context or in a sysfs write, should be fixed in a separate patch
(can be done after this series as it's a pre-existing problem for them).

Anything that is only written during init is fine inside the bitfield.
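
A minimal userspace sketch of that split, assuming nothing about the
final driver layout: the struct, variable and function names below are
hypothetical and only model the idea. Init-only flags share a bitfield
word, while the runtime-updated state gets its own object so that a
store to it never rewrites neighbouring bits.

```c
#include <stdbool.h>

/*
 * Illustrative model only, not the driver's real tp_features struct.
 * Flags written once during init can safely share one bitfield word.
 */
struct features_sketch {
	unsigned int usbc_security_supported:1;	/* written once at init */
	unsigned int other_init_only_flag:1;	/* hypothetical neighbour */
};

static struct features_sketch features_sketch;

/*
 * State updated at runtime (e.g. from the hotkey notify path) lives in
 * its own object, so updating it is a plain store and not a
 * read-modify-write of the word that holds the flags above.
 */
static bool usbc_security_enabled_sketch;

static void sketch_init(bool supported, bool enabled)
{
	features_sketch.usbc_security_supported = supported;
	usbc_security_enabled_sketch = enabled;
}

static void sketch_hotkey_update(bool enabled)
{
	/* No neighbouring bits are read or rewritten here. */
	usbc_security_enabled_sketch = enabled;
}
```

In kernel code the reader and writer of such a variable would probably
also want READ_ONCE()/WRITE_ONCE(); that is left out of this model.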

>  	struct quirk_entry *quirks;
>  } tp_features;
>  
> @@ -11265,6 +11270,112 @@ static struct ibm_struct hwdd_driver_data = {
>  	.name = "hwdd",
>  };
>  
> +/*************************************************************************
> + * USB-C Security subdriver
> + *
> + * HKEY.USCS(0) is a read-only ACPI method; its argument is ignored.
> + * It always returns:
> + *   bit 16 - USB-C security capability present on this SKU or not
> + *   bit  0 - USB-C Security state (enable or disable)
> + *
> + * Hotkey
> + * ------
> + * 0x131e (Fn+U, Fn+S): firmware toggles USBS before firing the event.
> + * The driver reads back the new state and notifies the sysfs attribute.
> + *

Remove the extra line.

> + */
> +
> +/* USCS() return word bit layout */
> +#define USCS_CAP_BIT		BIT(16)	/* capability: feature present on SKU */
> +#define USCS_STATUS_BIT		BIT(0)	/* current security state */
> +
> +static DEFINE_MUTEX(usbc_security_mutex);
> +
> +/*
> + * usbc_security_query - read current USB-C security state via USCS()
> + * @enabled: out - true when security is ON (data connections blocked)
> + *
> + * Returns true if the feature is supported and query succeeded,

The kerneldoc-compatible syntax is:

Returns:

> + * false otherwise (feature absent or ACPI call failed).

Please rewrite this as this function no longer returns true/false. :-)
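
As a rough illustration of what such a comment could look like for an
int-returning helper, here is a userspace sketch. The function name and
the _SKETCH macros are hypothetical, and the ACPI call (and therefore
the -EIO case) is not modelled; only the bit decoding from the patch
description is.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define USCS_CAP_BIT_SKETCH	(1u << 16)	/* feature present on this SKU */
#define USCS_STATUS_BIT_SKETCH	(1u << 0)	/* current security state */

/**
 * uscs_decode_sketch() - decode a raw USCS() return word
 * @status: raw value as returned by the firmware method
 * @enabled: out - true when security is ON (data connections blocked)
 *
 * Returns:
 * * %0       - feature present, @enabled has been written
 * * %-ENODEV - capability bit not set, @enabled is left untouched
 */
static int uscs_decode_sketch(unsigned int status, bool *enabled)
{
	assert(enabled != NULL);

	if (!(status & USCS_CAP_BIT_SKETCH))
		return -ENODEV;

	*enabled = status & USCS_STATUS_BIT_SKETCH;
	return 0;
}
```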

> + */
> +static int usbc_security_query(bool *enabled)
> +{
> +	int status;
> +
> +	guard(mutex)(&usbc_security_mutex);
> +	if (!acpi_evalf(hkey_handle, &status, "USCS", "dd", 0))
> +		return -EIO;
> +
> +	if (!(status & USCS_CAP_BIT)) {
> +		pr_debug("USCS cap bit absent (raw=0x%x)\n", status);
> +		return -ENODEV;
> +	}
> +
> +	*enabled = status & USCS_STATUS_BIT;
> +	return 0;
> +}
> +
> +/* sysfs: /sys/devices/platform/thinkpad_acpi/usb_c_security ---------- */
> +static ssize_t usb_c_security_show(struct device *dev,
> +				   struct device_attribute *attr,
> +				   char *buf)
> +{
> +	return sysfs_emit(buf, "%s\n",
> +			  str_enabled_disabled(tp_features.usbc_security_enabled));
> +}
> +
> +static DEVICE_ATTR_RO(usb_c_security);
> +
> +static struct attribute *usbc_security_attributes[] = {
> +	&dev_attr_usb_c_security.attr,
> +	NULL,
> +};
> +
> +static umode_t usbc_security_attr_is_visible(struct kobject *kobj,
> +					     struct attribute *attr, int n)
> +{
> +	return tp_features.usbc_security_supported ? attr->mode : 0;
> +}
> +
> +static const struct attribute_group usbc_security_attr_group = {
> +	.is_visible = usbc_security_attr_is_visible,
> +	.attrs = usbc_security_attributes,
> +};
> +
> +static int tpacpi_usbc_security_init(struct ibm_init_struct *iibm)
> +{
> +	bool enabled;
> +	int err;
> +
> +	err = usbc_security_query(&enabled);
> +	if (err)
> +		return err == -ENODEV ? 0 : err;

Just split this into two if () + return statements for clarity.
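
Something along these lines, shown as a standalone sketch with a
hypothetical helper name rather than as the actual init function:

```c
#include <errno.h>

/*
 * Hypothetical helper: maps the result of the query to the value the
 * init function would return.
 */
static int init_result_sketch(int err)
{
	/* Feature absent: not an error, the attribute simply stays hidden. */
	if (err == -ENODEV)
		return 0;

	/* Any other failure is propagated to the caller. */
	if (err)
		return err;

	return 0;
}
```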

> +
> +	tp_features.usbc_security_supported = true;
> +	tp_features.usbc_security_enabled = enabled;
> +	return 0;
> +}
> +
> +/* tpacpi_usbc_security_hotkey - handle Fn+U Fn+S hotkey (0x131e) */
> +static bool tpacpi_usbc_security_hotkey(void)
> +{
> +	bool enabled;
> +
> +	if (!tp_features.usbc_security_supported)
> +		return false;
> +
> +	if (usbc_security_query(&enabled))
> +		return false;
> +
> +	tp_features.usbc_security_enabled = enabled;
> +	sysfs_notify(&tpacpi_pdev->dev.kobj, NULL, "usb_c_security");
> +	return true;
> +}
> +
> +static struct ibm_struct usbc_security_driver_data = {
> +	.name = "usbc_security",
> +};
> +
>  /* --------------------------------------------------------------------- */
>  
>  static struct attribute *tpacpi_driver_attributes[] = {
> @@ -11325,6 +11436,7 @@ static const struct attribute_group *tpacpi_groups[] = {
>  	&dprc_attr_group,
>  	&auxmac_attr_group,
>  	&hwdd_attr_group,
> +	&usbc_security_attr_group,
>  	NULL,
>  };
>  
> @@ -11479,6 +11591,8 @@ static bool tpacpi_driver_event(const unsigned int hkey_event)
>  	case TP_HKEY_EV_PROFILE_TOGGLE2:
>  		platform_profile_cycle();
>  		return true;
> +	case TP_HKEY_EV_USB_C_SECURITY:
> +		return tpacpi_usbc_security_hotkey();
>  	}
>  
>  	return false;
> @@ -11930,6 +12044,10 @@ static struct ibm_init_struct ibms_init[] __initdata = {
>  		.init = tpacpi_hwdd_init,
>  		.data = &hwdd_driver_data,
>  	},
> +	{
> +		.init = tpacpi_usbc_security_init,
> +		.data = &usbc_security_driver_data,
> +	},
>  };
>  
>  static int __init set_ibm_param(const char *val, const struct kernel_param *kp)
> 

-- 
 i.


^ permalink raw reply

* Re: [PATCH v5 0/4] Enable sysfs module symlink for more built-in drivers
From: Suzuki K Poulose @ 2026-06-09  9:08 UTC (permalink / raw)
  To: Danilo Krummrich, Shashank Balaji
  Cc: James Clark, Alexander Shishkin, Greg Kroah-Hartman,
	Rafael J . Wysocki, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Jonathan Corbet, Shuah Khan, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Mike Leach,
	Leo Yan, Thierry Reding, Jonathan Hunter, Rahul Bukte,
	linux-kernel, coresight, linux-arm-kernel, driver-core,
	rust-for-linux, linux-doc, Daniel Palmer, Tim Bird, linux-modules,
	linux-tegra, Sumit Gupta
In-Reply-To: <20260608222448.1353773-1-dakr@kernel.org>

On 08/06/2026 23:24, Danilo Krummrich wrote:
> On Mon, 18 May 2026 19:19:56 +0900, Shashank Balaji wrote:
>> [PATCH v5 0/4] Enable sysfs module symlink for more built-in drivers
> 
> Applied, thanks!
> 
>    Branch: driver-core-testing
>    Tree:   git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git
> 
> [1/4] soc/tegra: cbb: Move driver registration from pure_initcall to core_initcall
>        commit: cd6e95e7ab29
> [2/4] kernel: param: initialize module_kset in a pure_initcall
>        commit: c82dfce47833
> [3/4] coresight: pass THIS_MODULE implicitly through a macro
>        commit: efc22b3f89a3
> [4/4] driver core: platform: set mod_name in driver registration
>        commit: a7a7dc5c46a0
> 
> The patches will appear in the next linux-next integration (typically within 24
> hours on weekdays).
> 
> The patches are in the driver-core-testing branch and will be promoted to
> driver-core-next after validation.

Apologies, I missed your emails. I am fine with those and happy to fix up
anything if linux-next screams.

Cheers
Suzuki

^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-06-09  9:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Miaohe Lin, Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <f2a4d5c8-3d7d-4fc3-8769-66e0c24866fb@kernel.org>



On 2026/6/9 15:09, David Hildenbrand (Arm) wrote:
> On 6/9/26 04:39, Miaohe Lin wrote:
>> On 2026/6/8 22:15, Breno Leitao wrote:
>>> On Fri, Jun 05, 2026 at 11:42:53AM +0200, David Hildenbrand (Arm) wrote:
>>>>
>>>> I mean, any such races can currently already happen one way or the other?
>>>>
>>>> Really, the only way to not get races is to tryget the (compound)page,
>>>> revalidate that the page is still part of the compound page.
>>>>
>>>> I'm not sure if that's really a good idea.
>>>>
>>>> But my memory is a bit vague in which scenarios we already hold a page reference
>>>> here to prevent any concurrent freeing?
>>>
>>> No, we don't hold one here in the case that matters.
>>>
>>> HWPoisonKernelOwned() runs at the very top of get_any_page(), before
>>> try_again: and before __get_hwpoison_page(). The first refcount taken in
>>> the whole path is the folio_try_get() inside __get_hwpoison_page(), which
>>> runs *after* the short-circuit.
>>>
>>> So get_any_page() itself never holds a reference at the check -- the only way
>>> one exists is if the caller passed MF_COUNT_INCREASED (count_increased ==
>>> true).
>>>
>>> So on the MCE/GHES path -- the one this panic option exists for -- no
>>> reference is held when HWPoisonKernelOwned() does its compound_head() +
>>> PageSlab()/PageTable()/PageLargeKmalloc() checks.
>>>
>>> Given that, I'd rather keep it racy and take no refcount than add a
>>> tryget + revalidate purely for this check. As I've said earlier, an operator
>>
>> Would it be acceptable to add a simple recheck? Something like below:
>>
>> retry:
>> head = compound_head(page);
>> PageSlab()/PageTable()/PageLargeKmalloc() checks
>> if (head != compound_head(page))
>> 	goto retry
> 
> Sure. I guess it could still be racy in some weird scenarios where we
> free+allocate+free in-between.

+1, sounds reasonable to me. Still racy, but acceptable here I guess :D
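
For reference, the recheck pattern can be modelled in userspace roughly
as below. This is only a sketch: struct page_sketch and both helpers are
hypothetical stand-ins for compound_head() and the
PageSlab()/PageTable()/PageLargeKmalloc() checks, and nothing here takes
a reference, so the free+allocate+free window mentioned above remains.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_sketch {
	struct page_sketch *head;	/* models compound_head(page) */
	bool kernel_owned;		/* models the slab/table/kmalloc checks */
};

static struct page_sketch *head_of_sketch(const struct page_sketch *page)
{
	return page->head;
}

static bool kernel_owned_recheck_sketch(const struct page_sketch *page)
{
	struct page_sketch *head;
	bool owned;

	do {
		head = head_of_sketch(page);
		owned = head->kernel_owned;
		/* Retry if the page now belongs to a different head. */
	} while (head != head_of_sketch(page));

	return owned;
}
```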

^ permalink raw reply

* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-09  9:06 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <2553caae-9e0e-42a7-8b61-d1216f1e81fa@kernel.org>

On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 6/6/26 12:28, Lance Yang wrote:
> >
> > On Fri, Jun 05, 2026 at 10:14:18AM -0600, Nico Pache wrote:
> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> main scanning logic using a bitmap to track occupied pages and the
> >> algorithm to find optimal collapse sizes.
> >>
> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> scanning phase (mmap_read_lock) that determines a potential PMD
> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> phase (mmap_write_lock).
> >>
> >> To enable mTHP collapse we make the following changes:
> >>
> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> orders are enabled, we remove the restriction of max_ptes_none during the
> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> have scanned the full PMD range and updated the bitmap to track occupied
> >> pages, we use the bitmap to find the optimal mTHP size.
> >>
> >> Implement mthp_collapse() to walk forward through the bitmap and
> >> determine the best eligible order for each naturally-aligned region. The
> >> algorithm starts at the beginning of the PMD range and, for each offset,
> >> tries the highest order that fits the alignment. If the number of
> >> occupied PTEs in that region satisfies the max_ptes_none threshold for
> >> that order, a collapse is attempted. On failure, the order is
> >> decremented and the same offset is retried at the next smaller size. Once
> >> the smallest enabled order is exhausted (or a collapse succeeds), the
> >> offset advances past the region just processed, and the next attempt
> >> starts at the highest order permitted by the new offset's natural
> >> alignment.
> >>
> >> The algorithm works as follows:
> >>    1) set offset=0 and order=HPAGE_PMD_ORDER
> >>    2) if the order is not enabled, go to step (5)
> >>    3) count occupied PTEs in the (offset, order) range using
> >>       bitmap_weight_from()
> >>    4) if the count satisfies the max_ptes_none threshold, attempt
> >>       collapse; on success, advance to step (6)
> >>    5) if a smaller enabled order exists, decrement order and retry
> >>       from step (2) at the same offset
> >>    6) advance offset past the current region and compute the next
> >>       order from the new offset's natural alignment via __ffs(offset),
> >>       capped at HPAGE_PMD_ORDER
> >>    7) repeat from step (2) until the full PMD range is covered
> >>
> >> mTHP collapses reject regions containing swapped out or shared pages.
> >> This is because adding new entries can lead to new none pages, and these
> >> may lead to constant promotion into a higher order mTHP. A similar
> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> introducing at least 2x the number of pages, and on a future scan will
> >> satisfy the promotion condition once again. This issue is prevented via
> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> restrictions above.
> >>
> >> We currently only support mTHP collapse for max_ptes_none values of 0
> >> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >>
> >>    - max_ptes_none=0: Never introduce new empty pages during collapse
> >>    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >>      available mTHP order
> >>
> >> Any other max_ptes_none value will emit a warning and default mTHP
> >> collapse to max_ptes_none=0. There should be no behavior change for PMD
> >> collapse.
> >>
> >> Once we determine which mTHP sizes fit best in that PMD range a collapse
> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >>
> >> Currently madv_collapse is not supported and will only attempt PMD
> >> collapse.
> >>
> >> We can also remove the check for is_khugepaged inside the PMD scan as
> >> the collapse_max_ptes_none() function handles this logic now.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >> mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 138 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index ec886a031952..430047316f43 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>
> >> static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>
> >> +#define KHUGEPAGED_MIN_MTHP_ORDER   2
> >> +
> >> struct collapse_control {
> >>      bool is_khugepaged;
> >>
> >> @@ -110,6 +112,9 @@ struct collapse_control {
> >>
> >>      /* nodemask for allocation fallback */
> >>      nodemask_t alloc_nmask;
> >> +
> >> +    /* Each bit represents a single occupied (!none/zero) page. */
> >> +    DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> >> };
> >>
> >> /**
> >> @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >>      return result;
> >> }
> >>
> >> +/* Return the highest naturally aligned order that fits at @offset within a PMD. */
> >> +static unsigned int max_order_from_offset(unsigned int offset)
> >> +{
> >> +    if (offset == 0)
> >> +            return HPAGE_PMD_ORDER;
> >> +
> >> +    return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER);
> >> +}
> >> +
> >> +/*
> >> + * mthp_collapse() consumes the bitmap that is generated during
> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >> + *
> >> + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> >> + * page. We start at the PMD order and check if it is eligible for collapse;
> >> + * if not, we check the left and right halves of the PTE page table we are
> >> + * examining at a lower order.
> >> + *
> >> + * For each of these, we determine how many PTE entries are occupied in the
> >> + * range of PTE entries we propose to collapse, then we compare this to a
> >> + * threshold number of PTE entries which would need to be occupied for a
> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
> >> + *
> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> >> + * mTHP.
> >> + */
> >> +static enum scan_result mthp_collapse(struct mm_struct *mm,
> >> +            unsigned long address, int referenced, int unmapped,
> >> +            struct collapse_control *cc, unsigned long enabled_orders)
> >> +{
> >> +    unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> >> +    enum scan_result last_result = SCAN_FAIL;
> >> +    int collapsed = 0;
> >> +    bool alloc_failed = false;
> >> +    unsigned long collapse_address;
> >> +    unsigned int offset = 0;
> >> +    unsigned int order = HPAGE_PMD_ORDER;
> >> +
> >> +    while (offset < HPAGE_PMD_NR) {
> >> +            nr_ptes = 1UL << order;
> >> +
> >> +            if (!test_bit(order, &enabled_orders))
> >> +                    goto next_order;
> >> +
> >> +            max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >> +            nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> >> +                                                  offset + nr_ptes);
> >> +
> >> +            if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >
> > Looks broken for swap PTEs in PMD collapse ...
> >
> > collapse_scan_pmd() allows them up to max_ptes_swap and record them in
> > unmapped, but they don't get a bit in mthp_present_ptes. And then
> > mthp_collapse() does the check above:
>
> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>
>         if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>                 max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>
> But we perform the check a second time.
>
> >
> > nr_occupied_ptes >= nr_ptes - max_ptes_none
> >
> > So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
> > call collapse_huge_page() for PMD order.
> >
> > Shouldn't we account for them in the PMD-order check? Something like:
> >
> > if (is_pmd_order(order))
> >       nr_occupied_ptes += unmapped;

This solution seems good as a temporary fixup, but long-term we may
want something else. I'm still not sure how we plan on supporting
swapin without causing creep. So I'd be ok with adding a fix for
legacy PMD behavior until we know how to handle mTHP creep correctly.

> As an alternative, we could either 1) skip the check there for
> pmd order (as the check was already done); or 2) introduce+maintain
> a bitmap that tracks non-present PTEs.
>
> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>                                                       offset + nr_ptes);
>
> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +               /* Check was already done in the caller. */
> +               if (is_pmd_order(order) ||
> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>                         enum scan_result ret;
>
>                         collapse_address = address + offset * PAGE_SIZE;
>
> 2) would probably be cleanest long-term.

That would be best for future swapin support in mTHP, but I still
don't think it solves the creep issue. Perhaps we could combine the
two bitmaps to determine if it would make the future collapse eligible
again? Not sure, but I'll start thinking about it.

Should I send a fixup for this using Lance's solution? Or does Lance
want to send a patch out with the fixes tag?

>
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH v3 2/3] PM: dpm_watchdog: Allow disabling DPM watchdog by default
From: Tzung-Bi Shih @ 2026-06-09  9:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jonathan Corbet, Greg Kroah-Hartman, Danilo Krummrich, Shuah Khan,
	Pavel Machek, Len Brown, linux-doc, linux-kernel, linux-pm,
	driver-core, tfiga, senozhatsky, Randy Dunlap
In-Reply-To: <CAJZ5v0g4VuR20dF+Zw0b75u4=ajFBOBAKokCoyDBtjCETexK3Q@mail.gmail.com>

On Mon, Jun 08, 2026 at 04:14:09PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 8, 2026 at 4:16 AM Tzung-Bi Shih <tzungbi@kernel.org> wrote:
> >
> > Introduce the CONFIG_DPM_WATCHDOG_DEFAULT_ENABLED Kconfig option to
> > allow the device suspend/resume watchdog (DPM watchdog) to be disabled
> > by default at compile time.
> >
> > Additionally, introduce the "dpm_watchdog_enabled" module parameter to
> > allow the watchdog to be enabled or disabled at boot time (via
> > "power.dpm_watchdog_enabled") and at runtime (via sysfs).
> 
> I think that the new module param is more important because the new
> config option is just its default value, so I'd rearrange the
> changelog.
> 
> Also, I think that the "DEFAULT_" part of the new config option name
> doesn't provide any additional value, so I'd just drop it.

Will fix them in the next version.

^ permalink raw reply
