Linux Trace Kernel
* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Xiaoyao Li @ 2026-06-30 10:55 UTC
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-5-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable, only for (CoCo) VM types
> that might use vm_memory_attributes.
> 
> Also document CONFIG_KVM_VM_MEMORY_ATTRIBUTES to specifically be about the
> private/shared attribute.

I think this patch needs to be moved later in the series, after the
per-gmem shared/private attribute is implemented, because so far TDX/SEV
do depend on CONFIG_KVM_VM_MEMORY_ATTRIBUTES.

Setting aside whether it makes sense to report TDX as a supported VM type
when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is not enabled, this patch simply
breaks the build when

   CONFIG_KVM_VM_MEMORY_ATTRIBUTES = n

and KVM_INTEL_TDX/KVM_AMD_SEV is enabled:

arch/x86/kvm/../../../virt/kvm/guest_memfd.c: In function ‘__kvm_gmem_populate’:
arch/x86/kvm/../../../virt/kvm/guest_memfd.c:918:14: error: implicit declaration of function ‘kvm_range_has_memory_attributes’ [-Werror=implicit-function-declaration]
   918 |         if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
       |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/Kconfig | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 24f96396cfa1c..c28393dc664eb 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -81,13 +81,16 @@ config KVM_WERROR
>   	  If in doubt, say "N".
>   
>   config KVM_VM_MEMORY_ATTRIBUTES
> -	bool
> +	depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
> +	bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
> +	help
> +	  Enable support for tracking PRIVATE vs. SHARED memory using per-VM
> +	  memory attributes.
>   
>   config KVM_SW_PROTECTED_VM
>   	bool "Enable support for KVM software-protected VMs"
>   	depends on EXPERT
>   	depends on KVM_X86 && X86_64
> -	select KVM_VM_MEMORY_ATTRIBUTES
>   	help
>   	  Enable support for KVM software-protected VMs.  Currently, software-
>   	  protected VMs are purely a development and testing vehicle for
> @@ -138,7 +141,6 @@ config KVM_INTEL_TDX
>   	bool "Intel Trust Domain Extensions (TDX) support"
>   	default y
>   	depends on INTEL_TDX_HOST
> -	select KVM_VM_MEMORY_ATTRIBUTES
>   	select HAVE_KVM_ARCH_GMEM_POPULATE
>   	help
>   	  Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -162,7 +164,6 @@ config KVM_AMD_SEV
>   	depends on KVM_AMD && X86_64
>   	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
>   	select ARCH_HAS_CC_PLATFORM
> -	select KVM_VM_MEMORY_ATTRIBUTES
>   	select HAVE_KVM_ARCH_GMEM_PREPARE
>   	select HAVE_KVM_ARCH_GMEM_INVALIDATE
>   	select HAVE_KVM_ARCH_GMEM_POPULATE
> 



* Re: [RFC PATCH v2 3/4] rtla/osnoise: Trace IPI events when recording a trace file
From: Tomas Glozar @ 2026-06-30 11:32 UTC
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <20260617131803.2988989-4-vschneid@redhat.com>

On Wed, Jun 17, 2026 at 15:18, Valentin Schneider
<vschneid@redhat.com> wrote:
>
> IPIs can now be monitored and accounted by osnoise top. When that is
> the case, also record them when saving a trace file.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  tools/tracing/rtla/src/common.c  |  2 +-
>  tools/tracing/rtla/src/common.h  |  2 +-
>  tools/tracing/rtla/src/osnoise.c | 17 ++++++++++++++++-
>  3 files changed, 18 insertions(+), 3 deletions(-)
>

Looks good, the numbers match between what is seen in (non-truncated)
trace and what is detected in RTLA:

$ rtla osnoise top -q -d 5s --ipi --on-end trace,file=/tmp/ipi_trace.txt | awk '/^ *[0-9]/{ print "CPU: " $1 ", IPI count: " $13 }'
CPU: 0, IPI count: 20
CPU: 1, IPI count: 2
CPU: 2, IPI count: 1
CPU: 3, IPI count: 3
CPU: 4, IPI count: 0
CPU: 5, IPI count: 2
CPU: 6, IPI count: 1
CPU: 7, IPI count: 1
CPU: 8, IPI count: 0
CPU: 9, IPI count: 0
CPU: 10, IPI count: 2
CPU: 11, IPI count: 3
CPU: 12, IPI count: 0
CPU: 13, IPI count: 20
$ grep ipi_send_cpumask /tmp/ipi_trace.txt | wc -l
0
$ for cpu in {0..13}; do n=$(grep -F "ipi_send_cpu: cpu=$cpu " /tmp/ipi_trace.txt | wc -l); echo "CPU: $cpu, IPI count: $n"; done
CPU: 0, IPI count: 20
CPU: 1, IPI count: 2
CPU: 2, IPI count: 1
CPU: 3, IPI count: 3
CPU: 4, IPI count: 0
CPU: 5, IPI count: 2
CPU: 6, IPI count: 1
CPU: 7, IPI count: 1
CPU: 8, IPI count: 0
CPU: 9, IPI count: 0
CPU: 10, IPI count: 2
CPU: 11, IPI count: 3
CPU: 12, IPI count: 0
CPU: 13, IPI count: 20

(This is in a VM, with apparently no ipi_send_cpumask events, so I
didn't test that.)

Tomas



* Re: [PATCH v2 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: Jiri Olsa @ 2026-06-30 11:59 UTC
  To: David Windsor
  Cc: mhiramat, oleg, peterz, tglx, mingo, bp, dave.hansen, x86, shuah,
	rick.p.edgecombe, linux-trace-kernel, linux-kselftest,
	linux-kernel
In-Reply-To: <8b5b1c7407b98f31664ad7b6a6faf20d2d4a6cad.1782777969.git.dwindsor@gmail.com>

On Mon, Jun 29, 2026 at 08:13:33PM -0400, David Windsor wrote:
> Uprobe CALL emulation updates the normal user stack, but not the CET user
> shadow stack. The subsequent RET then sees a stale shadow stack entry and
> raises #CP.
> 
> Update the relative CALL emulation and XOL CALL fixup paths to keep the
> shadow stack in sync.
> 
> Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface")
> Signed-off-by: David Windsor <dwindsor@gmail.com>

hi, lgtm

Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>

jirka


> ---
> 
> v2:
>  - propagate error from shstk_update_last_frame() rather than returning
>    -ERESTART in default_post_xol_op(). (Oleg)
> 
> v1: https://lore.kernel.org/all/20260622183109.1137245-1-dwindsor@gmail.com/
> 
>  arch/x86/kernel/uprobes.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..d74bb54543b6 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1246,9 +1246,15 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
>  		long correction = utask->vaddr - utask->xol_vaddr;
>  		regs->ip += correction;
>  	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
> +		unsigned long retaddr = utask->vaddr + auprobe->defparam.ilen;
> +		int err;
> +
>  		regs->sp += sizeof_long(regs); /* Pop incorrect return address */
> -		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
> +		if (emulate_push_stack(regs, retaddr))
>  			return -ERESTART;
> +		err = shstk_update_last_frame(retaddr);
> +		if (err)
> +			return err;
>  	}
>  	/* popf; tell the caller to not touch TF */
>  	if (auprobe->defparam.fixups & UPROBE_FIX_SETF)
> @@ -1338,6 +1344,10 @@ static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
>  		 */
>  		if (emulate_push_stack(regs, new_ip))
>  			return false;
> +		if (shstk_push(new_ip) == -EFAULT) {
> +			regs->sp += sizeof_long(regs);
> +			return false;
> +		}
>  	} else if (!check_jmp_cond(auprobe, regs)) {
>  		offs = 0;
>  	}
> -- 
> 2.53.0
> 


* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Xiaoyao Li @ 2026-06-30 12:19 UTC
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-9-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Introduce function for KVM to check the private/shared status of guest
> memory at a given GFN.
> 
> This will be used in a later patch.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

With binbin's comment resolved,

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

BTW, it would be better to move patch 01 to right after this one.

> ---
>   include/linux/kvm_host.h |  2 ++
>   virt/kvm/guest_memfd.c   | 31 +++++++++++++++++++++++++++++++
>   2 files changed, 33 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3915da2a61778..27687fb9d5201 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2575,6 +2575,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>   #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>   
>   #ifdef CONFIG_KVM_GUEST_MEMFD
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn);
> +
>   int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
>   		     int *max_order);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8101f64e0366f..bca912db5be6e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -510,6 +510,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>   	return 0;
>   }
>   
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> +	struct inode *inode;
> +
> +	/*
> +	 * If this gfn has no associated memslot, there's no chance of the gfn
> +	 * being backed by private memory, since guest_memfd must be used for
> +	 * private memory, and guest_memfd must be associated with some memslot.
> +	 */
> +	if (!slot)
> +		return 0;
> +
> +	CLASS(gmem_get_file, file)(slot);
> +	if (!file)
> +		return 0;
> +
> +	inode = file_inode(file);
> +
> +	/*
> +	 * Rely on the maple tree's internal RCU lock to ensure a
> +	 * stable result. This result can become stale as soon as the
> +	 * lock is dropped, so the caller _must_ still protect
> +	 * consumption of private vs. shared by checking
> +	 * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> +	 * against ongoing attribute updates.
> +	 */
> +	return kvm_gmem_is_private_mem(inode, kvm_gmem_get_index(slot, gfn));
> +}
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_is_private);
> +
>   static struct file_operations kvm_gmem_fops = {
>   	.mmap		= kvm_gmem_mmap,
>   	.open		= generic_file_open,
> 



* [PATCH v10 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

The first entry of error_states[],

	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },

is unreachable.  identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():

  * memory_failure() reaches identify_page_state() only after
    get_hwpoison_page() returned 1.  get_any_page() reaches that
    return only via __get_hwpoison_page(), which only takes a
    refcount when the page is HWPoisonHandlable().
    HWPoisonHandlable() is an allowlist for LRU, free-buddy, and
    (for soft-offline) movable_ops pages -- PG_reserved pages do
    not satisfy any of these, so they fail with -EBUSY/-EIO long
    before identify_page_state() runs.

  * try_memory_failure_hugetlb() reaches identify_page_state() only
    via the MF_HUGETLB_IN_USED branch, where the page is necessarily
    a hugetlb folio.  hugetlb folios don't carry PG_reserved at that
    point: hugetlb_folio_init_vmemmap() calls __folio_clear_reserved()
    during init, so the reserved entry would not match even if it
    were still present.

me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.
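
For readers unfamiliar with the table, here is a minimal userspace sketch
of the first-match lookup. It is not the kernel code: the FLAG_* bits, the
toy_* names and the reduced set of rows are assumptions made up for
illustration, and only the (flags & mask) == res matching rule mirrors
identify_page_state(). It shows why a row is dead once every caller
rejects reserved pages before the lookup runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; the real PG_* values differ. */
#define FLAG_RESERVED	(1UL << 0)
#define FLAG_LRU	(1UL << 1)
#define FLAG_DIRTY	(1UL << 2)

enum toy_state {
	TOY_ST_KERNEL,
	TOY_ST_DIRTY_LRU,
	TOY_ST_CLEAN_LRU,
	TOY_ST_UNKNOWN,
};

struct toy_page_state {
	unsigned long mask;	/* which bits to look at */
	unsigned long res;	/* required value of those bits */
	enum toy_state type;
};

/* First match wins; the all-zero catch-all at the end always matches. */
static const struct toy_page_state toy_states[] = {
	{ FLAG_RESERVED,	 FLAG_RESERVED,		TOY_ST_KERNEL },
	{ FLAG_LRU | FLAG_DIRTY, FLAG_LRU | FLAG_DIRTY,	TOY_ST_DIRTY_LRU },
	{ FLAG_LRU | FLAG_DIRTY, FLAG_LRU,		TOY_ST_CLEAN_LRU },
	{ 0,			 0,			TOY_ST_UNKNOWN },
};

static enum toy_state toy_identify(unsigned long page_flags)
{
	const struct toy_page_state *ps;

	for (ps = toy_states; ; ps++)
		if ((page_flags & ps->mask) == ps->res)
			break;
	return ps->type;
}

/*
 * Models the callers described above: reserved pages are rejected up
 * front, so the lookup can never select the reserved row.
 */
static int toy_memory_failure(unsigned long page_flags, enum toy_state *out)
{
	if (page_flags & FLAG_RESERVED)
		return -1;	/* stands in for the -EBUSY/-EIO failure */

	*out = toy_identify(page_flags);
	assert(*out != TOY_ST_KERNEL);	/* the reserved row is unreachable */
	return 0;
}
```

Calling toy_identify() directly with FLAG_RESERVED still selects the first
row; it is only the gatekeeping in the caller that makes the row
unreachable, which is the situation described above.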

Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro.  Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.

No functional change.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c4055..f4d3e6e20e13f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -980,17 +980,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
 	return false;
 }
 
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
-	unlock_page(p);
-	return MF_IGNORED;
-}
-
 /*
  * Page in unknown state. Do nothing.
  * This is a catch-all in case we fail to make sense of the page state.
@@ -1199,10 +1188,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 #define mlock		(1UL << PG_mlocked)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
-#define reserved	(1UL << PG_reserved)
 
 static struct page_state error_states[] = {
-	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -1234,7 +1221,6 @@ static struct page_state error_states[] = {
 #undef mlock
 #undef lru
 #undef head
-#undef reserved
 
 static void update_per_node_mf_stats(unsigned long pfn,
 				     enum mf_result result)

-- 
2.53.0-Meta



* [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team

A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running.  The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash.  In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.

This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context.  The default is disabled so existing workloads see no
change.

There is a selftest that tests different cases, and I ran it against
the following variants:

  ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
  │ Variant │   PFN    │                          Result                           │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
  ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
  └─────────┴──────────┴───────────────────────────────────────────────────────────┘

Each one shows the same call trace, exactly the path the series builds:

  hard_offline_page_store
    → memory_failure
      → action_result
        → panic("Memory failure: %#lx: unrecoverable page")

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v10:
- Reuse kselftest declarations
- Document why the residual race is harmless
- Link to v9: https://lore.kernel.org/r/20260609-ecc_panic-v9-0-432a74002e74@debian.org

Changes in v9:
- HWPoisonKernelOwned(): wrap the head-page checks in a
  compound_head() recheck loop so a concurrent split or compound free
  cannot leave us trusting a stale view (Miaohe, Lance, David).
- selftest: drop the gawk-only strtonum() in hwpoison-panic.sh; do the
  hex parsing with a small index()-based helper so the test no longer
  spuriously skips itself on mawk-based distros (Sashiko).
- selftest: move hwpoison-panic.sh from TEST_FILES to
  TEST_PROGS_EXTENDED so the script is installed executable rather
  than as a non-executable data file (Sashiko).
- Link to v8: https://patch.msgid.link/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org

Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org

Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
  get_any_page() and surface it via -ENOTRECOVERABLE, per David
  Hildenbrand's and Lance Yang's review of v6.  This drops the
  is_reserved snapshot in memory_failure() and the mf_get_page_status
  enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
  over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
  matching tracepoint string; the enum now covers both PG_reserved
  pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
  and 2/4 ("classify get_any_page() failures by reason") into a
  single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
  (David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org

Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org

Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
  unrecoverable kernel page hwpoison events (reserved pages, refcount-0
  non-buddy pages, unknown state), with a recheck to avoid racing with
  concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
  document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
  and add a selftest verifying SIGBUS recovery on userspace pages still
  works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
  https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org

Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (6):
      mm/memory-failure: drop dead error_states[] entry for reserved pages
      mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
      mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl
      selftests/mm: add hwpoison-panic destructive test

 Documentation/admin-guide/sysctl/vm.rst      |  80 +++++++++
 mm/memory-failure.c                          | 106 +++++++++--
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 255 +++++++++++++++++++++++++++
 4 files changed, 427 insertions(+), 18 deletions(-)
---
base-commit: 30ffa8de54e5cc80d93fd211ca134d1764a7011f
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* [PATCH v10 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, page tables,
large-kmalloc).  memory_failure() still folds every negative return
into MF_MSG_GET_HWPOISON, so callers that want to react to the
unrecoverable cases (a panic option, smarter logging) cannot tell
them apart from transient page-allocator races.

Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.
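
The mapping can be sketched in userspace as follows. This is an
illustration only: the toy_* enum and function names are hypothetical,
and the real code calls action_result() and jumps to unlock_mutex rather
than returning a label.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels standing in for the four outcomes of the switch. */
enum toy_outcome {
	TOY_OUT_FREE_OR_HIGH_ORDER,	/* case 0: existing free-buddy / high-order path */
	TOY_OUT_CONTINUE,		/* case 1: refcount taken, keep going */
	TOY_OUT_MSG_KERNEL,		/* -ENOTRECOVERABLE: stable kernel-owned page */
	TOY_OUT_MSG_GET_HWPOISON,	/* any other negative errno: transient race */
};

static enum toy_outcome toy_classify(int res)
{
	switch (res) {
	case 0:
		return TOY_OUT_FREE_OR_HIGH_ORDER;
	case 1:
		return TOY_OUT_CONTINUE;
	case -ENOTRECOVERABLE:
		return TOY_OUT_MSG_KERNEL;
	default:
		/* The sketch assumes only 0, 1 or a negative errno is passed in. */
		assert(res < 0);
		return TOY_OUT_MSG_GET_HWPOISON;
	}
}
```

With this shape, -EIO and -EBUSY both land in the default arm, while only
-ENOTRECOVERABLE is reported as a kernel page.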

The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, page tables and large-kmalloc
pages.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 087658484e242..5fc3de474014d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2436,7 +2436,8 @@ int memory_failure(unsigned long pfn, int flags)
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
 	res = get_hwpoison_page(p, flags);
-	if (!res) {
+	switch (res) {
+	case 0:
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -2455,7 +2456,19 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
-	} else if (res < 0) {
+	case 1:
+		/* Got a refcount on a handlable page. */
+		break;
+	case -ENOTRECOVERABLE:
+		/*
+		 * Stable unhandlable kernel-owned page (PG_reserved,
+		 * slab, page tables, large-kmalloc).
+		 * No recovery possible.
+		 */
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
+	default:
+		/* Transient lifecycle race with the page allocator. */
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;
 	}

-- 
2.53.0-Meta



* [PATCH v10 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

get_any_page() collapses every HWPoisonHandlable() rejection into a
single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
-> retry path.  That is correct for the transient case (a userspace
folio briefly off LRU during migration or compaction, which a later
shake can drag back), but wrong for stable kernel-owned pages: slab,
page-table, large-kmalloc and PG_reserved pages will never become
HWPoisonHandlable(), so the retry loop is wasted work and the final
-EIO loses the "this is structurally unrecoverable" information.
memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
panic-on-unrecoverable sysctl deliberately does not act on.

Introduce is_kernel_owned_page(), a small predicate that positively
identifies pages the hwpoison handler cannot recover from:

  is_kernel_owned_page(p) :=
      PageReserved(p) ||
      PageSlab(head) || PageTable(head) || PageLargeKmalloc(head)

  where head = compound_head(p).

PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
page directly.  The slab, page-table and large-kmalloc page-type bits
are only stored on the head page, so those tests resolve the compound
head first, then re-read compound_head(page) afterwards: a concurrent
split or compound free that moves head invalidates the just-read flags
and the loop retries.  The lookup still takes no refcount, mirroring
the rest of get_any_page(); the recheck closes the common split race,
and a residual free->alloc->free in the same window can only mis-tag
a genuinely poisoned page, never reclassify a handlable one.
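
The read-then-recheck shape can be sketched in userspace with C11
atomics. This is a toy built on assumptions: struct toy_page, the TOY_*
bits and the atomic head pointer are hypothetical stand-ins for struct
page, the page-type bits and compound_head(). It only illustrates the
retry pattern, not the kernel's memory-ordering details.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel encodes these differently. */
#define TOY_SLAB		(1u << 0)
#define TOY_TABLE		(1u << 1)
#define TOY_LARGE_KMALLOC	(1u << 2)
#define TOY_RESERVED		(1u << 3)

struct toy_page {
	/* Points to itself for a head or order-0 page, to the head for a tail. */
	_Atomic(struct toy_page *) head;
	atomic_uint flags;
};

static bool toy_is_kernel_owned(struct toy_page *page)
{
	struct toy_page *head;
	bool kernel_owned;

	/* Per-page flag: tested on the page itself, no head lookup needed. */
	if (atomic_load(&page->flags) & TOY_RESERVED)
		return true;

	do {
		head = atomic_load(&page->head);
		assert(head);	/* toy invariant: every page has a head */
		kernel_owned = atomic_load(&head->flags) &
			       (TOY_SLAB | TOY_TABLE | TOY_LARGE_KMALLOC);
		/* If the head moved while we looked, the flags are stale: retry. */
	} while (head != atomic_load(&page->head));

	return kernel_owned;
}
```

No reference is taken at any point, so the result is only a snapshot; the
recheck merely ensures the flags and the head pointer were observed
together.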

No MF_SOFT_OFFLINE / page_has_movable_ops() opt-out is needed: a
movable_ops page is always PageOffline or PageZsmalloc, whose
page_type is mutually exclusive with slab, page-table and
large-kmalloc, and it never carries PG_reserved, so it can never
match any of the checks above.

The list is intentionally not exhaustive.  vmalloc and kernel-stack
pages, for example, do not carry a page_type bit and would need a
different oracle; they keep going through the existing retry path
unchanged.  This is the smallest set we can identify with certainty
by page type.

Wire the helper into the top of get_any_page() to short-circuit
those pages before the retry loop runs.  On a hit, drop the caller's
MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
straight away.  Pages outside the helper's positive list still take
the existing retry path and return -EIO, leaving operator-visible
behaviour for those cases unchanged.

Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc to document the new return.

memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch on its own only changes the errno that soft_offline_page()
can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
through memory_failure() and reports MF_MSG_KERNEL for the
unrecoverable cases, which is what the
panic_on_unrecoverable_memory_failure sysctl observes.

Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f4d3e6e20e13f..087658484e242 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1325,6 +1325,38 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 	return PageLRU(page) || is_free_buddy_page(page);
 }
 
+/*
+ * Positive identification of pages the hwpoison handler cannot recover:
+ * pages owned by kernel internals with no userspace mapping to unmap, no
+ * file mapping to invalidate, and no migration target.
+ */
+static inline bool is_kernel_owned_page(struct page *page)
+{
+	struct page *head;
+	bool kernel_owned;
+
+	/* PG_reserved is a per-page flag, never set on a compound page. */
+	if (PageReserved(page))
+		return true;
+
+	/*
+	 * Page-type bits live only on the head page, so resolve any tail
+	 * first.  The check takes no refcount; recheck the head afterwards
+	 * so a concurrent split or compound free cannot leave us trusting
+	 * a stale view.  A residual free->alloc->free cannot be closed here
+	 * (frozen slab and large-kmalloc pages cannot be pinned), but is
+	 * harmless: where a wrong verdict could panic, memory_failure() has
+	 * already set PageHWPoison, which bars the page from the allocator.
+	 */
+retry:
+	head = compound_head(page);
+	kernel_owned = PageSlab(head) || PageTable(head) ||
+		       PageLargeKmalloc(head);
+	if (head != compound_head(page))
+		goto retry;
+	return kernel_owned;
+}
+
 static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct folio *folio = page_folio(page);
@@ -1371,6 +1403,19 @@ static int get_any_page(struct page *p, unsigned long flags)
 	if (flags & MF_COUNT_INCREASED)
 		count_increased = true;
 
+	/*
+	 * Page types we know are kernel-owned and cannot be recovered.
+	 * Short-circuit before the shake_page() / retry loop, which
+	 * cannot turn any of these into something HWPoisonHandlable().
+	 * Drop the caller's reference if MF_COUNT_INCREASED took one.
+	 */
+	if (is_kernel_owned_page(p)) {
+		if (count_increased)
+			put_page(p);
+		ret = -ENOTRECOVERABLE;
+		goto out;
+	}
+
 try_again:
 	if (!count_increased) {
 		ret = __get_hwpoison_page(p, flags);
@@ -1418,7 +1463,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 		ret = -EIO;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
 
 	return ret;
@@ -1475,7 +1520,10 @@ static int __get_unpoison_page(struct page *page)
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
  *         operations like allocation and free,
- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ *         -ENOTRECOVERABLE for kernel-owned pages identified by
+ *         is_kernel_owned_page() (PG_reserved, slab,
+ *         page-table, large-kmalloc) that the handler cannot recover.
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {

-- 
2.53.0-Meta



* [PATCH v10 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
default) that triggers a kernel panic when memory_failure()
encounters pages that cannot be recovered.  This provides a clean
crash with useful debug information rather than allowing silent
data corruption or a delayed crash at an unrelated code path.

Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
large-kmalloc).

All other action types are excluded:

- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
  transient refcount races with the page allocator (an in-flight buddy
  allocation briefly has refcount 0 while no longer on the buddy free
  list), and panicking on them would risk killing the box for what is
  actually a recoverable userspace page.

- MF_MSG_UNKNOWN means identify_page_state() could not classify the
  page; that is precisely the wrong basis for a panic decision.
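
The eligibility rule above can be modelled outside the kernel as a
small truth table.  The sketch below is a toy bash model, not kernel
code: should_panic() and its string arguments are hypothetical
stand-ins for the sysctl value and the mf_action_page_type /
mf_result enums.

```shell
#!/bin/bash
# Toy model of the panic eligibility rule (illustration only).
# should_panic SYSCTL TYPE RESULT
#   SYSCTL - "1" when the sysctl is enabled, anything else when not
#   TYPE   - hypothetical string spelling of the action page type
#   RESULT - hypothetical string spelling of the action result
# Returns 0 (true) only for the single eligible combination.
should_panic() {
	local sysctl_enabled=$1 type=$2 result=$3

	# Sysctl off: never panic, whatever the classification.
	[ "$sysctl_enabled" = "1" ] || return 1

	# Only MF_MSG_KERNEL with MF_IGNORED is eligible.
	[ "$type" = "MF_MSG_KERNEL" ] && [ "$result" = "MF_IGNORED" ]
}
```

Keeping the test to one (type, result) pair means any new action type
stays non-panicking until it is opted in explicitly.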

Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5fc3de474014d..e097fc8262cf8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	if (!sysctl_panic_on_unrecoverable_mf)
+		return false;
+
+	return type == MF_MSG_KERNEL && result == MF_IGNORED;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1272,6 +1292,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 

-- 
2.53.0-Meta



* [PATCH v10 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index b9b0c218bfb44..22cc54cac3b21 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,85 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting kernel-owned pages that the handler cannot recover:
+``PageReserved`` (firmware reservations, kernel image, vDSO, zero
+page, and similar memblock-reserved regions), ``PageSlab``,
+``PageTable``, and ``PageLargeKmalloc``.  These are owned by the
+kernel and the memory failure handler cannot reliably evict their
+contents.
+
+Other unrecoverable kernel-owned populations (vmalloc allocations,
+kernel stack pages, ...) are not currently covered because the
+handler has no page-type signal that distinguishes them from a
+userspace folio temporarily off the LRU during migration or
+compaction.  Such pages still go through the standard
+MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them and a
+delayed crash on the next access remains possible.  Coverage may
+grow as the handler gains stronger kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies.  Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+  regularly and post-mortem analysis of an unrelated downstream crash
+  (often seconds to minutes after the original error) consumes
+  significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+  hardware error produces a vmcore that still contains the faulting
+  address, the affected page state, and the originating MCE/GHES
+  record — context that is typically lost by the time a delayed crash
+  occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+  failure for failover, and prefer an immediate panic over silent data
+  corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+  tools such as ``mce-inject`` or error-injection debugfs interfaces,
+  where panicking on the unrecoverable path makes regressions
+  immediately visible instead of surfacing as later, unrelated
+  failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.53.0-Meta



* [PATCH v10 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-06-30 12:46 UTC
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260630-ecc_panic-v10-0-c6ed5b62eea2@debian.org>

Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.

Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):

  rodata  - a PG_reserved page in the kernel rodata range, sourced
            from the "Kernel rodata" sub-resource of "System RAM" in
            /proc/iomem.  That entry is reported on every major
            architecture and guarantees the chosen PFN is backed by
            struct page (an online System RAM range, not a firmware
            hole), is PG_reserved, and is read-only -- so even if
            the panic fails to fire for some reason, the resulting
            PG_hwpoison marker on rodata does not corrupt writable
            kernel state.

  slab    - a slab page found by walking /proc/kpageflags for the
            first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
            / KPF_COMPOUND_TAIL clear).  Exercises the get_any_page()
            path on a non-PG_reserved kernel-owned page and so
            catches regressions where get_any_page() collapses
            kernel-owned pages into a transient -EIO instead of
            -ENOTRECOVERABLE.

  pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.

PageLargeKmalloc, the fourth page type matched by
is_kernel_owned_page(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible.  The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.
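
The PFN filter described for the slab and pgtable kinds can be
sketched on a single /proc/kpageflags word.  This is a simplified
illustration rather than the script's own helper: kpf_bit_set() and
usable_target() are hypothetical names, the word is passed as a hex
string without a 0x prefix, and bash arithmetic is used where the
script avoids it inside awk for mawk compatibility.  The bit
positions are the ones the script defines.

```shell
#!/bin/bash
# /proc/kpageflags bit positions, as used by the selftest.
KPF_COMPOUND_TAIL=16
KPF_HWPOISON=19
KPF_NOPAGE=20

# kpf_bit_set HEXWORD BIT
# Return 0 if BIT is set in the 64-bit word given as a hex string
# (no 0x prefix), non-zero otherwise.
kpf_bit_set() {
	local word=$1 bit=$2

	(( (16#$word >> bit) & 1 ))
}

# usable_target HEXWORD WANT_BIT
# Return 0 if WANT_BIT is set and the page is not a compound tail,
# not already hwpoisoned, and backed by a struct page.
usable_target() {
	local word=$1 want=$2 b

	kpf_bit_set "$word" "$want" || return 1
	for b in "$KPF_COMPOUND_TAIL" "$KPF_HWPOISON" "$KPF_NOPAGE"; do
		kpf_bit_set "$word" "$b" && return 1
	done
	return 0
}
```

For example, a word of 0000000000000080 has only bit 7 (KPF_SLAB) set
and is accepted for the slab kind, while 0000000000010080 also has
KPF_COMPOUND_TAIL set and is rejected.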

The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page.  A
successful run crashes the kernel with

  Memory failure: <pfn>: unrecoverable page

A return from the inject means no panic fired.  Before reporting, the
script restores the sysctl and best-effort unpoisons the target PFN
through the hwpoison debugfs interface (hard_offline_page() injects
with MF_SW_SIMULATED, so the page stays unpoisonable), then re-reads
/proc/kpageflags: a PFN that is still the kernel-owned type it selected
is a genuine failure, while one that raced to a different type before
the inject is skipped as inconclusive.  Test outcome is therefore
observed externally (serial console, kdump) rather than from the
script's own exit code.
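
The post-inject decision can be summarised as a three-way mapping
from the page-type recheck to a verdict.  The following is a minimal
sketch under stated assumptions: classify_no_panic() and its output
tokens are hypothetical, and the input is the 0/1/2 status described
above (bit still set, bit clear, word unreadable).

```shell
#!/bin/bash
# classify_no_panic RECHECK_STATUS
# Map the result of re-reading the page type, after an inject that
# returned without a panic, to a verdict printed on stdout:
#   0 (still the selected type)  -> fail
#   1 (type changed)             -> skip
#   other (could not re-read)    -> fail-unconfirmed
classify_no_panic() {
	case $1 in
	0)	echo "fail" ;;
	1)	echo "skip" ;;
	*)	echo "fail-unconfirmed" ;;
	esac
}
```

Treating an unreadable recheck as a failure rather than a skip is the
conservative choice: a missing panic is only excused when the race is
positively observed.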

The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes.  It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic.  The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.
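
The opt-in check amounts to a one-line gate on the environment.  The
sketch below isolates it in a function for illustration only;
destructive_allowed() is a hypothetical name, and the script performs
the same comparison inline.

```shell
#!/bin/bash
# destructive_allowed
# Return 0 only when the caller explicitly set RUN_DESTRUCTIVE=1.
# An unset or empty variable is treated as "0", so the check is
# also safe under "set -u".
destructive_allowed() {
	[ "${RUN_DESTRUCTIVE:-0}" = "1" ]
}
```

A manual run inside a disposable VM would then look like
"RUN_DESTRUCTIVE=1 ./hwpoison-panic.sh slab", with one kind per boot.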

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/mm/Makefile          |   4 +
 tools/testing/selftests/mm/hwpoison-panic.sh | 255 +++++++++++++++++++++++++++
 2 files changed, 259 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971c..ed321ae709dac 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
 TEST_PROGS += ksft_vma_merge.sh
 TEST_PROGS += ksft_vmalloc.sh
 
+# Destructive: every successful run panics the kernel.  Installed and
+# kept executable, but not run from a default kselftest invocation.
+TEST_PROGS_EXTENDED += hwpoison-panic.sh
+
 TEST_FILES := test_vmalloc.sh
 TEST_FILES += test_hmm.sh
 TEST_FILES += va_high_addr_switch.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 0000000000000..d953d13673324
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,255 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+#   rodata  - a PG_reserved page in the kernel rodata range
+#             (sourced from /proc/iomem "Kernel rodata").  Exercises
+#             memory_failure() -> get_any_page() on a PageReserved page.
+#
+#   slab    - a slab page found via /proc/kpageflags (KPF_SLAB).
+#             Exercises memory_failure() -> get_any_page() on a non
+#             PG_reserved kernel-owned page.  This path is what catches
+#             regressions where get_any_page() collapses kernel-owned
+#             pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+#   pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+#             Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel.  It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness.  It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+#   "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means no panic fired: that is a failure,
+# unless the target PFN raced to a different page type before injection,
+# in which case the run is inconclusive and is skipped.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+# KTAP output helpers (ktap_print_msg, ktap_skip_all, ktap_exit_fail_msg, ...).
+DIR="$(dirname "$(readlink -f "$0")")"
+# shellcheck source=../kselftest/ktap_helpers.sh
+source "${DIR}"/../kselftest/ktap_helpers.sh
+
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+unpoison_path=/sys/kernel/debug/hwpoison/unpoison-pfn
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+KPF_RESERVED=32
+
+pagesize=$(getconf PAGE_SIZE)
+
+kind=${1:-rodata}
+
+if [ "$(id -u)" -ne 0 ]; then
+	ktap_skip_all "must run as root"
+	exit "$KSFT_SKIP"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+	ktap_skip_all "$sysctl_path not present (kernel without the sysctl?)"
+	exit "$KSFT_SKIP"
+fi
+
+if [ ! -w "$inject_path" ]; then
+	ktap_skip_all "$inject_path not present (no MEMORY_HOTPLUG?)"
+	exit "$KSFT_SKIP"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+	ktap_skip_all "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+	exit "$KSFT_SKIP"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+#   - the PFN is backed by struct page (within an online memory range);
+#   - PG_reserved is set on the page (kernel image area);
+#   - the memory is read-only, so setting PG_hwpoison on it does not
+#     corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+#     "  02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+	awk -v pagesize="$(getconf PAGE_SIZE)" '
+	# Convert a hex string to a number without relying on the gawk-only
+	# strtonum().  mawk lacks it and would otherwise spuriously skip
+	# this test on distros that ship mawk as /usr/bin/awk.
+	function hex2num(s,   n, i, c, v) {
+		n = 0
+		for (i = 1; i <= length(s); i++) {
+			c = tolower(substr(s, i, 1))
+			v = index("0123456789abcdef", c) - 1
+			if (v < 0)
+				return -1
+			n = n * 16 + v
+		}
+		return n
+	}
+	/: Kernel rodata[[:space:]]*$/ {
+		sub(/^[[:space:]]+/, "")
+		n = split($0, a, /[- ]/)
+		start = hex2num(a[1])
+		end   = hex2num(a[2])
+		if (end <= start)
+			next
+		# Page-align upward and emit the first byte of that page.
+		pfn = int((start + pagesize - 1) / pagesize)
+		printf "0x%x\n", pfn * pagesize
+		exit 0
+	}
+	' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+	local want_bit=$1
+	local pagesize skip_pfn
+
+	[ -r "$kpageflags_path" ] || return
+
+	pagesize=$(getconf PAGE_SIZE)
+	skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+	od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+	awk -v want_bit="$want_bit" \
+	    -v hwp_bit="$KPF_HWPOISON" \
+	    -v nopage_bit="$KPF_NOPAGE" \
+	    -v tail_bit="$KPF_COMPOUND_TAIL" \
+	    -v base_pfn="$skip_pfn" \
+	    -v pagesize="$pagesize" '
+	# Test whether bit "b" is set in the 16-hex-digit value "hex".
+	# Done with substring + per-digit lookup so we never rely on awk
+	# bitwise operators (mawk lacks them), 64-bit FP precision or the
+	# gawk-only strtonum().
+	function bit_set(hex, b,    di, bi, c, v) {
+		di = int(b / 4)
+		bi = b - di * 4
+		c = substr(hex, length(hex) - di, 1)
+		v = index("0123456789abcdef", tolower(c)) - 1
+		if (bi == 0) return (v % 2) == 1
+		if (bi == 1) return int(v / 2) % 2 == 1
+		if (bi == 2) return int(v / 4) % 2 == 1
+		return int(v / 8) % 2 == 1
+	}
+	{
+		gsub(/^[[:space:]]+/, "")
+		h = $1
+		if (bit_set(h, want_bit) &&
+		    !bit_set(h, hwp_bit) &&
+		    !bit_set(h, nopage_bit) &&
+		    !bit_set(h, tail_bit)) {
+			pfn = base_pfn + NR - 1
+			printf "0x%x\n", pfn * pagesize
+			exit 0
+		}
+	}
+	'
+}
+
+# Return 0 if /proc/kpageflags bit $2 is set for PFN $1, 1 if it is
+# clear, or 2 if the word cannot be read.  Used to re-confirm the target
+# page type after a non-panicking inject.
+kpageflags_bit_set() {
+	local word
+
+	word=$(od -An -tx8 -v -j "$(($1 * 8))" -N 8 "$kpageflags_path" 2>/dev/null | tr -d '[:space:]')
+	[ -n "$word" ] || return 2
+	(( (16#$word >> $2) & 1 ))
+}
+
+# Best-effort: drop the PG_hwpoison marker set by the inject so a failed
+# run does not leave a poisoned page behind.  hard_offline_page() injects
+# with MF_SW_SIMULATED, so the page stays unpoisonable through the
+# hwpoison debugfs interface (needs CONFIG_HWPOISON_INJECT + debugfs).
+try_unpoison() {
+	[ -w "$unpoison_path" ] || return 0
+	echo "$1" > "$unpoison_path" 2>/dev/null || true
+}
+
+case "$kind" in
+rodata)
+	phys_addr=$(pick_rodata_phys_addr)
+	recheck_bit=$KPF_RESERVED
+	missing_msg='no "Kernel rodata" entry in /proc/iomem'
+	;;
+slab)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+	recheck_bit=$KPF_SLAB
+	missing_msg="no usable slab PFN found in $kpageflags_path"
+	;;
+pgtable)
+	phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+	recheck_bit=$KPF_PGTABLE
+	missing_msg="no usable page-table PFN found in $kpageflags_path"
+	;;
+*)
+	ktap_exit_fail_msg "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+	;;
+esac
+
+if [ -z "$phys_addr" ]; then
+	ktap_skip_all "$missing_msg"
+	exit "$KSFT_SKIP"
+fi
+
+ktap_print_msg "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ktap_exit_fail_msg "failed to enable sysctl"
+
+pfn=$((phys_addr / pagesize))
+ktap_print_msg "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (pfn 0x$(printf '%x' "$pfn"), kind=$kind)"
+ktap_print_msg "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# A successful run never returns from the inject -- it panics the kernel.
+# Reaching the code below therefore means no panic fired.  Note whether
+# the write itself succeeded, then put the machine back: restore the
+# sysctl and best-effort unpoison the page we just marked.
+if echo "$phys_addr" > "$inject_path"; then
+	verdict="inject returned without panic; sysctl ineffective"
+else
+	verdict="inject failed before reaching the panic path"
+fi
+
+echo "$prior" > "$sysctl_path"
+try_unpoison "$pfn"
+
+# The page type can change between selection and injection (e.g. a slab
+# or page-table page is freed and reused).  Only treat a missing panic as
+# a failure if the target PFN is still the kernel-owned type we aimed at;
+# if it raced to another type the run is inconclusive, so skip instead.
+kpageflags_bit_set "$pfn" "$recheck_bit"
+case $? in
+0)	ktap_exit_fail_msg "$verdict (page still $kind)" ;;
+1)	ktap_skip_all "target PFN no longer $kind; raced before inject, inconclusive"
+	exit "$KSFT_SKIP" ;;
+*)	ktap_exit_fail_msg "$verdict (could not reconfirm page type via $kpageflags_path)" ;;
+esac

-- 
2.53.0-Meta



* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Sean Christopherson @ 2026-06-30 13:06 UTC
  To: Xiaoyao Li
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <6b1f0c77-f059-4f8d-8f46-443b944c59a0@intel.com>

On Tue, Jun 30, 2026, Xiaoyao Li wrote:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> >   arch/x86/include/asm/kvm_host.h | 4 +++-
> >   include/linux/kvm_host.h        | 2 +-
> >   2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >   		       int tdp_max_root_level, int tdp_huge_page_level);
> > -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> > +#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||	\
> > +	defined(CONFIG_KVM_INTEL_TDX) ||	\
> > +	defined(CONFIG_KVM_AMD_SEV)
> 
> Maybe we can just remove the #ifdef and make it always available?

No, because common KVM keys off the macro to determine whether or not PRIVATE is
a supported attribute:

  #ifdef kvm_arch_has_private_mem
  static u64 kvm_supports_private_mem(struct kvm *kvm)
  {
	return !kvm || kvm_arch_has_private_mem(kvm);
  }
  #else
  #define kvm_supports_private_mem(kvm) false
  #endif

And also whether or not to provide the in-place conversion param (without PRIVATE,
conversions aren't supported in general):

  #ifdef kvm_arch_has_private_mem
  bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
  module_param(gmem_in_place_conversion, bool, 0444);
  EXPORT_SYMBOL_FOR_KVM_INTERNAL(gmem_in_place_conversion);
  #endif

I agree the #ifdeffery is ugly, but kvm_supports_private_mem() in particular
needs to evaluate to false if PRIVATE memory isn't supported.


* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Sean Christopherson @ 2026-06-30 13:27 UTC
  To: Yan Zhao
  Cc: Ackerley Tng, aik@amd.com, andrew.jones@linux.dev,
	binbin.wu@linux.intel.com, brauner@kernel.org,
	chao.p.peng@linux.intel.com, david@kernel.org,
	jmattson@google.com, jthoughton@google.com, michael.roth@amd.com,
	oupton@kernel.org, pankaj.gupta@amd.com, qperret@google.com,
	Rick P Edgecombe, rientjes@google.com, shivankg@amd.com,
	steven.price@arm.com, tabba@google.com, willy@infradead.org,
	wyihan@google.com, forkloop@google.com, pratyush@kernel.org,
	suzuki.poulose@arm.com, aneesh.kumar@kernel.org,
	liam@infradead.org, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-mm@kvack.org, linux-coco@lists.linux.dev
In-Reply-To: <akMoFqj/8Af2i/Al@yzhao56-desk.sh.intel.com>

On Tue, Jun 30, 2026, Yan Zhao wrote:
> On Tue, Jun 30, 2026 at 08:35:49AM +0800, Sean Christopherson wrote:
> > Gah, I thought I had sent this out this morning, long before Ackerley's response.
> > But I got distracted by a meeting and forgot to get back to this... *sigh*
> > 
> > Sending what I already wrote, even though there's a lot of overlap with Ackerley's
> > mail.
> > 
> > On Mon, Jun 29, 2026, Yan Zhao wrote:
> > > On Fri, Jun 26, 2026 at 08:28:32AM -0700, Ackerley Tng wrote:
> > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> > > > > source_addr (not from gmem), I'm not sure if it's good for the kernel to silently
> > > > > treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
> > > > >
> > > > 
> > > > I'd say the original uAPI perhaps just didn't document 0 as an
> > > > unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
> > > > was perhaps accidentally changed and no customer complained, I think we
> > > > can move forward with 0 as an invalid src_address? I wouldn't think
> > > > anyone relies on 0 intentionally being a valid address.
> > > > 
> > > > I could document that, if it helps?
> > > What about just documenting that 0 is an unsupported uaddr which will be
> > > re-purposed as an indicator to use the target pfn as the source, regardless of
> > > whether gmem_in_place_conversion is true? i.e.,
> > > 
> > > if (!src_page) 
> > > 	src_page = pfn_to_page(pfn);
> > 
> > Because KVM can't generally use the target page as the source without in-place
> > conversion, it's not supported today, and out-of-place conversion is being
> > deprecated.
> By "out-of-place conversion", do you mean using per-VM memory attribute
> conversion?

Yep, I couldn't come up with a better description.

> > > I don't get why the two scenarios should be treated differently:
> > > 1. gmem_in_place_conversion==true, shared memory is not from gmem 
> > > 2. gmem_in_place_conversion==false, shared memory is not from gmem
> > > 
> > > In both cases, a 0 uaddr could be mapped to a valid page not from gmem.
> > 
> > That's immaterial.  KVM's ABI (that we're solidifying) is that an address of '0'
> > for the source means NULL.  The fact that userspace could have a valid mapping
> > at virtual address '0' is irrelevant.
> So, I'm wondering if we can document that 0 uaddr could always mean using target
> PFN.

I would document it as saying "no source page", and then state that a source page
is required if in-place conversion isn't enabled/supported/allowed.
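
To make that wording concrete, here is a minimal sketch of the rule as plain
C. It is not KVM code: resolve_populate_src(), enum populate_src and the
-EINVAL return are hypothetical names and choices made up for illustration,
and it assumes a non-zero source address is still copied from userspace when
in-place conversion is enabled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of resolving the populate source; not a KVM type. */
enum populate_src {
	POPULATE_SRC_USER,	/* copy from the userspace source address */
	POPULATE_SRC_IN_PLACE,	/* no source page: use the target page as-is */
};

/*
 * Returns 0 and sets *src on success, or -EINVAL (an assumed error code) if
 * no source page was given and in-place conversion is not available.
 */
static int resolve_populate_src(uint64_t src_uaddr, bool in_place_conversion,
				enum populate_src *src)
{
	if (src_uaddr) {
		*src = POPULATE_SRC_USER;
		return 0;
	}

	/* An address of 0 means "no source page", even if 0 is mapped. */
	if (!in_place_conversion)
		return -EINVAL;

	*src = POPULATE_SRC_IN_PLACE;
	return 0;
}
```

The point of the sketch is that 0 never gets dereferenced: it is treated
purely as "no source page", and the only open question is whether that is
acceptable for the current configuration.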

> i.e., for both scenarios 1 and 2, as long as 0 uaddr is specified, we always
> use target PFN as source for in-place add.
> 
> > Again, just because something is technically possible doesn't mean it needs to
> > be supported by every piece of KVM's uAPI.
> > 
> > > So why not update the uAPI to handle both cases consistently? :)
> > 
> > Because retroactively adding support for out-of-place conversion is pointless
> > (requires a userspace update for a feature that's being deprecated), KVM can't
> > generally support using the source for out-of-place conversion (it's effectively
> > an obscure zero-page optimization), and IMO rejecting the out-of-place conversion
> > scenario is valuable for KVM developers, e.g. to help newcomers understand what
> > exactly is and isn't possible.
> Ok. You mean per-VM memory attributes are being deprecated, and source pages
> from a !gmem backend are also being deprecated, so we don't want to change the
> uAPI for scenarios under gmem_in_place_conversion==false. Right?

Right.

> 
> > Side topic, isn't TDX broken if target page has already been added to the TD?
> > IIUC, kvm_tdp_mmu_map_private_pfn() will be a glorified nop due to the page
> > already having a valid S-EPT mapping, and so KVM will incorrectly allow a double
> Not sure if I understand out-of-place conversion correctly.
> Given target PFNs and GFNs are not duplicated, what would cause double add? :)

I was working through what would happen if userspace did KVM_TDX_INIT_MEM_REGION
on the same target page multiple times.

> 
> > add.  Ahhh, no, because KVM will return RET_PF_SPURIOUS and
> > kvm_tdp_mmu_map_private_pfn() will then return -EIO.
> My question was whether we could document that uaddr always means using target PFN, since
> TDX's in-place add does not rely on gmem in-place conversion.

Yeah, I was on a tangent, ignore everything from "Side topic" on.

^ permalink raw reply

* Re: [PATCH v8 11/46] KVM: Consolidate private memory and guest_memfd ifdeffery in kvm_host.h
From: Xiaoyao Li @ 2026-06-30 13:59 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-11-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Move the kvm_arch_has_private_mem() stub and a few guest_memfd function
> definitions/declarations "down" in kvm_host.h to utilize existing #ifdefs,
> and so that related code is clustered together.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

^ permalink raw reply

* Re: [RFC PATCH v2 1/4] rtla/osnoise: Add IPI tracking cmdline option
From: Valentin Schneider @ 2026-06-30 13:59 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <CAP4=nvRiOdiLA-O8RLBwY+db9__vV+i-Qw3uofOZzjONV+kzYQ@mail.gmail.com>

On 29/06/26 12:51, Tomas Glozar wrote:
> On Wed, 17 Jun 2026 at 15:18, Valentin Schneider
> <vschneid@redhat.com> wrote:
>> @@ -305,6 +305,9 @@ static int opt_filter_cb(const struct option *opt, const char *arg, int unset)
>>         "the minimum delta to be considered a noise", \
>>         opt_llong_callback)
>>
>> +#define OSNOISE_OPT_IPI OPT_BOOLEAN('i', "ipi", &params->common.ipi, \
>> +       "track sources of IPIs")
>> +
>
> As IPI tracking is not a commonly used functionality, unlike e.g.
> "-p/--period", and -i is already a different option for timerlat tools
> (-i/--irq), I'd suggest keeping just the long option, --ipi, like I
> did for --on-threshold/--on-end (on Arnaldo's suggestion based on his
> experience from perf [1]). This will make it clear to the user that the
> option means "IPI detection" and not something else beginning with the letter
> "i". We can always add a short option later if its use becomes common.
>
> [1] https://lore.kernel.org/linux-trace-kernel/aEmWyPqQw2Ly7Jlu@x1/
>

Makes sense to me!

>> [truncated]
>
> Tomas


^ permalink raw reply

* Re: [RFC PATCH v2 2/4] rtla/osnoise: Record IPI count in osnoise top
From: Valentin Schneider @ 2026-06-30 13:59 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <CAP4=nvRtF=Hq3rVS5jCwA-m_vJWEeROy=60T2PW4DYank-6L-w@mail.gmail.com>

On 29/06/26 14:56, Tomas Glozar wrote:
> On Wed, 17 Jun 2026 at 15:18, Valentin Schneider
> <vschneid@redhat.com> wrote:
>> +/*
>> + * osnoise_ipi_cpu_handler - this is the handler for single CPU IPI events.
>> + */
>> +static int
>> +osnoise_ipi_cpu_handler(struct trace_seq *s, struct tep_record *record,
>> +                    struct tep_event *event, void *context)
>> +{
>> +       struct osnoise_tool *tool;
>> +       struct osnoise_params *params;
>> +       unsigned long long src_cpu, dst_cpu;
>> +       struct trace_instance *trace = context;
>> +
>> +       tool = container_of(trace, struct osnoise_tool, trace);
>> +       params = to_osnoise_params(tool->params);
>> +
>> +       src_cpu = record->cpu;
>> +       tep_get_field_val(s, event, "cpu", record, &dst_cpu, 1);
>> +
>> +       if (CPU_ISSET(dst_cpu, &params->common.monitored_cpus))
>> +               account_ipi(tool, src_cpu, dst_cpu);
>
> Do we need to retrieve and pass the src_cpu here? I get it if you plan
> on using it in the future, but as far as I understand, you are
> specifically tracking the destination CPU, not the source CPU. Same
> note applies to osnoise_ipi_cpumask_handler() below.
>

You're right, I fished out the src_cpu to have it available but it's not
being used ATM.

>> +
>> +       return 0;
>> +}
>> +
>> +static cpu_set_t cpumask_tmp_cpus;
>> +
>> +/*
>> + * osnoise_ipi_cpumask_handler - this is the handler for broadcasted IPI events.
>> + */
>> +static int
>> +osnoise_ipi_cpumask_handler(struct trace_seq *s, struct tep_record *record,
>> +                        struct tep_event *event, void *context)
>> +{
>> +       struct trace_instance *trace = context;
>> +       struct osnoise_tool *tool;
>> +       struct osnoise_params *params;
>> +       struct tep_format_field *field;
>> +       unsigned long long src_cpu;
>> +       cpu_set_t *event_cpus;
>> +       int len;
>> +
>> +       tool = container_of(trace, struct osnoise_tool, trace);
>> +       params = to_osnoise_params(tool->params);
>> +
>> +       src_cpu = record->cpu;
>> +
>> +       field = tep_find_field(event, "cpumask");
>> +       if (!field)
>> +               return 0;
>> +
>> +       event_cpus = tep_get_field_raw(s, event, "cpumask", record, &len, 1);
>> +       if (!event_cpus) {
>> +               err_msg("Failed to get cpumask field\n");
>> +               return 0;
>> +       }
>> +
>> +       CPU_AND(&cpumask_tmp_cpus, event_cpus, &params->common.monitored_cpus);
>> +
>> +       /*
>> +        * Computing the mask weight is overkill but there is no leaner option
>> +        * provided by glibc, e.g. cpumask_first() or somesuch.
>> +        */
>> +       if (CPU_COUNT(&cpumask_tmp_cpus)) {
>> +               for (int cpu = 0; cpu < nr_cpus; cpu++) {
>> +                       if (CPU_ISSET(cpu, &cpumask_tmp_cpus))
>> +                               account_ipi(tool, src_cpu, cpu);
>> +               }
>> +       }
>
> Technically, the existing code already relies on the glibc cpumask
> implementation (cpu_set_t) matching the kernel "cpumask_t" type, as
> the "cpumask" field is the latter (per
> /sys/kernel/tracing/events/ipi/ipi_send_cpumask/format), not the
> former. So I wouldn't worry about the opaqueness of cpu_set_t much.
>

Right, AFAICT that's the "canonical" type for passing cpumasks around
between userspace and kernelspace. e.g. for sched_getaffinity():

manpage:

    int sched_getaffinity(pid_t pid, size_t cpusetsize,
                          cpu_set_t *mask);

kernelside:

    SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
                    unsigned long __user *, user_mask_ptr)
    {
            cpumask_var_t mask;
            sched_getaffinity(pid, mask);
            copy_to_user(user_mask_ptr, cpumask_bits(mask), ...)
    }

> Not sure how this is handled in other tracing tools that need to use
> cpumask, I'd have to look around a bit. It might even make sense to
> have a "tools" version of the cpumask functions like cpumask_first(),
> I guess, like we already do for e.g. lists and container_of.
>

I couldn't find anything in tools/testing/* other than the CPU_*() helpers.
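
For reference, a rough sketch of what a "tools" flavour of cpumask_first()
could look like. It assumes the mask is laid out as an array of unsigned long
words, as the sched_getaffinity() example above suggests; mask_first_cpu()
and SKETCH_BITS_PER_LONG are hypothetical names, not existing rtla, glibc or
kernel helpers.

```c
#include <limits.h>

/* Bits per mask word; hypothetical name, local to this sketch. */
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the lowest CPU set in @bits, looking at the first @nr_cpus bits,
 * or -1 if none of them is set.
 */
static int mask_first_cpu(const unsigned long *bits, int nr_cpus)
{
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long word = bits[cpu / SKETCH_BITS_PER_LONG];

		if (word & (1UL << (cpu % SKETCH_BITS_PER_LONG)))
			return cpu;
	}

	return -1;
}
```

A bit-by-bit scan is the simplest correct form; a real helper would more
likely skip zero words and use a find-first-set instruction.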

>> +
>> +       return 0;
>> +}
>> +
>>  /*
>>   * osnoise_top_handler - this is the handler for osnoise tracer events
>>   */
>
> Nit: As this is extra functionality, it'd be more readable to have the
> IPI handling after the main top handler, so that someone not familiar
> with the source code will see the core logic first. That would also
> match IPI being displayed to the right of the other numbers in the top
> output.
>

Ack.

>> @@ -164,6 +251,8 @@ static void osnoise_top_header(struct osnoise_tool *top)
>>                 goto eol;
>>
>>         trace_seq_printf(s, "          IRQ      Softirq       Thread");
>> +       if (params->common.ipi)
>> +               trace_seq_printf(s, "          IPI");
>>
>>  eol:
>>         if (pretty)
>> @@ -218,7 +307,13 @@ static void osnoise_top_print(struct osnoise_tool *tool, int cpu)
>>
>>         trace_seq_printf(s, "%12llu ", cpu_data->irq_count);
>>         trace_seq_printf(s, "%12llu ", cpu_data->softirq_count);
>> -       trace_seq_printf(s, "%12llu\n", cpu_data->thread_count);
>> +       trace_seq_printf(s, "%12llu", cpu_data->thread_count);
>> +       if (!params->common.ipi) {
>> +               trace_seq_printf(s, "\n");
>> +               return;
>> +       }
>> +
>> +       trace_seq_printf(s, " %12llu\n", cpu_data->ipi_count);
>
> Maybe at this point it is worth it to print the "\n" in a separate
> statement, readability-wise:
>
>         trace_seq_printf(s, "%12llu ", cpu_data->irq_count);
>         trace_seq_printf(s, "%12llu ", cpu_data->softirq_count);
>         trace_seq_printf(s, "%12llu", cpu_data->thread_count);
>         if (params->common.ipi)
>                 trace_seq_printf(s, " %12llu", cpu_data->ipi_count);
>         trace_seq_printf(s, "\n");
>
> It would also make diffs nicer when adding new options.
>

Indeed, will do.

>> [truncated]
>
>
> Tomas


^ permalink raw reply

* Re: [RFC PATCH v2 4/4] rtla/osnoise: Leverage IPI event filters when tracing a subset of CPUs
From: Valentin Schneider @ 2026-06-30 13:59 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <CAP4=nvQ_x+LyDBoQWOBB4ERy8OwEp9LQzyLJJMHzYdrZzK4zyw@mail.gmail.com>

On 30/06/26 12:14, Tomas Glozar wrote:
> On Wed, 17 Jun 2026 at 15:18, Valentin Schneider
> <vschneid@redhat.com> wrote:
>> @@ -406,6 +408,33 @@ struct osnoise_tool *osnoise_init_top(struct common_params *params)
>>                 goto out_err;
>>         }
>>
>> +       /*
>> +        * If tracing on a subset of possible CPUs, leverage the kernel filtering
>> +        * infrastructure to only generate events on traced CPUs.
>> +        */
>> +       if (params->cpus) {
>> +               char filter[MAX_PATH];
>> +
>> +               snprintf(filter, ARRAY_SIZE(filter), "cpu & CPUS{%s}\n", params->cpus);
>> +               retval = tracefs_event_file_write(tool->trace.inst,
>> +                                                 "ipi", "ipi_send_cpu", "filter",
>> +                                                 filter);
>> +               if (retval) {
>
> retval is the number of bytes written here, so this should be "retval
> < 0" like in trace_event_enable_filter() in trace.c. Same below.
>

According to the docstring:

 * Return 0 on success, and -1 on error.

but regardless yes that should be a '< 0' check to match existing code.

>> +                       err_msg("Could not set ipi_send_cpu CPU filter\n");
>> +                       goto out_err;
>
> It would be useful to have --ipi work even on older kernels that don't
> yet have your cpumask trace event filter patchset [1], for example, by
> printing a debug message that filtering is disabled and setting a flag
> instead of erroring out here. Then the code in
> osnoise_ipi_cpu_handler() can preserve the CPU_ISSET check if the flag
> is set.
>
> As --ipi is optional, we can choose to only support it on newer
> kernels, but it would be nice to have it working without the filter,
> too.
>
> [1] https://lore.kernel.org/linux-trace-kernel/20230707172155.70873-1-vschneid@redhat.com/T/#u
>

Makes sense, will do.
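
A minimal sketch of that fallback for the single-CPU ipi_send_cpu case,
assuming a flag that records whether the kernel filter could be installed.
struct ipi_filter_state and both helpers are hypothetical names for
illustration, not existing rtla code.

```c
#include <stdbool.h>

/* Hypothetical state; rtla has no such struct today. */
struct ipi_filter_state {
	bool kernel_filter;	/* true if the tracefs filter was installed */
};

/*
 * Record the result of the filter write. Only a negative value is treated
 * as failure, matching the "< 0" check discussed above.
 */
static void ipi_filter_record(struct ipi_filter_state *st, int retval)
{
	st->kernel_filter = retval >= 0;
}

/*
 * Decide whether an IPI event should be accounted. With the kernel filter
 * in place, every delivered event already targets a monitored CPU; without
 * it, fall back to the userspace check (CPU_ISSET() in the patch).
 */
static bool ipi_should_account(const struct ipi_filter_state *st,
			       bool dst_is_monitored)
{
	return st->kernel_filter || dst_is_monitored;
}
```

On a kernel without the cpumask filter support the write would fail, the flag
would stay false, and every event would go through the userspace check
instead of aborting the tool.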

>> +               }
>> +
>> +
>> +               snprintf(filter, ARRAY_SIZE(filter), "cpumask & CPUS{%s}\n", params->cpus);
>> +               retval = tracefs_event_file_write(tool->trace.inst,
>> +                                                 "ipi", "ipi_send_cpumask", "filter",
>> +                                                 filter);
>> +               if (retval) {
>> +                       err_msg("Could not set ipi_send_cpumask CPU filter\n");
>> +                       goto out_err;
>> +               }
>
> Same two comments above apply here.
>
>> +       }
>> +
>>         tep_register_event_handler(tool->trace.tep, -1, "ipi", "ipi_send_cpu",
>>                                    osnoise_ipi_cpu_handler, NULL);
>>
>> --
>> 2.54.0
>>
>
> I was thinking that it might make sense to enable the filters also for
> the trace output instance. On the other hand, it would make it
> difficult to enable the event without the filter then, as specifying
> "-e ipi" or similar only re-enables the event but does not remove the
> filter. Maybe the better idea is to implement an option to filter any
> event enabled through -e/--event only to the measurement CPU, as a
> separate feature.
>

I had actually forgotten about applying the filters for the output
instance... I'll look into it.

> Tomas


^ permalink raw reply

* Re: [PATCH] ring-buffer: serialize read-page order with subbuffer resize
From: Steven Rostedt @ 2026-06-30 14:14 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Petr Pavlu,
	linux-trace-kernel, linux-kernel, stable,
	syzbot+2dd9d02f60775ce5c1fb
In-Reply-To: <20260628004653.28065-1-alhouseenyousef@gmail.com>

On Sun, 28 Jun 2026 02:46:53 +0200
Yousef Alhouseen <alhouseenyousef@gmail.com> wrote:

> ring_buffer_read_page() checks that its spare page has the current
> subbuffer order before taking cpu_buffer->reader_lock. A concurrent
> ring_buffer_subbuf_order_set() can change the order and replace the
> reader page after that check. The reader then copies a larger subbuffer
> into the old allocation, causing an out-of-bounds write.
> 
> Keep spare-page allocation and release under buffer->mutex, which already
> serializes order changes. Move the read-side order check under
> reader_lock, the lock used by resize when replacing per-CPU pages.
> 
> Fixes: f9b94daa542a ("ring-buffer: Set new size of the ring buffer sub page")
> Reported-by: syzbot+2dd9d02f60775ce5c1fb@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=2dd9d02f60775ce5c1fb
> Cc: stable@vger.kernel.org
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
>  kernel/trace/ring_buffer.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 56a328e94395..eed5d7cffdee 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -6950,6 +6950,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
>  	if (!cpumask_test_cpu(cpu, buffer->cpumask))
>  		return ERR_PTR(-ENODEV);
>  
> +	guard(mutex)(&buffer->mutex);
> +
>  	bpage = kzalloc_obj(*bpage);

First, do not grab locks around allocations unless they are really needed.
This is bad practice, as it extends the critical section and may even add
the allocation locking to the lock chain.

That said, just moving things around the current locks should work.

Like this (not compiled nor tested):

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 56a328e94395..8352f935a223 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6954,11 +6954,11 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 	if (!bpage)
 		return ERR_PTR(-ENOMEM);
 
-	bpage->order = buffer->subbuf_order;
 	cpu_buffer = buffer->buffers[cpu];
 	local_irq_save(flags);
 	arch_spin_lock(&cpu_buffer->lock);
 
+	bpage->order = buffer->subbuf_order;
 	if (cpu_buffer->free_page) {
 		bpage->data = cpu_buffer->free_page;
 		cpu_buffer->free_page = NULL;
@@ -7007,13 +7007,13 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
 	 * is different from the subbuffer order of the buffer -
 	 * we can't reuse it
 	 */
-	if (page_ref_count(page) > 1 || data_page->order != buffer->subbuf_order)
+	if (page_ref_count(page) > 1)
 		goto out;
 
 	local_irq_save(flags);
 	arch_spin_lock(&cpu_buffer->lock);
 
-	if (!cpu_buffer->free_page) {
+	if (!cpu_buffer->free_page && data_page->order == buffer->subbuf_order) {
 		cpu_buffer->free_page = dpage;
 		dpage = NULL;
 	}
@@ -7091,15 +7091,15 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	if (!data_page || !data_page->data)
 		return -1;
 
-	if (data_page->order != buffer->subbuf_order)
-		return -1;
-
 	dpage = data_page->data;
 	if (!dpage)
 		return -1;
 
 	guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock);
 
+	if (data_page->order != buffer->subbuf_order)
+		return -1;
+
 	reader = rb_get_reader_page(cpu_buffer);
 	if (!reader)
 		return -1;

-- Steve
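
The general pattern behind the diff, checking the order only while holding
the lock that the resize path also takes, can be sketched in userspace as
follows. This is an illustration with pthread mutexes and hypothetical
sketch_* names, not the ring buffer code, and it assumes the resize path
changes the order only with that same lock held.

```c
#include <pthread.h>

struct sketch_buffer {
	pthread_mutex_t lock;
	unsigned int order;	/* only changed with lock held */
};

struct sketch_page {
	unsigned int order;	/* order the spare page was allocated with */
};

/* Returns 0 if the spare page may be used for the copy, -1 otherwise. */
static int sketch_read_page(struct sketch_buffer *buf, struct sketch_page *page)
{
	int ret = 0;

	pthread_mutex_lock(&buf->lock);
	/* Checked under the lock, so a resize cannot slip in before the copy. */
	if (page->order != buf->order)
		ret = -1;
	/* ... the copy into the spare page would happen here, still locked ... */
	pthread_mutex_unlock(&buf->lock);

	return ret;
}

/* Resize side: the order is only ever updated under the same lock. */
static void sketch_set_order(struct sketch_buffer *buf, unsigned int order)
{
	pthread_mutex_lock(&buf->lock);
	buf->order = order;
	pthread_mutex_unlock(&buf->lock);
}
```

Doing the comparison before taking the lock would reopen the window described
in the commit message: the order could change between the check and the copy.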

^ permalink raw reply related

* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-06-30 14:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <akKf511eknMCE0AB@redhat.com>

On Mon, Jun 29, 2026 at 06:40:07PM +0200, Oleg Nesterov wrote:
> On 06/29, Jiri Olsa wrote:
> >
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -265,6 +265,10 @@ static bool is_prefix_bad(struct insn *insn)
> >
> >  		attr = inat_get_opcode_attribute(p);
> >  		switch (attr) {
> > +		case INAT_MAKE_PREFIX(INAT_PFX_CS):
> > +			if (insn->x86_64)
> > +				break;
> > +			fallthrough;
> >  		case INAT_MAKE_PREFIX(INAT_PFX_ES):
> >  		case INAT_MAKE_PREFIX(INAT_PFX_DS):
> >  		case INAT_MAKE_PREFIX(INAT_PFX_SS):
> >
> > or we could just skip it for nop10.. maybe that's better
> 
> Well, if you ask me I'd agree with the "maybe that's better" plan ;)
> I mean... I don't think that INAT_PFX_CS should be "special" in is_prefix_bad.
> 
> But, whatever you do - I agree, feel free to keep my r-b.

I ended up with the bigger change below, wdyt?

jirka


---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 92626fce06a9..521a120a0c78 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -266,6 +266,7 @@ static bool is_prefix_bad(struct insn *insn)
 		attr = inat_get_opcode_attribute(p);
 		switch (attr) {
 		case INAT_MAKE_PREFIX(INAT_PFX_ES):
+		case INAT_MAKE_PREFIX(INAT_PFX_CS):
 		case INAT_MAKE_PREFIX(INAT_PFX_DS):
 		case INAT_MAKE_PREFIX(INAT_PFX_SS):
 		case INAT_MAKE_PREFIX(INAT_PFX_LOCK):
@@ -275,15 +276,9 @@ static bool is_prefix_bad(struct insn *insn)
 	return false;
 }
 
-static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool x86_64)
+static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn)
 {
-	enum insn_mode m = x86_64 ? INSN_MODE_64 : INSN_MODE_32;
 	u32 volatile *good_insns;
-	int ret;
-
-	ret = insn_decode(insn, auprobe->insn, sizeof(auprobe->insn), m);
-	if (ret < 0)
-		return -ENOEXEC;
 
 	if (is_prefix_bad(insn))
 		return -ENOTSUPP;
@@ -292,7 +287,7 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 	if (insn_masking_exception(insn))
 		return -ENOTSUPP;
 
-	if (x86_64)
+	if (insn->x86_64)
 		good_insns = good_insns_64;
 	else
 		good_insns = good_insns_32;
@@ -1620,16 +1615,26 @@ static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
  */
 int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long addr)
 {
+	enum insn_mode m = is_64bit_mm(mm) ? INSN_MODE_64 : INSN_MODE_32;
 	u8 fix_ip_or_call = UPROBE_FIX_IP;
 	struct insn insn;
 	int ret;
 
-	ret = uprobe_init_insn(auprobe, &insn, is_64bit_mm(mm));
-	if (ret)
-		return ret;
+	ret = insn_decode(&insn, auprobe->insn, sizeof(auprobe->insn), m);
+	if (ret < 0)
+		return -ENOEXEC;
 
-	if (can_optimize(&insn, addr))
+	/*
+	 * No need to check instruction in uprobe_init_insn in case we
+	 * are on top of optimizable nop10.
+	 */
+	if (can_optimize(&insn, addr)) {
 		set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+	} else {
+		ret = uprobe_init_insn(auprobe, &insn);
+		if (ret)
+			return ret;
+	}
 
 	ret = branch_setup_xol_ops(auprobe, &insn);
 	if (ret != -ENOSYS)

^ permalink raw reply related

* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Xiaoyao Li @ 2026-06-30 15:22 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-7-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -				     unsigned long mask, unsigned long attrs);
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +					unsigned long mask, unsigned long attrs);
>   bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>   					struct kvm_gfn_range *range);
>   bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,

We have

  - kvm_pre_set_memory_attributes()
  - kvm_arch_pre_set_memory_attributes()
  - kvm_arch_post_set_memory_attributes()

left, do they need to be renamed as well?

then the interesting one is kvm_vm_set_mem_attributes(), which contains 
"vm" already while it means "vm ioctl". Do we need to rename it to
kvm_vm_set_vm_mem_attributes()?


^ permalink raw reply

* Re: [PATCH 07/30] mm/rmap: elide unnecessary static inline's in interval_tree.c
From: Gregory Price @ 2026-06-30 15:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Pedro Falcato, Rik van Riel,
	Harry Yoo, Jann Horn
In-Reply-To: <ed5fd5358382217a92f0a6afddcfaa030c933055.1782735110.git.ljs@kernel.org>

On Mon, Jun 29, 2026 at 01:23:18PM +0100, Lorenzo Stoakes wrote:
> It's not necessary to declare these functions static inline as they are
> contained within a single compilation unit.
> 
> This makes the anonymous interval tree code consistent with the newly
> updated file-backed interval tree code.
> 
> No functional change intended.
> 
> Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>

Reviewed-by: Gregory Price <gourry@gourry.net>


^ permalink raw reply

* Re: [PATCH 08/30] mm/rmap: rename vma_interval_tree_*() to mapping_interval_tree_*()
From: Gregory Price @ 2026-06-30 15:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Pedro Falcato, Rik van Riel,
	Harry Yoo, Jann Horn
In-Reply-To: <f95462457025370efd047b9dfb039e76bbddf58b.1782735110.git.ljs@kernel.org>

On Mon, Jun 29, 2026 at 01:23:19PM +0100, Lorenzo Stoakes wrote:
> The family of vma_interval_tree_() functions manipulate the
> address_space (which, of course, is generally referred to as 'mapping')
> reverse mapping, but are named the 'VMA' interval tree.
> 
> VMAs may be mapped by an anon_vma, an address_space, or both. Therefore
> calling the mapping interval tree a 'VMA' interval tree is rather
> confusing.
> 
> This is also inconsistent with the anon_vma_interval_tree_*() functions
> which explicitly reference the rmap object to which they pertain.
> 
> Rename the vma_interval_tree_*() functions to mapping_interval_tree_*() to
> correct this.
> 
> No functional change intended.
> 
> Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>

obligatory "naming is hard", this patch helps, thank you.

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply

* Re: [PATCH 09/30] mm/rmap: parameterise anon_vma_interval_tree_*() by anon_vma
From: Gregory Price @ 2026-06-30 15:46 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Pedro Falcato, Rik van Riel,
	Harry Yoo, Jann Horn
In-Reply-To: <1c1df7b905ef340cbf2effef769a4e770a8e0eb1.1782735110.git.ljs@kernel.org>

On Mon, Jun 29, 2026 at 01:23:20PM +0100, Lorenzo Stoakes wrote:
> Similar to what we did with mapping_interval_tree*(), let's declare
> anon_vma_interval_tree*() in terms of anon_vma rather than rb_root_cached.
> 
> In each case the rb tree referenced is &anon_vma->rb_root, so just pass
> anon_vma and the functions can figure this out themselves.
> 
> Additionally, rename 'node' to 'avc', 'index' to 'pgoff_start', and 'last'
> to 'pgoff_last' to make clear what is being passed.
>

would it be possible to split the pure rename changes out from the
changed function declarations?  It's hard to pick out this as something
that needs to be looked at as more than just a %s/x/y/

> +void anon_vma_interval_tree_insert(struct anon_vma_chain *avc,
> +				   struct anon_vma *anon_vma)
...
> -	__anon_vma_interval_tree_insert(node, root);
> +	__anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);

an annoying request, sorry

~Gregory

^ permalink raw reply

* Re: [PATCH 09/30] mm/rmap: parameterise anon_vma_interval_tree_*() by anon_vma
From: Lorenzo Stoakes @ 2026-06-30 15:49 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Pedro Falcato, Rik van Riel,
	Harry Yoo, Jann Horn
In-Reply-To: <akPk5o_gHD1SxX_0@gourry-fedora-PF4VCD3F>

On Tue, Jun 30, 2026 at 11:46:46AM -0400, Gregory Price wrote:
> On Mon, Jun 29, 2026 at 01:23:20PM +0100, Lorenzo Stoakes wrote:
> > Similar to what we did with mapping_interval_tree*(), let's declare
> > anon_vma_interval_tree*() in terms of anon_vma rather than rb_root_cached.
> >
> > In each case the rb tree referenced is &anon_vma->rb_root, so just pass
> > anon_vma and the functions can figure this out themselves.
> >
> > Additionally, rename 'node' to 'avc', 'index' to 'pgoff_start', and 'last'
> > to 'pgoff_last' to make clear what is being passed.
> >
>
> would it be possible to split the pure rename changes out from the
> changed function declarations?  It's hard to pick out this as something
> that needs to be looked at as more than just a %s/x/y/

Hmmm do I have to? :P I mean sure I can on a respin potentially, but it is a
pretty trivial change? Just mechanically as above.

>
> > +void anon_vma_interval_tree_insert(struct anon_vma_chain *avc,
> > +				   struct anon_vma *anon_vma)
> ...
> > -	__anon_vma_interval_tree_insert(node, root);
> > +	__anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);
>
> an annoying request, sorry

:)) well it's ok I've made enough annoying requests of my own on review :)

>
> ~Gregory

Cheers, Lorenzo


* Re: [PATCH 09/30] mm/rmap: parameterise anon_vma_interval_tree_*() by anon_vma
From: Gregory Price @ 2026-06-30 15:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Pedro Falcato, Rik van Riel,
	Harry Yoo, Jann Horn
In-Reply-To: <akPlUrNWzl1ZPw1S@lucifer>

On Tue, Jun 30, 2026 at 04:49:45PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 30, 2026 at 11:46:46AM -0400, Gregory Price wrote:
> > On Mon, Jun 29, 2026 at 01:23:20PM +0100, Lorenzo Stoakes wrote:
> > > Similar to what we did with mapping_interval_tree*(), let's declare
> > > anon_vma_interval_tree*() in terms of anon_vma rather than rb_root_cached.
> > >
> > > In each case the rb tree referenced is &anon_vma->rb_root, so just pass
> > > anon_vma and the functions can figure this out themselves.
> > >
> > > Additionally, rename 'node' to 'avc', 'index' to 'pgoff_start', and 'last'
> > > to 'pgoff_last' to make clear what is being passed.
> > >
> >
> > would it be possible to split the pure rename changes out from the
> > changed function declarations?  It's hard to pick out this as something
> > that needs to be looked at as more than just a %s/x/y/
> 
> Hmmm do I have to? :P 

I mean, no :]

> I mean sure I can on a respin potentially, but it is a
> pretty trivial change? Just mechanically as above.
> 

And yeah certainly not worth a respin.  Just learning some of the
friction points of reviewing as I spend a little more time doing it
every day.

~Gregory


