[PATCH 0/2] Fixes for recovery for machine check in TDX/SEAM non-root mode

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] Fixes for recovery for machine check in TDX/SEAM non-root mode
@ 2025-06-18 12:08 Adrian Hunter
  2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
  2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
  0 siblings, 2 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-06-18 12:08 UTC (permalink / raw)
  To: Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

Hi

Here are 2 small fixes related to recovery for machine check in TDX/SEAM
non-root mode.

The issues were noticed as part of work to determine the conditions under
which TDX private memory needs to be cleared after being reclaimed.
For guests with a large amount of memory, clearing all private pages during
VM shutdown can take minutes, so we are looking at when that can be
skipped.  A future patch will deal with that.

One thing that was investigated was the effect of deliberately corrupting a
TDX guest private page by writing to it on the host, and then reading it
on the guest, which results in a machine check as expected, but revealed
the issue addressed in patch 1.

Patch 2 follows on and ensures the poisoned page is not touched.

There are 2 outstanding issues:

1. It is assumed that once the TDX VM is shutdown that the memory is
returned to the allocator.  That is true at present, but may not be in the
future.  Consider, for example, patch set "New KVM ioctl to link a gmem
inode to a new gmem file" :

      https://lore.kernel.org/r/cover.1747368092.git.afranji@google.com/

2. Currently, KVM TDX does not cater for the TDX VM to enter a FATAL error
state, where the only operation permitted is to tear down the VM.  KVM just
carries on, hitting various errors, but in particular, memory reclaim fails
because it is not following the teardown procedure, and all guest private
memory is leaked.

Adrian Hunter (3):
      x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
      KVM: TDX: Do not clear poisoned pages

 arch/x86/kernel/cpu/mce/core.c | 3 ++-
 arch/x86/kvm/vmx/tdx.c         | 8 ++++----
 2 files changed, 6 insertions(+), 5 deletions(-)

Regards
Adrian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 12:08 [PATCH 0/2] Fixes for recovery for machine check in TDX/SEAM non-root mode Adrian Hunter
@ 2025-06-18 12:08 ` Adrian Hunter
  2025-06-18 12:36   ` Xiaoyao Li
                     ` (2 more replies)
  2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
  1 sibling, 3 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-06-18 12:08 UTC (permalink / raw)
  To: Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
valid physical address bits within the machine check bank address register.

This is particularly needed in the case of errors in TDX/SEAM non-root mode
because the reported address contains the TDX KeyID.  Refer to TDX and
TME-MK documentation for more information about KeyIDs.

Commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM
non-root mode") uses the address to mark the affected page as poisoned, but
omits to use the aforementioned mask.

Mask the address with MCI_ADDR_PHYSADDR so that the page can be found.

Fixes: 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e9b3c5d4a52e..76c4634c6a5f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		 * be added to free list when the guest is terminated.
 		 */
 		if (mce_usable_address(m)) {
-			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
+			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+			struct page *p = pfn_to_online_page(pfn);
 
 			if (p)
 				SetPageHWPoison(p);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-18 12:08 [PATCH 0/2] Fixes for recovery for machine check in TDX/SEAM non-root mode Adrian Hunter
  2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
@ 2025-06-18 12:08 ` Adrian Hunter
  2025-06-18 12:39   ` Xiaoyao Li
                     ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-06-18 12:08 UTC (permalink / raw)
  To: Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

Skip clearing a private page if it is marked as poisoned.

The machine check architecture may have the capability to recover from
memory corruption in SEAM non-root (i.e. TDX VM guest) mode.  In that
case the page is marked as poisoned, and the TDX Module puts the TDX VM
into a FATAL error state where the only operation permitted is to tear it
down.

During tear down, reclaimed pages are cleared which, in some cases,  helps
to avoid integrity violation or TD bit mismatch detection when later being
read using a shared HKID.

However a poisoned page will never be allocated again, so clearing is not
necessary, and in any case poisoned pages should not be touched.

Note that while it is possible that memory corruption arises from integrity
violation which could be cleared by MOVDIR64B, that is not necessarily the
cause of the machine check.

Suggested-by: Kai Huang <kai.huang@intel.com>
Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 457f91b95147..f4263f7a3924 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
 	void *dest = page_to_virt(page);
 	unsigned long i;

-	/*
-	 * The page could have been poisoned.  MOVDIR64B also clears
-	 * the poison bit so the kernel can safely use the page again.
-	 */
+	/* Machine check handler may have poisoned the page */
+	if (PageHWPoison(page))
+		return;
+
 	for (i = 0; i < PAGE_SIZE; i += 64)
 		movdir64b(dest + i, zero_page);
 	/*
-- 
2.48.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
@ 2025-06-18 12:36   ` Xiaoyao Li
  2025-06-18 14:55   ` Dave Hansen
  2025-06-18 23:20   ` Huang, Kai
  2 siblings, 0 replies; 35+ messages in thread
From: Xiaoyao Li @ 2025-06-18 12:36 UTC (permalink / raw)
  To: Adrian Hunter, Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

On 6/18/2025 8:08 PM, Adrian Hunter wrote:
> Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
> check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
> valid physical address bits within the machine check bank address register.
> 
> This is particularly needed in the case of errors in TDX/SEAM non-root mode
> because the reported address contains the TDX KeyID.  Refer to TDX and
> TME-MK documentation for more information about KeyIDs.
> 
> Commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM
> non-root mode") uses the address to mark the affected page as poisoned, but
> omits to use the aforementioned mask.
> 
> Mask the address with MCI_ADDR_PHYSADDR so that the page can be found.
> 
> Fixes: 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
> Cc: stable@vger.kernel.org

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>   arch/x86/kernel/cpu/mce/core.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index e9b3c5d4a52e..76c4634c6a5f 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   		 * be added to free list when the guest is terminated.
>   		 */
>   		if (mce_usable_address(m)) {
> -			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
> +			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
> +			struct page *p = pfn_to_online_page(pfn);
>   
>   			if (p)
>   				SetPageHWPoison(p);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
@ 2025-06-18 12:39   ` Xiaoyao Li
  2025-06-18 14:58   ` Dave Hansen
  2025-06-18 23:09   ` Huang, Kai
  2 siblings, 0 replies; 35+ messages in thread
From: Xiaoyao Li @ 2025-06-18 12:39 UTC (permalink / raw)
  To: Adrian Hunter, Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

On 6/18/2025 8:08 PM, Adrian Hunter wrote:
> Skip clearing a private page if it is marked as poisoned.
> 
> The machine check architecture may have the capability to recover from
> memory corruption in SEAM non-root (i.e. TDX VM guest) mode.  In that
> case the page is marked as poisoned, and the TDX Module puts the TDX VM
> into a FATAL error state where the only operation permitted is to tear it
> down.
> 
> During tear down, reclaimed pages are cleared which, in some cases,  helps
> to avoid integrity violation or TD bit mismatch detection when later being
> read using a shared HKID.
> 
> However a poisoned page will never be allocated again, so clearing is not
> necessary, and in any case poisoned pages should not be touched.
> 
> Note that while it is possible that memory corruption arises from integrity
> violation which could be cleared by MOVDIR64B, that is not necessarily the
> cause of the machine check.
> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure")
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 457f91b95147..f4263f7a3924 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
>   	void *dest = page_to_virt(page);
>   	unsigned long i;
>   
> -	/*
> -	 * The page could have been poisoned.  MOVDIR64B also clears
> -	 * the poison bit so the kernel can safely use the page again.
> -	 */
> +	/* Machine check handler may have poisoned the page */

This doesn't read correct. The page is not poisoned by the machine check 
handler. Machine check handler just marks the page as poisoned.

With it fixed,

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> +	if (PageHWPoison(page))
> +		return;
> +
>   	for (i = 0; i < PAGE_SIZE; i += 64)
>   		movdir64b(dest + i, zero_page);
>   	/*


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
  2025-06-18 12:36   ` Xiaoyao Li
@ 2025-06-18 14:55   ` Dave Hansen
  2025-06-19 11:57     ` Adrian Hunter
  2025-06-18 23:20   ` Huang, Kai
  2 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-18 14:55 UTC (permalink / raw)
  To: Adrian Hunter, Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

On 6/18/25 05:08, Adrian Hunter wrote:
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		 * be added to free list when the guest is terminated.
>  		 */
>  		if (mce_usable_address(m)) {
> -			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
> +			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
> +			struct page *p = pfn_to_online_page(pfn);

If ->addr isn't really an address that software can do much with,
shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()?

Maybe we should break it up into address and KeyID _there_.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
  2025-06-18 12:39   ` Xiaoyao Li
@ 2025-06-18 14:58   ` Dave Hansen
  2025-06-25 14:33     ` Vishal Annapurve
  2025-06-18 23:09   ` Huang, Kai
  2 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-18 14:58 UTC (permalink / raw)
  To: Adrian Hunter, Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

On 6/18/25 05:08, Adrian Hunter wrote:
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
>  	void *dest = page_to_virt(page);
>  	unsigned long i;
>  
> -	/*
> -	 * The page could have been poisoned.  MOVDIR64B also clears
> -	 * the poison bit so the kernel can safely use the page again.
> -	 */
> +	/* Machine check handler may have poisoned the page */
> +	if (PageHWPoison(page))
> +		return;

I think the old comment needs to stay in some form.

There are two kinds of poisons here: One from an integrity mismatch and
the other because the hardware decided the memory is bad. MOVDIR64B
clears the integrity one, but not the hardware one obviously.

Could we make that clear in the comment, please?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
  2025-06-18 12:39   ` Xiaoyao Li
  2025-06-18 14:58   ` Dave Hansen
@ 2025-06-18 23:09   ` Huang, Kai
  2 siblings, 0 replies; 35+ messages in thread
From: Huang, Kai @ 2025-06-18 23:09 UTC (permalink / raw)
  To: Luck, Tony, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	mingo@redhat.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, Yamahata, Isaku,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

On Wed, 2025-06-18 at 15:08 +0300, Hunter, Adrian wrote:
> Skip clearing a private page if it is marked as poisoned.
> 
> The machine check architecture may have the capability to recover from
							 ^
"to recover" -> "to allow the kernel to recover"?

> memory corruption in SEAM non-root (i.e. TDX VM guest) mode.  In that
> case the page is marked as poisoned, and the TDX Module puts the TDX VM

"marked as poisoned" -> "marked as poisoned in the kernel"?  Since next half
of this sentence immediately talks about TDX module behaviour.

> into a FATAL error state where the only operation permitted is to tear it
> down.
> 
> During tear down, reclaimed pages are cleared which, in some cases,  helps
								      ^

Double writespace in middle of sentence.

> to avoid integrity violation or TD bit mismatch detection when later being
> read using a shared HKID.
> 
> However a poisoned page will never be allocated again, so clearing is not
> necessary, and in any case poisoned pages should not be touched.
> 
> Note that while it is possible that memory corruption arises from integrity
> violation which could be cleared by MOVDIR64B, that is not necessarily the
> cause of the machine check.
> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure")
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

With comments from Xiaoyao/Dave fixed,

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
  2025-06-18 12:36   ` Xiaoyao Li
  2025-06-18 14:55   ` Dave Hansen
@ 2025-06-18 23:20   ` Huang, Kai
  2025-06-18 23:39     ` Luck, Tony
  2 siblings, 1 reply; 35+ messages in thread
From: Huang, Kai @ 2025-06-18 23:20 UTC (permalink / raw)
  To: Luck, Tony, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	mingo@redhat.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, Yamahata, Isaku,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

On Wed, 2025-06-18 at 15:08 +0300, Hunter, Adrian wrote:
> Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
> check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
> valid physical address bits within the machine check bank address register.
> 
> This is particularly needed in the case of errors in TDX/SEAM non-root mode
> because the reported address contains the TDX KeyID.  
> 

Just wondering, do you know whether this is documented anywhere?  If it is,
I think it should be helpful if you can refer that in the changelog.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 23:20   ` Huang, Kai
@ 2025-06-18 23:39     ` Luck, Tony
  2025-06-18 23:46       ` Luck, Tony
  2025-06-18 23:53       ` Huang, Kai
  0 siblings, 2 replies; 35+ messages in thread
From: Luck, Tony @ 2025-06-18 23:39 UTC (permalink / raw)
  To: Huang, Kai, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	mingo@redhat.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, Yamahata, Isaku,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

> > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
> > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
> > valid physical address bits within the machine check bank address register.
> >
> > This is particularly needed in the case of errors in TDX/SEAM non-root mode
> > because the reported address contains the TDX KeyID.
> >
>
> Just wondering, do you know whether this is documented anywhere?  If it is,
> I think it should be helpful if you can refer that in the changelog.

It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR
with the footnote in the diagram:

  "Useful bits in this field depend on the address methodology in use when the
   the register state is saved."

Maybe there is something more explicit in documentation for memory encryption?

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 23:39     ` Luck, Tony
@ 2025-06-18 23:46       ` Luck, Tony
  2025-06-18 23:57         ` Huang, Kai
  2025-06-18 23:53       ` Huang, Kai
  1 sibling, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2025-06-18 23:46 UTC (permalink / raw)
  To: Huang, Kai, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	mingo@redhat.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, Yamahata, Isaku,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

> It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR
> with the footnote in the diagram:
>
>   "Useful bits in this field depend on the address methodology in use when the
>    the register state is saved."
>
> Maybe there is something more explicit in documentation for memory encryption?


Section 5.1 in 
https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf

shows how the upper bits of the physical address are used for the :KeyID"

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 23:39     ` Luck, Tony
  2025-06-18 23:46       ` Luck, Tony
@ 2025-06-18 23:53       ` Huang, Kai
  1 sibling, 0 replies; 35+ messages in thread
From: Huang, Kai @ 2025-06-18 23:53 UTC (permalink / raw)
  To: Luck, Tony, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, kirill.shutemov@linux.intel.com,
	binbin.wu@linux.intel.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, mingo@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, tglx@linutronix.de,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

On Wed, 2025-06-18 at 23:39 +0000, Luck, Tony wrote:
> > > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
> > > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
> > > valid physical address bits within the machine check bank address register.
> > > 
> > > This is particularly needed in the case of errors in TDX/SEAM non-root mode
> > > because the reported address contains the TDX KeyID.
> > > 
> > 
> > Just wondering, do you know whether this is documented anywhere?  If it is,
> > I think it should be helpful if you can refer that in the changelog.
> 
> It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR
> with the footnote in the diagram:
> 
>   "Useful bits in this field depend on the address methodology in use when the
>    the register state is saved."

Thanks for the info.

> 
> Maybe there is something more explicit in documentation for memory encryption?

I didn't find any.  The TDX module base architecture spec (16.2.3. Memory
Integrity Error Logging, Machine Checks and Unbreakable Shutdowns) says
below:

  The poison memory address, at a granularity no finer than 32 bytes, is
  logged in IA32_MCi_ADDRESS MSRs.

It doesn't explicitly say KeyID is appended.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 23:46       ` Luck, Tony
@ 2025-06-18 23:57         ` Huang, Kai
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Kai @ 2025-06-18 23:57 UTC (permalink / raw)
  To: Luck, Tony, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, kirill.shutemov@linux.intel.com,
	binbin.wu@linux.intel.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, mingo@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, tglx@linutronix.de,
	linux-edac@vger.kernel.org, hpa@zytor.com, Annapurve, Vishal,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org

On Wed, 2025-06-18 at 23:46 +0000, Luck, Tony wrote:
> > It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR
> > with the footnote in the diagram:
> > 
> >   "Useful bits in this field depend on the address methodology in use when the
> >    the register state is saved."
> > 
> > Maybe there is something more explicit in documentation for memory encryption?
> 
> 
> Section 5.1 in 
> https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf
> 
> shows how the upper bits of the physical address are used for the :KeyID"
> 
> -Tony

Yeah. So I guess it's somehow implied the KeyID bits, which are "useful
bits", are also recorded in IA32_MCi_ADDR.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-18 14:55   ` Dave Hansen
@ 2025-06-19 11:57     ` Adrian Hunter
  2025-06-27 15:23       ` Adrian Hunter
  0 siblings, 1 reply; 35+ messages in thread
From: Adrian Hunter @ 2025-06-19 11:57 UTC (permalink / raw)
  To: Dave Hansen, Tony Luck, pbonzini, seanjc
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

On 18/06/2025 17:55, Dave Hansen wrote:
> On 6/18/25 05:08, Adrian Hunter wrote:
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>>  		 * be added to free list when the guest is terminated.
>>  		 */
>>  		if (mce_usable_address(m)) {
>> -			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
>> +			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
>> +			struct page *p = pfn_to_online_page(pfn);
> 
> If ->addr isn't really an address that software can do much with,
> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()?

Would that mean no one would know if the mce addr had KeyID bits or not?

> 
> Maybe we should break it up into address and KeyID _there_.

Could we deal with any tidy-ups in separate patches?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-18 14:58   ` Dave Hansen
@ 2025-06-25 14:33     ` Vishal Annapurve
  2025-06-25 16:25       ` Adrian Hunter
  0 siblings, 1 reply; 35+ messages in thread
From: Vishal Annapurve @ 2025-06-25 14:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Adrian Hunter, Tony Luck, pbonzini, seanjc, Borislav Petkov,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H Peter Anvin,
	linux-edac, linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov,
	kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao

On Wed, Jun 18, 2025 at 7:58 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/18/25 05:08, Adrian Hunter wrote:
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
> >       void *dest = page_to_virt(page);
> >       unsigned long i;
> >
> > -     /*
> > -      * The page could have been poisoned.  MOVDIR64B also clears
> > -      * the poison bit so the kernel can safely use the page again.
> > -      */
> > +     /* Machine check handler may have poisoned the page */
> > +     if (PageHWPoison(page))
> > +             return;

IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not
going to cause any trouble.

This check should be (unlikely(PageHWPoison(page)) and even better
probably should be omitted altogether if there are no side effects of
direct store to hwpoisoned pages.

>
> I think the old comment needs to stay in some form.
>
> There are two kinds of poisons here: One from an integrity mismatch and
> the other because the hardware decided the memory is bad. MOVDIR64B
> clears the integrity one, but not the hardware one obviously.

To ensure I understand correctly, Am I correct in saying: movdir64b
clearing the integrity poison is just hardware clearing the poison
bit, software will still treat that page as poisoned?

>
> Could we make that clear in the comment, please?
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 14:33     ` Vishal Annapurve
@ 2025-06-25 16:25       ` Adrian Hunter
  2025-06-25 16:31         ` Dave Hansen
                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-06-25 16:25 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Tony Luck, pbonzini, seanjc, Borislav Petkov, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H Peter Anvin, linux-edac,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao, Dave Hansen

On 25/06/2025 17:33, Vishal Annapurve wrote:
> On Wed, Jun 18, 2025 at 7:58 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 6/18/25 05:08, Adrian Hunter wrote:
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
>>>       void *dest = page_to_virt(page);
>>>       unsigned long i;
>>>
>>> -     /*
>>> -      * The page could have been poisoned.  MOVDIR64B also clears
>>> -      * the poison bit so the kernel can safely use the page again.
>>> -      */
>>> +     /* Machine check handler may have poisoned the page */
>>> +     if (PageHWPoison(page))
>>> +             return;
> 
> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not
> going to cause any trouble.

No.  PageHWPoison(page) means the page should not be touched.  It must
be freed back to the allocator where it will never be allocated again.

> 
> This check should be (unlikely(PageHWPoison(page)) and even better

'unlikely' would be fine

> probably should be omitted altogether if there are no side effects of
> direct store to hwpoisoned pages.
> 
>>
>> I think the old comment needs to stay in some form.
>>
>> There are two kinds of poisons here: One from an integrity mismatch and
>> the other because the hardware decided the memory is bad. MOVDIR64B
>> clears the integrity one, but not the hardware one obviously.
> 
> To ensure I understand correctly, Am I correct in saying: movdir64b
> clearing the integrity poison is just hardware clearing the poison
> bit, software will still treat that page as poisoned?

Typically an integrity violation would have caused a machine check
and the machine check handler would have marked the page
SetPageHWPoison(page).

So we really end up with only 2 cases:
1. page is fine and PageHWPoison(page) is false
2. page may have had an integrity violation or a hardware error
(we can't tell which), and PageHWPoison(page) is true


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 16:25       ` Adrian Hunter
@ 2025-06-25 16:31         ` Dave Hansen
  2025-06-25 16:42           ` Adrian Hunter
  2025-06-25 16:42         ` Edgecombe, Rick P
  2025-06-25 22:32         ` Huang, Kai
  2 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-25 16:31 UTC (permalink / raw)
  To: Adrian Hunter, Vishal Annapurve
  Cc: Tony Luck, pbonzini, seanjc, Borislav Petkov, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H Peter Anvin, linux-edac,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao

On 6/25/25 09:25, Adrian Hunter wrote:
>> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not
>> going to cause any trouble.
> No.  PageHWPoison(page) means the page should not be touched.  It must
> be freed back to the allocator where it will never be allocated again.

What's the end-user-visible effect if the page is touched in this
specific function in this specific way (with movdir64b)?

In other words, what does this patch do for end users?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 16:31         ` Dave Hansen
@ 2025-06-25 16:42           ` Adrian Hunter
  2025-06-25 16:57             ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Adrian Hunter @ 2025-06-25 16:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tony Luck, pbonzini, seanjc, Borislav Petkov, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H Peter Anvin, linux-edac,
	linux-kernel, kvm, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao, Vishal Annapurve, rick.p.edgecombe

On 25/06/2025 19:31, Dave Hansen wrote:
> On 6/25/25 09:25, Adrian Hunter wrote:
>>> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not
>>> going to cause any trouble.
>> No.  PageHWPoison(page) means the page should not be touched.  It must
>> be freed back to the allocator where it will never be allocated again.
> 
> What's the end-user-visible effect if the page is touched in this
> specific function in this specific way (with movdir64b)?
> 
> In other words, what does this patch do for end users?

We have another patch that clarifies that.  It turns out we need
to clear pages (MOVDIR64B) only if the platform has the so-called
partial-write errata X86_BUG_TDX_PW_MCE.

It was decided to deal with that issue separately, but maybe it
needs to bundled in the same patch set, now that the discussion
has moved in that direction?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 16:25       ` Adrian Hunter
  2025-06-25 16:31         ` Dave Hansen
@ 2025-06-25 16:42         ` Edgecombe, Rick P
  2025-06-25 22:32         ` Huang, Kai
  2 siblings, 0 replies; 35+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25 16:42 UTC (permalink / raw)
  To: Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Luck, Tony,
	dave.hansen@linux.intel.com, Zhao, Yan Y, Hansen, Dave,
	linux-kernel@vger.kernel.org, seanjc@google.com, Chatre, Reinette,
	pbonzini@redhat.com, mingo@redhat.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, linux-edac@vger.kernel.org, hpa@zytor.com,
	bp@alien8.de, tony.lindgren@linux.intel.com, Gao, Chao,
	x86@kernel.org

On Wed, 2025-06-25 at 19:25 +0300, Adrian Hunter wrote:
> > To ensure I understand correctly, Am I correct in saying: movdir64b
> > clearing the integrity poison is just hardware clearing the poison
> > bit, software will still treat that page as poisoned?
> 
> Typically an integrity violation would have caused a machine check
> and the machine check handler would have marked the page
> SetPageHWPoison(page).
> 
> So we really end up with only 2 cases:
> 1. page is fine and PageHWPoison(page) is false
> 2. page may have had an integrity violation or a hardware error
> (we can't tell which), and PageHWPoison(page) is true

Could an access in userspace take an #MC, and have the kernel set the direct map
for the page to NP and kill process? Then another process with access to the
gmem fd tries to map it as private, and take a non-integrity related SEAMMODE
#MC and ends up here? It does seem safer to just avoid touching it. But I bet
there are many cases like this, and we can't check poison everywhere. 

If guestmemfd turns into more of a persistent allocator, instead of a per-VM
thing, it probably needs to do some of the checks that the page allocator does,
like poison checks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 16:42           ` Adrian Hunter
@ 2025-06-25 16:57             ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2025-06-25 16:57 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Tony Luck, pbonzini, seanjc, Borislav Petkov, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H Peter Anvin, linux-edac,
	linux-kernel, kvm, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao, Vishal Annapurve, rick.p.edgecombe

On 6/25/25 09:42, Adrian Hunter wrote:
> On 25/06/2025 19:31, Dave Hansen wrote:
>> On 6/25/25 09:25, Adrian Hunter wrote:
>>>> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not
>>>> going to cause any trouble.
>>> No.  PageHWPoison(page) means the page should not be touched.  It must
>>> be freed back to the allocator where it will never be allocated again.
>>
>> What's the end-user-visible effect if the page is touched in this
>> specific function in this specific way (with movdir64b)?
>>
>> In other words, what does this patch do for end users?
> 
> We have another patch that clarifies that.  It turns out we need
> to clear pages (MOVDIR64B) only if the platform has the so-called
> partial-write errata X86_BUG_TDX_PW_MCE.
> 
> It was decided to deal with that issue separately, but maybe it
> needs to bundled in the same patch set, now that the discussion
> has moved in that direction?

It sure sounds like it to me.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 16:25       ` Adrian Hunter
  2025-06-25 16:31         ` Dave Hansen
  2025-06-25 16:42         ` Edgecombe, Rick P
@ 2025-06-25 22:32         ` Huang, Kai
  2025-06-25 22:38           ` Dave Hansen
  2 siblings, 1 reply; 35+ messages in thread
From: Huang, Kai @ 2025-06-25 22:32 UTC (permalink / raw)
  To: Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Luck, Tony,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Zhao, Yan Y, Hansen, Dave, linux-kernel@vger.kernel.org,
	seanjc@google.com, Chatre, Reinette, pbonzini@redhat.com,
	mingo@redhat.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, linux-edac@vger.kernel.org, hpa@zytor.com,
	Edgecombe, Rick P, bp@alien8.de, Gao, Chao, x86@kernel.org


> 2. page may have had an integrity violation or a hardware error
> (we can't tell which), and PageHWPoison(page) is true

Right.  I think the point of avoiding MOVDIR64B to such page is we cannot
tell whether it is a hardware error or not.  If it is a hardware error,
touching it using MOVDIR64B may cause additional #MC which will panic kernel
since now the #MC happens in the kernel context.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 22:32         ` Huang, Kai
@ 2025-06-25 22:38           ` Dave Hansen
  2025-06-26  1:19             ` Huang, Kai
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-25 22:38 UTC (permalink / raw)
  To: Huang, Kai, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Luck, Tony,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Zhao, Yan Y, linux-kernel@vger.kernel.org, seanjc@google.com,
	Chatre, Reinette, pbonzini@redhat.com, mingo@redhat.com,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku,
	linux-edac@vger.kernel.org, hpa@zytor.com, Edgecombe, Rick P,
	bp@alien8.de, Gao, Chao, x86@kernel.org

On 6/25/25 15:32, Huang, Kai wrote:
>> 2. page may have had an integrity violation or a hardware error
>> (we can't tell which), and PageHWPoison(page) is true
> Right.  I think the point of avoiding MOVDIR64B to such page is we cannot
> tell whether it is a hardware error or not.  If it is a hardware error,
> touching it using MOVDIR64B may cause additional #MC which will panic kernel
> since now the #MC happens in the kernel context.

First and foremost, does the code path in question in this case touch
userspace pages? Or pages that are only "kernel context" in the first place.

Second, if we can't tell the difference between integrity violation or a
hardware error and this code needs to clear "integrity violation" poison
then won't this change just fundamentally break the erratum workaround
in the first place?!?!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-25 22:38           ` Dave Hansen
@ 2025-06-26  1:19             ` Huang, Kai
  2025-06-26 15:31               ` Luck, Tony
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Kai @ 2025-06-26  1:19 UTC (permalink / raw)
  To: Hansen, Dave, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Luck, Tony,
	tony.lindgren@linux.intel.com, Chatre, Reinette,
	seanjc@google.com, pbonzini@redhat.com, tglx@linutronix.de,
	Yamahata, Isaku, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	hpa@zytor.com, Edgecombe, Rick P, bp@alien8.de, Gao, Chao,
	x86@kernel.org

On Wed, 2025-06-25 at 15:38 -0700, Dave Hansen wrote:
> On 6/25/25 15:32, Huang, Kai wrote:
> > > 2. page may have had an integrity violation or a hardware error
> > > (we can't tell which), and PageHWPoison(page) is true
> > Right.  I think the point of avoiding MOVDIR64B to such page is we cannot
> > tell whether it is a hardware error or not.  If it is a hardware error,
> > touching it using MOVDIR64B may cause additional #MC which will panic kernel
> > since now the #MC happens in the kernel context.
> 
> First and foremost, does the code path in question in this case touch
> userspace pages? Or pages that are only "kernel context" in the first place.

The #MC happens when in SEAM non-root mode, i.e., when TDX guest tries to
read it using its TDX private keyID, so practically it is a userspace page.

But for such #MC it doesn't matter because the machine check handler handles
#MC from SEAM non-root separately:

do_machine_check(...)
{
	...

        } else if (m->mcgstatus & MCG_STATUS_SEAM_NR) {
                /*
                 * Saved RIP on stack makes it look like the machine check 
                 * was taken in the kernel on the instruction following    
                 * the entry to SEAM mode. But MCG_STATUS_SEAM_NR indicates
                 * that the machine check was taken inside SEAM non-root   
                 * mode.  CPU core has already marked that guest as dead.  
                 * It is OK for the kernel to resume execution at the      
                 * apparent point of the machine check as the fault did    
                 * not occur there. Mark the page as poisoned so it won't  
                 * be added to free list when the guest is terminated.     
                 */             
                if (mce_usable_address(m)) {                               
                        struct page *p = pfn_to_online_page(m->addr >>
PAGE_SHIFT);                      
                                                                                                         
                        if (p)                                             
                                SetPageHWPoison(p);                        
                }
	...
}

However if the kernel touch the page again using MOVDIR64B, the further #MC
won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM
non-root), therefore it will be treated as a normal kernel #MC which will
result in kernel panic.

> 
> Second, if we can't tell the difference between integrity violation or a
> hardware error and this code needs to clear "integrity violation" poison
> then won't this change just fundamentally break the erratum workaround
> in the first place?!?!

The existing upstream code clears "integrity violation" unconditionally,
which IIUC doesn't work if the page is poisoned due to "hardware error". 
This patch avoids clearing integrity violation to avoid this, only if the
page is marked as poisoned.

The erratum workaround (using MOVDIR64B to clear) works if the page is not
poisoned.  I don't think it will break the erratum workaround.

The whole assumption is if the page is marked as poisoned, the kernel will
never used it again.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-26  1:19             ` Huang, Kai
@ 2025-06-26 15:31               ` Luck, Tony
  2025-06-26 22:20                 ` Huang, Kai
  0 siblings, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2025-06-26 15:31 UTC (permalink / raw)
  To: Huang, Kai, Hansen, Dave, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, hpa@zytor.com, Edgecombe, Rick P,
	bp@alien8.de, Gao, Chao, x86@kernel.org

> However if the kernel touch the page again using MOVDIR64B, the further #MC
> won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM
> non-root), therefore it will be treated as a normal kernel #MC which will
> result in kernel panic.

Intel CPUs signal #MC when an instruction that is trying to consume poison data
is about to retire.

Stores to memory do not consume poison, so will not signal a #MC.

In the MOVDIR64B case an entire cache line is stored in a single atomic
write. This will clear the poison state of the cacheline (assuming that the
poison was due to an integrity error, memory error injection, I/O error etc.
If the DIMM is bad and has stuck bits, then the memory may still fail ECC
check on the next read.)

Using smaller stores to overwrite the cache line will not clear poison. The
cacheline is read from memory to some cache level, the small store updates
some bytes in the line, but the poison flag remains. Note that this doesn't
trigger #MC because the poison data is not being consumed, it still isn't
architecturally visible in some register, memory, or I/O device.

You may still see a UCNA signature signaled with CMCI from the memory
controller if either case resulted in a speculative prefetch of the poisoned
cache line.

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-26 15:31               ` Luck, Tony
@ 2025-06-26 22:20                 ` Huang, Kai
  2025-06-26 22:33                   ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Kai @ 2025-06-26 22:20 UTC (permalink / raw)
  To: Luck, Tony, Hansen, Dave, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, seanjc@google.com, linux-kernel@vger.kernel.org,
	tglx@linutronix.de, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	linux-edac@vger.kernel.org, hpa@zytor.com, Chatre, Reinette,
	pbonzini@redhat.com, Edgecombe, Rick P, bp@alien8.de, Gao, Chao,
	x86@kernel.org

On Thu, 2025-06-26 at 15:31 +0000, Luck, Tony wrote:
> > However if the kernel touch the page again using MOVDIR64B, the further #MC
> > won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM
> > non-root), therefore it will be treated as a normal kernel #MC which will
> > result in kernel panic.
> 
> Intel CPUs signal #MC when an instruction that is trying to consume poison data
> is about to retire.
> 
> Stores to memory do not consume poison, so will not signal a #MC.
> 
> In the MOVDIR64B case an entire cache line is stored in a single atomic
> write. This will clear the poison state of the cacheline (assuming that the
> poison was due to an integrity error, memory error injection, I/O error etc.
> If the DIMM is bad and has stuck bits, then the memory may still fail ECC
> check on the next read.)
> 
> Using smaller stores to overwrite the cache line will not clear poison. The
> cacheline is read from memory to some cache level, the small store updates
> some bytes in the line, but the poison flag remains. Note that this doesn't
> trigger #MC because the poison data is not being consumed, it still isn't
> architecturally visible in some register, memory, or I/O device.
> 
> You may still see a UCNA signature signaled with CMCI from the memory
> controller if either case resulted in a speculative prefetch of the poisoned
> cache line.
> 
> -Tony

Thanks for the info.  :-)

So it seems MOVDIR64B to a bad memory won't necessarily trigger #MC when the
written is performed.

But IMHO we may should just have a simple policy that when a page is marked
as poisoned, it should never be touched again.  It's only one page anyway
(for one TD) so losing that doesn't seem bad to me.  If we want to clear the
poisoned page, then perhaps we should mark that page to be not-poisoned
again.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-26 22:20                 ` Huang, Kai
@ 2025-06-26 22:33                   ` Dave Hansen
  2025-06-27  0:56                     ` Huang, Kai
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-26 22:33 UTC (permalink / raw)
  To: Huang, Kai, Luck, Tony, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, seanjc@google.com, linux-kernel@vger.kernel.org,
	tglx@linutronix.de, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	linux-edac@vger.kernel.org, hpa@zytor.com, Chatre, Reinette,
	pbonzini@redhat.com, Edgecombe, Rick P, bp@alien8.de, Gao, Chao,
	x86@kernel.org

On 6/26/25 15:20, Huang, Kai wrote:
> But IMHO we may should just have a simple policy that when a page is marked
> as poisoned, it should never be touched again.  It's only one page anyway
> (for one TD) so losing that doesn't seem bad to me.  If we want to clear the
> poisoned page, then perhaps we should mark that page to be not-poisoned
> again.

The simplest policy is to do nothing.

The kernel only has 29 places that check PageHWPoison(). I'd guess that
roughly half of those are the memory-failure.c infrastructure and
bare-minimum code to handle poison, like not allowing pages to go back
into the allocator.

There are something like 5,000 lines of code in the kernel that deal
with a literal 'struct page'. 29 checks for ~5,000 sites is pretty
minuscule. We obviously don't have a policy that every place that uses
'struct page' needs to check for poison. We also don't even have a
policy where writes to or reads from a page check for poison.

Why is this TDX code so special that PageHWPoison() needs to be checked.
For instance:

$ grep -r PageHWPoison arch/x86/
arch/x86/kernel/cpu/mce/core.c:	SetPageHWPoison(p);
arch/x86/kernel/cpu/mce/core.c:	SetPageHWPoison(p);

In other words, this would be the *ONLY* arch/x86 site. Why?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] KVM: TDX: Do not clear poisoned pages
  2025-06-26 22:33                   ` Dave Hansen
@ 2025-06-27  0:56                     ` Huang, Kai
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Kai @ 2025-06-27  0:56 UTC (permalink / raw)
  To: Luck, Tony, Hansen, Dave, Hunter, Adrian, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, seanjc@google.com,
	tony.lindgren@linux.intel.com, tglx@linutronix.de,
	Yamahata, Isaku, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	linux-edac@vger.kernel.org, hpa@zytor.com, mingo@redhat.com,
	Edgecombe, Rick P, kirill.shutemov@linux.intel.com, bp@alien8.de,
	x86@kernel.org, Gao, Chao

On Thu, 2025-06-26 at 15:33 -0700, Hansen, Dave wrote:
> On 6/26/25 15:20, Huang, Kai wrote:
> > But IMHO we may should just have a simple policy that when a page is marked
> > as poisoned, it should never be touched again.  It's only one page anyway
> > (for one TD) so losing that doesn't seem bad to me.  If we want to clear the
> > poisoned page, then perhaps we should mark that page to be not-poisoned
> > again.
> 
> The simplest policy is to do nothing.
> 
> The kernel only has 29 places that check PageHWPoison(). I'd guess that
> roughly half of those are the memory-failure.c infrastructure and
> bare-minimum code to handle poison, like not allowing pages to go back
> into the allocator.
> 
> There are something like 5,000 lines of code in the kernel that deal
> with a literal 'struct page'. 29 checks for ~5,000 sites is pretty
> minuscule. We obviously don't have a policy that every place that uses
> 'struct page' needs to check for poison. We also don't even have a
> policy where writes to or reads from a page check for poison.

It was my understanding that if page is marked as poisoned the kernel should
not touch that again.  I thought the kernel should have already implemented
in this way, like not allowing pages to go back to the allocator, and the
places that use 'struct page' you mentioned should already know the page is
not poisoned.

That being said it's just my guess, so my bad.

> 
> Why is this TDX code so special that PageHWPoison() needs to be checked.
> For instance:
> 
> $ grep -r PageHWPoison arch/x86/
> arch/x86/kernel/cpu/mce/core.c:	SetPageHWPoison(p);
> arch/x86/kernel/cpu/mce/core.c:	SetPageHWPoison(p);
> 
> In other words, this would be the *ONLY* arch/x86 site. Why?

This is the case that I know the kernel could touch poisoned page.  And I
didn't know writing to hardware error memory is fine, but I thought we
should just skip it for safety.

But per Tony it should be fine to write to it, so I am fine to not have the
PageHWPoison() check here.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-19 11:57     ` Adrian Hunter
@ 2025-06-27 15:23       ` Adrian Hunter
  2025-06-27 15:25         ` Dave Hansen
  2025-06-27 16:28         ` Luck, Tony
  0 siblings, 2 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-06-27 15:23 UTC (permalink / raw)
  To: Dave Hansen, Tony Luck
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao, pbonzini, seanjc

On 19/06/2025 14:57, Adrian Hunter wrote:
> On 18/06/2025 17:55, Dave Hansen wrote:
>> On 6/18/25 05:08, Adrian Hunter wrote:
>>> --- a/arch/x86/kernel/cpu/mce/core.c
>>> +++ b/arch/x86/kernel/cpu/mce/core.c
>>> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>>>  		 * be added to free list when the guest is terminated.
>>>  		 */
>>>  		if (mce_usable_address(m)) {
>>> -			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
>>> +			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
>>> +			struct page *p = pfn_to_online_page(pfn);
>>
>> If ->addr isn't really an address that software can do much with,
>> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()?
> 
> Would that mean no one would know if the mce addr had KeyID bits or not?

Current design, to keep the bits in mce addr, is from Tony's patch:

	x86/mce: Mask out non-address bits from machine check bank
	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a01ec97dc066009dd89e43bfcf55644f2dd6d19

Assuming that is not altered, a tidy-up is still possible like:

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6c77c03139f7..b469b7a7ecfa 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -386,4 +386,14 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c)		{ }
 
 unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len);
 
+static inline unsigned long mce_addr_to_phys(u64 mce_addr)
+{
+	return mce_addr & MCI_ADDR_PHYSADDR;
+}
+
+static inline unsigned long mce_addr_to_pfn(u64 mce_addr)
+{
+	return mce_addr_to_phys(mce_addr) >> PAGE_SHIFT;
+}
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 76c4634c6a5f..e9e8c377790f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -642,7 +642,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 	    mce->severity != MCE_DEFERRED_SEVERITY)
 		return NOTIFY_DONE;
 
-	pfn = (mce->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+	pfn = mce_addr_to_pfn(mce->addr);
 	if (!memory_failure(pfn, 0)) {
 		set_mce_nospec(pfn);
 		mce->kflags |= MCE_HANDLED_UC;
@@ -1412,7 +1412,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+	pfn = mce_addr_to_pfn(p->mce_addr);
 	ret = memory_failure(pfn, flags);
 	if (!ret) {
 		set_mce_nospec(pfn);
@@ -1441,7 +1441,7 @@ static void kill_me_never(struct callback_head *cb)
 
 	p->mce_count = 0;
 	pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
-	pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+	pfn = mce_addr_to_pfn(p->mce_addr);
 	if (!memory_failure(pfn, 0))
 		set_mce_nospec(pfn);
 }
@@ -1665,7 +1665,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		 * be added to free list when the guest is terminated.
 		 */
 		if (mce_usable_address(m)) {
-			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+			unsigned long pfn = mce_addr_to_pfn(m->addr);
 			struct page *p = pfn_to_online_page(pfn);
 
 			if (p)
diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c
index ff8d078c6ca1..f3c4d6a5f159 100644
--- a/drivers/cxl/core/mce.c
+++ b/drivers/cxl/core/mce.c
@@ -24,7 +24,7 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 	if (!endpoint)
 		return NOTIFY_DONE;
 
-	spa = mce->addr & MCI_ADDR_PHYSADDR;
+	spa = mce_addr_to_phys(mce->addr);
 
 	pfn = spa >> PAGE_SHIFT;
 	if (!pfn_valid(pfn))
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index c9ade45c1a99..83fcec743ea7 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -732,7 +732,7 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 
 	memset(&res, 0, sizeof(res));
 	res.mce  = mce;
-	res.addr = mce->addr & MCI_ADDR_PHYSADDR;
+	res.addr = mce_addr_to_phys(mce->addr);
 	if (!pfn_to_online_page(res.addr >> PAGE_SHIFT) && !arch_is_platform_page(res.addr)) {
 		pr_err("Invalid address 0x%llx in IA32_MC%d_ADDR\n", mce->addr, mce->bank);
 		return NOTIFY_DONE;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-27 15:23       ` Adrian Hunter
@ 2025-06-27 15:25         ` Dave Hansen
  2025-06-27 16:24           ` Luck, Tony
  2025-06-27 16:28         ` Luck, Tony
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-27 15:25 UTC (permalink / raw)
  To: Adrian Hunter, Tony Luck
  Cc: vannapurve, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H Peter Anvin, linux-edac, linux-kernel, kvm,
	rick.p.edgecombe, kirill.shutemov, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao, pbonzini, seanjc

On 6/27/25 08:23, Adrian Hunter wrote:
> On 19/06/2025 14:57, Adrian Hunter wrote:
>> On 18/06/2025 17:55, Dave Hansen wrote:
>>> On 6/18/25 05:08, Adrian Hunter wrote:
>>>> --- a/arch/x86/kernel/cpu/mce/core.c
>>>> +++ b/arch/x86/kernel/cpu/mce/core.c
>>>> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>>>>  		 * be added to free list when the guest is terminated.
>>>>  		 */
>>>>  		if (mce_usable_address(m)) {
>>>> -			struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
>>>> +			unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
>>>> +			struct page *p = pfn_to_online_page(pfn);
>>> If ->addr isn't really an address that software can do much with,
>>> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()?
>> Would that mean no one would know if the mce addr had KeyID bits or not?

Uhhh, just store the KeyID separately. Have

	mce->addr
and
	mce->keyid

Problem solved.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-27 15:25         ` Dave Hansen
@ 2025-06-27 16:24           ` Luck, Tony
  2025-06-27 16:33             ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2025-06-27 16:24 UTC (permalink / raw)
  To: Hansen, Dave, Hunter, Adrian
  Cc: Annapurve, Vishal, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86@kernel.org, H Peter Anvin,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	kirill.shutemov@linux.intel.com, Huang, Kai, Chatre, Reinette,
	Li, Xiaoyao, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y,
	Gao, Chao, pbonzini@redhat.com, seanjc@google.com

>>> Would that mean no one would know if the mce addr had KeyID bits or not?
>
> Uhhh, just store the KeyID separately. Have
>
>	mce->addr
> and
>	mce->keyid
>
> Problem solved.

As the old saying goes: "Solve one problem and create three new ones".

"struct mce" is in user visible in arch/x86/include/uapi/asm/mce.h
(via /dev/mcelog).

So while we can add new fields (at the end of the structure) some thought
about the implications are needed.

We've been sending a combined key+address in the "mce->addr" to
user space for a while. Has anyone built infrastructure on top of that?

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-27 15:23       ` Adrian Hunter
  2025-06-27 15:25         ` Dave Hansen
@ 2025-06-27 16:28         ` Luck, Tony
  1 sibling, 0 replies; 35+ messages in thread
From: Luck, Tony @ 2025-06-27 16:28 UTC (permalink / raw)
  To: Hunter, Adrian, Hansen, Dave
  Cc: Annapurve, Vishal, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86@kernel.org, H Peter Anvin,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	kirill.shutemov@linux.intel.com, Huang, Kai, Chatre, Reinette,
	Li, Xiaoyao, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y,
	Gao, Chao, pbonzini@redhat.com, seanjc@google.com

> >> If ->addr isn't really an address that software can do much with,
> >> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()?
> >
> > Would that mean no one would know if the mce addr had KeyID bits or not?
>
> Current design, to keep the bits in mce addr, is from Tony's patch:
>
>       x86/mce: Mask out non-address bits from machine check bank
>       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a01ec97dc066009dd89e43bfcf55644f2dd6d19
>
> Assuming that is not altered, a tidy-up is still possible like:
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 6c77c03139f7..b469b7a7ecfa 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -386,4 +386,14 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c)           { }
>
>  unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len);
>
> +static inline unsigned long mce_addr_to_phys(u64 mce_addr)
> +{
> +     return mce_addr & MCI_ADDR_PHYSADDR;
> +}
> +
> +static inline unsigned long mce_addr_to_pfn(u64 mce_addr)
> +{
> +     return mce_addr_to_phys(mce_addr) >> PAGE_SHIFT;
> +}
> +
>  #endif /* _ASM_X86_MCE_H */

I like this. Can you write up a patch with a commit message please?

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-27 16:24           ` Luck, Tony
@ 2025-06-27 16:33             ` Dave Hansen
  2025-07-30 10:54               ` Adrian Hunter
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2025-06-27 16:33 UTC (permalink / raw)
  To: Luck, Tony, Hunter, Adrian
  Cc: Annapurve, Vishal, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86@kernel.org, H Peter Anvin,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	kirill.shutemov@linux.intel.com, Huang, Kai, Chatre, Reinette,
	Li, Xiaoyao, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y,
	Gao, Chao, pbonzini@redhat.com, seanjc@google.com

On 6/27/25 09:24, Luck, Tony wrote:
> We've been sending a combined key+address in the "mce->addr" to
> user space for a while. Has anyone built infrastructure on top of that?

I'm not sure they can do anything useful with an address that has the
KeyID in the first place. The partitioning scheme is in an MSR, so
they'd need to be doing silly gymnastics to even decode the address.

Userspace can deal with the KeyID not being in the address. It's been
the default for ages. So, if we take it back out, I'd expect it fixes
more things than it breaks.

So, yeah, we should carefully consider it. But it still 100% looks like
the right thing to me to detangle the KeyID and physical address in the ABI.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-06-27 16:33             ` Dave Hansen
@ 2025-07-30 10:54               ` Adrian Hunter
  2025-07-30 11:57                 ` Huang, Kai
  2025-07-30 14:20                 ` Vishal Annapurve
  0 siblings, 2 replies; 35+ messages in thread
From: Adrian Hunter @ 2025-07-30 10:54 UTC (permalink / raw)
  To: Dave Hansen, Luck, Tony, Annapurve, Vishal
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	x86@kernel.org, H Peter Anvin, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Edgecombe, Rick P, kirill.shutemov@linux.intel.com, Huang, Kai,
	Chatre, Reinette, Li, Xiaoyao, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y,
	Gao, Chao, pbonzini@redhat.com, seanjc@google.com

On 27/06/2025 19:33, Dave Hansen wrote:
> On 6/27/25 09:24, Luck, Tony wrote:
>> We've been sending a combined key+address in the "mce->addr" to
>> user space for a while. Has anyone built infrastructure on top of that?
> 
> I'm not sure they can do anything useful with an address that has the
> KeyID in the first place. The partitioning scheme is in an MSR, so
> they'd need to be doing silly gymnastics to even decode the address.
> 
> Userspace can deal with the KeyID not being in the address. It's been
> the default for ages. So, if we take it back out, I'd expect it fixes
> more things than it breaks.
> 
> So, yeah, we should carefully consider it. But it still 100% looks like
> the right thing to me to detangle the KeyID and physical address in the ABI.

Coming back to this after a bit of a break.

It feels unlikely to me that any users are expecting KeyID in mce->addr.

Looking at user space programs like mcelog and rasdaemon, gives the
impression that mce->addr contains only an address.

The UAPI header file describes addr as "Bank's MCi_ADDR MSR", but what
mce_read_aux() does tends to contradict that, especially for AMD
SMCA.

But there are also additional places where it seems like MCI_ADDR_PHYSADDR
is missing:

	tdx_dump_mce_info()
		paddr_is_tdx_private()
			__seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args)
				TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero

	skx_mce_output_error()
		edac_mc_handle_error()
			expects page_frame_number, so without KeyID

The KeyID is probably only useful for potentially identifying the TD, but
given that the TD incurs a FATAL error, that may be obvious anyway.

So removing the KeyID from mce->addr looks like the right thing to do.

Note AFAICT there are 3 kernel APIs that deal with the MCE address:

	Device /dev/mcelog which outputs struct mce
	Tracepoint mce:mce_record which outputs members from struct mce
	Tracepoint ras:mc_event where the kernel constructs the address
	from page_frame_number implying that KeyID should not be present

I guess it would be sensible to ask what customers think.

Vishal, do you know anyone at Google who deals with handling machine
check information, and who might have an opinion on this?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-07-30 10:54               ` Adrian Hunter
@ 2025-07-30 11:57                 ` Huang, Kai
  2025-07-30 14:20                 ` Vishal Annapurve
  1 sibling, 0 replies; 35+ messages in thread
From: Huang, Kai @ 2025-07-30 11:57 UTC (permalink / raw)
  To: Luck, Tony, Hunter, Adrian, Hansen, Dave, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, mingo@redhat.com, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	linux-edac@vger.kernel.org, hpa@zytor.com, seanjc@google.com,
	pbonzini@redhat.com, Edgecombe, Rick P, bp@alien8.de, Gao, Chao,
	x86@kernel.org

On Wed, 2025-07-30 at 13:54 +0300, Adrian Hunter wrote:
> But there are also additional places where it seems like MCI_ADDR_PHYSADDR
> is missing:
> 
> 	tdx_dump_mce_info()
> 		paddr_is_tdx_private()
> 			__seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args)
> 				TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero

This is only called in mce_panic() path, which basically means the #MC is
fatal, e.g., happens in kernel context.

The intention of this is to catch any #MC due to kernel bug (i.e.,
software issue, but not hardware error) which does partial write to TDX
private memory and read at a later time, and report a more precise
information to the user to point out this could be due to "possible kernel
bug".  See changelog of 70060463cb2b ("x86/mce: Differentiate real
hardware #MCs from TDX erratum ones").

In other words, for this case the address reported via MCI_ADDR_PHYSADDR
should not contain any KeyID bits since the kernel always uses keyID 0 to
read.

I believe the KeyID bits will only be appended to the physical address
reported in MCI_ADDR_PHYSADDR when the #MC was triggered from TDX guest,
i.e., when the CPU was accessing memory using TDX KeyID.  Such #MC is not
fatal and won't call into mce_panic().

That being said, for tdx_dump_mce_info(), while explicitly masking out
keyID bits in MCI_ADDR_PHYSADDR obviously doesn't hurt (or arguably better
in some way), it is not necessary AFAICT.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors in TDX/SEAM non-root mode
  2025-07-30 10:54               ` Adrian Hunter
  2025-07-30 11:57                 ` Huang, Kai
@ 2025-07-30 14:20                 ` Vishal Annapurve
  1 sibling, 0 replies; 35+ messages in thread
From: Vishal Annapurve @ 2025-07-30 14:20 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, Luck, Tony, Borislav Petkov, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86@kernel.org, H Peter Anvin,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	kirill.shutemov@linux.intel.com, Huang, Kai, Chatre, Reinette,
	Li, Xiaoyao, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Zhao, Yan Y,
	Gao, Chao, pbonzini@redhat.com, seanjc@google.com

On Wed, Jul 30, 2025 at 3:55 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 27/06/2025 19:33, Dave Hansen wrote:
> > On 6/27/25 09:24, Luck, Tony wrote:
> >> We've been sending a combined key+address in the "mce->addr" to
> >> user space for a while. Has anyone built infrastructure on top of that?
> >
> > I'm not sure they can do anything useful with an address that has the
> > KeyID in the first place. The partitioning scheme is in an MSR, so
> > they'd need to be doing silly gymnastics to even decode the address.
> >
> > Userspace can deal with the KeyID not being in the address. It's been
> > the default for ages. So, if we take it back out, I'd expect it fixes
> > more things than it breaks.
> >
> > So, yeah, we should carefully consider it. But it still 100% looks like
> > the right thing to me to detangle the KeyID and physical address in the ABI.
>
> Coming back to this after a bit of a break.
>
> It feels unlikely to me that any users are expecting KeyID in mce->addr.
>
> Looking at user space programs like mcelog and rasdaemon, gives the
> impression that mce->addr contains only an address.
>
> The UAPI header file describes addr as "Bank's MCi_ADDR MSR", but what
> mce_read_aux() does tends to contradict that, especially for AMD
> SMCA.
>
> But there are also additional places where it seems like MCI_ADDR_PHYSADDR
> is missing:
>
>         tdx_dump_mce_info()
>                 paddr_is_tdx_private()
>                         __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args)
>                                 TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero
>
>         skx_mce_output_error()
>                 edac_mc_handle_error()
>                         expects page_frame_number, so without KeyID
>
> The KeyID is probably only useful for potentially identifying the TD, but
> given that the TD incurs a FATAL error, that may be obvious anyway.
>
> So removing the KeyID from mce->addr looks like the right thing to do.
>
> Note AFAICT there are 3 kernel APIs that deal with the MCE address:
>
>         Device /dev/mcelog which outputs struct mce
>         Tracepoint mce:mce_record which outputs members from struct mce
>         Tracepoint ras:mc_event where the kernel constructs the address
>         from page_frame_number implying that KeyID should not be present
>
> I guess it would be sensible to ask what customers think.
>
> Vishal, do you know anyone at Google who deals with handling machine
> check information, and who might have an opinion on this?
>

I think it's safe to assume Google hasn't built any infra in the
userspace that needs KeyID bits in the mce address. That being said,
Dave's suggestion to "detangle the KeyID and physical address in the
ABI" makes sense to me.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-07-30 14:20 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18 12:08 [PATCH 0/2] Fixes for recovery for machine check in TDX/SEAM non-root mode Adrian Hunter
2025-06-18 12:08 ` [PATCH 1/2] x86/mce: Fix missing address mask in recovery for errors " Adrian Hunter
2025-06-18 12:36   ` Xiaoyao Li
2025-06-18 14:55   ` Dave Hansen
2025-06-19 11:57     ` Adrian Hunter
2025-06-27 15:23       ` Adrian Hunter
2025-06-27 15:25         ` Dave Hansen
2025-06-27 16:24           ` Luck, Tony
2025-06-27 16:33             ` Dave Hansen
2025-07-30 10:54               ` Adrian Hunter
2025-07-30 11:57                 ` Huang, Kai
2025-07-30 14:20                 ` Vishal Annapurve
2025-06-27 16:28         ` Luck, Tony
2025-06-18 23:20   ` Huang, Kai
2025-06-18 23:39     ` Luck, Tony
2025-06-18 23:46       ` Luck, Tony
2025-06-18 23:57         ` Huang, Kai
2025-06-18 23:53       ` Huang, Kai
2025-06-18 12:08 ` [PATCH 2/2] KVM: TDX: Do not clear poisoned pages Adrian Hunter
2025-06-18 12:39   ` Xiaoyao Li
2025-06-18 14:58   ` Dave Hansen
2025-06-25 14:33     ` Vishal Annapurve
2025-06-25 16:25       ` Adrian Hunter
2025-06-25 16:31         ` Dave Hansen
2025-06-25 16:42           ` Adrian Hunter
2025-06-25 16:57             ` Dave Hansen
2025-06-25 16:42         ` Edgecombe, Rick P
2025-06-25 22:32         ` Huang, Kai
2025-06-25 22:38           ` Dave Hansen
2025-06-26  1:19             ` Huang, Kai
2025-06-26 15:31               ` Luck, Tony
2025-06-26 22:20                 ` Huang, Kai
2025-06-26 22:33                   ` Dave Hansen
2025-06-27  0:56                     ` Huang, Kai
2025-06-18 23:09   ` Huang, Kai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).