Linux EDAC development
* [RFC] AMD VM crashing on deferred memory error injection
@ 2026-02-09 16:36 William Roche
  2026-02-09 17:36 ` Borislav Petkov
  2026-02-09 21:08 ` Yazen Ghannam
  0 siblings, 2 replies; 11+ messages in thread
From: William Roche @ 2026-02-09 16:36 UTC (permalink / raw)
  To: Ghannam, Yazen, Tony Luck, bp, Thomas Gleixner, mingo,
	dave.hansen, x86, hpa
  Cc: Allen, John, linux-edac, linux-kernel, Jane Chu

Hello,

I'd like to bring to your attention a consequence of the integration of
this set of commits early into the 6.19 kernel:

   2025-11-04 14:55 [PATCH v8 0/8] AMD MCA interrupts rework
  
https://lore.kernel.org/all/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com/

Yazen Ghannam (7):
       x86/mce: Unify AMD THR handler with MCA Polling
       x86/mce: Unify AMD DFR handler with MCA Polling
       x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
       x86/mce/amd: Support SMCA Corrected Error Interrupt
       x86/mce/amd: Remove redundant reset_block()
       x86/mce/amd: Define threshold restart function for banks
       x86/mce: Save and use APEI corrected threshold limit


An AMD Qemu VM running this kernel is no longer able to deal with the
injection of a deferred memory error, and crashes with:

[  333.420854] mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
[  333.428105] Call Trace:
[  333.429566]  <IRQ>
[  333.430745]  amd_clear_bank+0x6e/0x70
[  333.432828]  machine_check_poll+0x228/0x2e0
[  333.435068]  ? __pfx_mce_timer_fn+0x10/0x10
[  333.437241]  mce_timer_fn+0xb1/0x130
[  333.438966]  ? __pfx_mce_timer_fn+0x10/0x10
[  333.441380]  call_timer_fn+0x26/0x120
[  333.443518]  __run_timers+0x202/0x290
[  333.445763]  run_timer_softirq+0x49/0x100
[  333.447908]  handle_softirqs+0xeb/0x2c0
[  333.449863]  __irq_exit_rcu+0xda/0x100
[  333.452065]  sysvec_apic_timer_interrupt+0x71/0x90
[  333.454846]  </IRQ>
[  333.456192]  <TASK>
[  333.457520]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  333.460355] RIP: 0010:pv_native_safe_halt+0xf/0x20
[  333.463203] Code: 20 d0 e9 5f 99 e6 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 ee 18 00 fb f4 <e9> 37 990
[  333.472816] RSP: 0018:ffffffff83403e78 EFLAGS: 00000246
[  333.475848] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  333.479481] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  333.483492] RBP: ffffffff83412980 R08: 0000000000000000 R09: 0000000000000000
[  333.487503] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  333.491482] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000947d0
[  333.495258]  default_idle+0x9/0x30
[  333.497283]  default_idle_call+0x28/0x100
[  333.499641]  cpuidle_idle_call+0x12e/0x180
[  333.502087]  do_idle+0x77/0xb0
[  333.503914]  cpu_startup_entry+0x29/0x30
[  333.506337]  rest_init+0xcc/0xd0
[  333.508296]  start_kernel+0x4df/0x4e0
[  333.510491]  x86_64_start_reservations+0x32/0x40
[  333.513101]  x86_64_start_kernel+0xce/0xd0
[  333.515433]  common_startup_64+0x13e/0x141
[  333.517920]  </TASK>
[  333.519468] Kernel panic - not syncing: MCA architectural violation!


The problem appeared with the addition of clearing MCA_DESTAT for all
deferred errors in the amd_clear_bank() function by this kernel commit:

     7cb735d7c0cb  x86/mce: Unify AMD DFR handler with MCA Polling

+       /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
+       if (m->status & MCI_STATUS_DEFERRED)
+               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
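
For reference, the MSR address reported in the trace can be related to the
bank number with a little arithmetic. The sketch below is only an
illustration: it assumes the SMCA layout used by the Linux headers (a
0x10-wide MSR window per bank, with MCA_DESTAT of bank 0 at 0xc0002008), and
the helper names are made up for this example. Under those assumptions,
0xc0002098 is the MCA_DESTAT register of bank 9.

```c
#include <stdint.h>

/*
 * Assumed layout, mirroring Linux's msr-index.h: each SMCA bank owns a
 * 0x10-wide MSR window and MCA_DESTAT of bank 0 is at 0xc0002008.
 * The names below are hypothetical and exist only for this sketch.
 */
#define SMCA_MC0_DESTAT   0xc0002008u
#define SMCA_BANK_STRIDE  0x10u

/* Address of MCA_DESTAT for a given bank. */
static inline uint32_t smca_destat_msr(unsigned int bank)
{
	return SMCA_MC0_DESTAT + SMCA_BANK_STRIDE * bank;
}

/*
 * Returns the bank number if msr falls on an MCA_DESTAT slot, else -1.
 * No upper bound on the bank number is checked in this sketch.
 */
static inline int smca_destat_bank(uint32_t msr)
{
	if (msr < SMCA_MC0_DESTAT ||
	    (msr - SMCA_MC0_DESTAT) % SMCA_BANK_STRIDE)
		return -1;

	return (int)((msr - SMCA_MC0_DESTAT) / SMCA_BANK_STRIDE);
}
```

Calling smca_destat_bank(0xc0002098) with this layout gives the bank whose
MCA_DESTAT write was rejected in the trace.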


The Qemu AMD implementation of MCE injection for deferred errors relies
on machine_check_poll() picking up these errors, as indicated in this
Qemu change:
     4b77512b2782  i386: Fix MCE support for AMD hosts
https://lore.kernel.org/qemu-devel/20240603193622.47156-2-john.allen@amd.com/


When a Qemu process receives the SIGBUS information from the host, it
generates a virtual MCE to be dealt with by machine_check_poll() in the
VM kernel. But clearing MCA_DESTAT doesn't seem to be allowed and
triggers an exception, which looks like a kernel & AMD SMCA contract
mismatch (?)

So should we consider that the Qemu platform has to allow the change or
is the kernel missing guards around clearing this MCA bank after
injected UEs on this platform ?


FYI, to reproduce the problem:
. I used a QEMU Standard PC q35:

qemu-system-x86_64 --version
QEMU emulator version 10.2.50 (v10.2.0-1085-gcd5a79dc98)
Copyright (c) 2003-2026 Fabrice Bellard and the QEMU Project developers

qemu-system-x86_64 -smp 4 -m 20G -enable-kvm -cpu host -usb \
	-device usb-tablet -serial mon:stdio -M q35 \
	-nic user,model=e1000,hostfwd=tcp::60022-:22 -nographic \
	-drive file=disk.qcow2,cache=none

. Inject an error into this VM running a 6.19.0-rc1 or more recent kernel.
 From the host:
# modprobe hwpoison-inject
# echo <pfn> > /sys/kernel/debug/hwpoison/corrupt-pfn

Wait 5 minutes until the deferred error is handled by the VM kernel, and
the VM then crashes with the above stack trace...


. But removing the reset of MCA_DESTAT in the kernel amd_clear_bank()
function or adding this simple test makes the system work again as
before:


diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index d9f9ee7db5c8..86b3070fbb40 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -860,7 +860,7 @@ void amd_clear_bank(struct mce *m)
         amd_reset_thr_limit(m->bank);

         /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
-       if (m->status & MCI_STATUS_DEFERRED)
+       if (m->status & MCI_STATUS_DEFERRED && !(m->status & MCI_STATUS_POISON))
                 mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);

         /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */



In my opinion, this small kernel fix relies too much on a Qemu AMD
specific implementation detail.

Would you have a more appropriate fix to suggest please ?

Thanks in advance for your feedback.
William.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 16:36 [RFC] AMD VM crashing on deferred memory error injection William Roche
@ 2026-02-09 17:36 ` Borislav Petkov
  2026-02-09 17:38   ` Luck, Tony
  2026-02-09 21:08 ` Yazen Ghannam
  1 sibling, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2026-02-09 17:36 UTC (permalink / raw)
  To: William Roche
  Cc: Ghannam, Yazen, Tony Luck, Thomas Gleixner, mingo, dave.hansen,
	x86, hpa, Allen, John, linux-edac, linux-kernel, Jane Chu

On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
> An AMD Qemu VM running this kernel is no longer able to deal with the
> injection of a deferred memory error, and crashes with:
> 
> [  333.420854] mce: MSR access error: WRMSR to 0xc0002098 (tried to write
> 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
> [  333.428105] Call Trace:

Works as advertised - KVM is not allowing the MSR write.

This enablement is not meant for VM use. Why do we care about injecting hw
errors in a guest?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 17:36 ` Borislav Petkov
@ 2026-02-09 17:38   ` Luck, Tony
  2026-02-09 17:53     ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Luck, Tony @ 2026-02-09 17:38 UTC (permalink / raw)
  To: Borislav Petkov, William Roche
  Cc: Ghannam, Yazen, Thomas Gleixner, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Allen, John, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jane Chu

> This enablement is not meant for VM use. Why do we care about injecting hw
> errors in a guest?

The guest may be able to just kill a process and keep running.

-Tony

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 17:38   ` Luck, Tony
@ 2026-02-09 17:53     ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2026-02-09 17:53 UTC (permalink / raw)
  To: Luck, Tony
  Cc: William Roche, Ghannam, Yazen, Thomas Gleixner, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Allen, John, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jane Chu

On Mon, Feb 09, 2026 at 05:38:58PM +0000, Luck, Tony wrote:
> > This enablement is not meant for VM use. Why do we care about injecting hw
> > errors in a guest?
> 
> The guest may be able to just kill a process and keep running.

I have heard about injecting errors into qemu/kvm perhaps a decade ago and
nothing ever since. Either it has been working perfectly since then or no one
cares until now.

So the guest "may" be able to do a lot of things - question is, do we support
it and how do we test for it in the future so that it doesn't break.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 16:36 [RFC] AMD VM crashing on deferred memory error injection William Roche
  2026-02-09 17:36 ` Borislav Petkov
@ 2026-02-09 21:08 ` Yazen Ghannam
  2026-02-09 21:18   ` Yazen Ghannam
  1 sibling, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2026-02-09 21:08 UTC (permalink / raw)
  To: William Roche
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
> Hello,
> 
> I'd like to bring to your attention a consequence of the integration of
> this set of commits early into the 6.19 kernel:
> 
>   2025-11-04 14:55 [PATCH v8 0/8] AMD MCA interrupts rework
> https://lore.kernel.org/all/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com/
> 
> Yazen Ghannam (7):
>       x86/mce: Unify AMD THR handler with MCA Polling
>       x86/mce: Unify AMD DFR handler with MCA Polling
>       x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
>       x86/mce/amd: Support SMCA Corrected Error Interrupt
>       x86/mce/amd: Remove redundant reset_block()
>       x86/mce/amd: Define threshold restart function for banks
>       x86/mce: Save and use APEI corrected threshold limit
> 
> 
> An AMD Qemu VM running this kernel is no longer able to deal with the
> injection of a deferred memory error, and crashes with:
> 
> [  333.420854] mce: MSR access error: WRMSR to 0xc0002098 (tried to write
> 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
> [  333.428105] Call Trace:
> 
> [  333.429566]  <IRQ>
> 
> [  333.430745]  amd_clear_bank+0x6e/0x70
> 
> [  333.432828]  machine_check_poll+0x228/0x2e0
> 
> [  333.435068]  ? __pfx_mce_timer_fn+0x10/0x10
> 
> [  333.437241]  mce_timer_fn+0xb1/0x130
> 
> [  333.438966]  ? __pfx_mce_timer_fn+0x10/0x10
> 
> [  333.441380]  call_timer_fn+0x26/0x120
> 
> [  333.443518]  __run_timers+0x202/0x290
> 
> [  333.445763]  run_timer_softirq+0x49/0x100
> 
> [  333.447908]  handle_softirqs+0xeb/0x2c0
> 
> [  333.449863]  __irq_exit_rcu+0xda/0x100
> 
> [  333.452065]  sysvec_apic_timer_interrupt+0x71/0x90
> 
> [  333.454846]  </IRQ>
> 
> [  333.456192]  <TASK>
> 
> [  333.457520]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  333.460355] RIP: 0010:pv_native_safe_halt+0xf/0x20
> [  333.463203] Code: 20 d0 e9 5f 99 e6 fe 0f 1f 40 00 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 ee 18 00 fb f4 <e9>
> 37 990
> [  333.472816] RSP: 0018:ffffffff83403e78 EFLAGS: 00000246
> [  333.475848] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 0000000000000000
> [  333.479481] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000000
> [  333.483492] RBP: ffffffff83412980 R08: 0000000000000000 R09:
> 0000000000000000
> [  333.487503] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000000
> [  333.491482] R13: 0000000000000000 R14: 0000000000000000 R15:
> 00000000000947d0
> [  333.495258]  default_idle+0x9/0x30
> [  333.497283]  default_idle_call+0x28/0x100
> [  333.499641]  cpuidle_idle_call+0x12e/0x180
> [  333.502087]  do_idle+0x77/0xb0
> [  333.503914]  cpu_startup_entry+0x29/0x30
> [  333.506337]  rest_init+0xcc/0xd0
> [  333.508296]  start_kernel+0x4df/0x4e0
> [  333.510491]  x86_64_start_reservations+0x32/0x40
> [  333.513101]  x86_64_start_kernel+0xce/0xd0
> [  333.515433]  common_startup_64+0x13e/0x141
> [  333.517920]  </TASK>
> [  333.519468] Kernel panic - not syncing: MCA architectural violation!
> 
> 
> The problem appeared with the addition of clearing MCA_DESTAT for all
> deferred errors in the amd_clear_bank() function by this kernel commit:
> 
>     7cb735d7c0cb  x86/mce: Unify AMD DFR handler with MCA Polling
> 
> +       /* Clear MCA_DESTAT for all deferred errors even those logged in
> MCA_STATUS. */
> +       if (m->status & MCI_STATUS_DEFERRED)
> +               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
> 
> 
> Where a Qemu AMD implementation of MCE injection for deferred errors
> relies on machine_check_poll() picking up these errors.
> As indicated in Qemu change:
>     4b77512b2782  i386: Fix MCE support for AMD hosts
> https://lore.kernel.org/qemu-devel/20240603193622.47156-2-john.allen@amd.com/
> 
> 
> When a Qemu process receives the SIGBUS information from the host, it
> generates a virtual MCE to be dealt by the VM kernel machine_check_poll().
> But clearing MCA_DESTAT doesn't seem to be allowed and triggers an
> exception. Which looks like a kernel & AMD SMCA contract mismatch (?)
> 
> So should we consider that the Qemu platform has to allow the change or
> is the kernel missing guards around clearing this MCA bank after
> injected UEs on this platform ?
> 
> 
> FYI, to reproduce the problem:
> . I used a QEMU Standard PC q35:
> 
> qemu-system-x86_64 --version
> QEMU emulator version 10.2.50 (v10.2.0-1085-gcd5a79dc98)
> Copyright (c) 2003-2026 Fabrice Bellard and the QEMU Project developers
> 
> qemu-system-x86_64 -smp 4 -m 20G -enable-kvm -cpu host -usb \
> 	-device usb-tablet -serial mon:stdio -M q35 \
> 	-nic user,model=e1000,hostfwd=tcp::60022-:22 -nographic \
> 	-drive file=disk.qcow2,cache=none
> 
> . Inject an error into this VM running a 6.19.0-rc1 or more recent kernel.
> From the host:
> # modprobe hwpoison-inject
> # echo <pfn> > /sys/kernel/debug/hwpoison/corrupt-pfn
> 
> Wait 5 minutes until the deferred error is handled by the VM kernel, and
> the VM then crashes with the above stack trace...
> 
> 
> . But removing the reset of MCA_DESTAT in the kernel amd_clear_bank()
> function or adding this simple test makes the system work again as
> before:
> 
> 
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index d9f9ee7db5c8..86b3070fbb40 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -860,7 +860,7 @@ void amd_clear_bank(struct mce *m)
>         amd_reset_thr_limit(m->bank);
> 
>         /* Clear MCA_DESTAT for all deferred errors even those logged in
> MCA_STATUS. */
> -       if (m->status & MCI_STATUS_DEFERRED)
> +       if (m->status & MCI_STATUS_DEFERRED && !(m->status &
> MCI_STATUS_POISON))
>                 mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
> 
>         /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
> 
> 
> 
> According to me, this small kernel fix relies too much on a Qemu AMD
> specific implementation detail.
> 
> Would you have a more appropriate fix to suggest please ?
> 
> Thanks in advance for your feedback.
> William.

Thanks William for the report and details.

Clearing "STATUS" registers is a normal part of MCA handling.

We seem to allow clearing the regular "MCi_STATUS" register. I assume
this gets trapped/ignored by the hypervisor.

I expect we need to do the same behavior for the "MCA_DESTAT" register.

I'll do some research here, but please do share any pointers you may
have.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 21:08 ` Yazen Ghannam
@ 2026-02-09 21:18   ` Yazen Ghannam
  2026-02-11  1:42     ` William Roche
  0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2026-02-09 21:18 UTC (permalink / raw)
  To: William Roche
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
> On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:

[...]

> > According to me, this small kernel fix relies too much on a Qemu AMD
> > specific implementation detail.
> > 
> > Would you have a more appropriate fix to suggest please ?
> > 
> > Thanks in advance for your feedback.
> > William.
> 
> Thanks William for the report and details.
> 
> Clearing "STATUS" registers is a normal part of MCA handling.
> 
> We seem to allow clearing the regular "MCi_STATUS" register. I assume
> this gets trapped/ignored by the hypervisor.
> 
> I expect we need to do the same behavior for the "MCA_DESTAT" register.
> 
> I'll do some research here, but please do share any pointers you may
> have.

Sorry for the rapid reply, but I think this is where we need an update.

Linux:
arch/x86/kvm/x86.c : set_msr_mce()

Please note the comment:
"All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."

We should include the MCA_DESTAT register range here.

What do you think?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-09 21:18   ` Yazen Ghannam
@ 2026-02-11  1:42     ` William Roche
  2026-02-11 16:34       ` Yazen Ghannam
  0 siblings, 1 reply; 11+ messages in thread
From: William Roche @ 2026-02-11  1:42 UTC (permalink / raw)
  To: Yazen Ghannam, Tony Luck, bp
  Cc: Thomas Gleixner, mingo, dave.hansen, x86, hpa, Allen, John,
	linux-edac, linux-kernel, Jane Chu

On 2/9/26 22:18, Yazen Ghannam wrote:
> On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
>> On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
> 
> [...]
> 
>>> According to me, this small kernel fix relies too much on a Qemu AMD
>>> specific implementation detail.
>>>
>>> Would you have a more appropriate fix to suggest please ?
>>>
>>> Thanks in advance for your feedback.
>>> William.
>>
>> Thanks William for the report and details.
>>
>> Clearing "STATUS" registers is a normal part of MCA handling.
>>
>> We seem to allow clearing the regular "MCi_STATUS" register. I assume
>> this gets trapped/ignored by the hypervisor.
>>
>> I expect we need to do the same behavior for the "MCA_DESTAT" register.
>>
>> I'll do some research here, but please do share any pointers you may
>> have.

Yazen, I'm simply trying to find an answer in the AMD64 Architecture 
Programmer's Manual, Volume 2: System Programming, 24593

This document indicates (in chapter 9.3.3.4, MCA Deferred Error Status
Register) that:
"When the deferred error has been processed by the deferred error
handler, MCA_DESTAT should be cleared. If MCA_STATUS also contains a
deferred error, MCA_STATUS should be cleared."

So I would imagine that allowing the reset of MCA_DESTAT the same way as 
MCA_STATUS should be what the platform has to allow (or ignore).

> 
> Sorry for the rapid reply, but I think this is where we need an update.
> 
> Linux:
> arch/x86/kvm/x86.c : set_msr_mce()
> 
> Please note the comment:
> "All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."
> 
> We should include the MCA_DESTAT register range here.
> 
> What do you think?

But before trying to update the set_msr_mce() function, I don't think
that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
banks like MCA_DESTAT MSRs.

So if we make kvm ignore this update instead of raising a #GP error,
would it be a valid solution ?

Thanks,
William.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-11  1:42     ` William Roche
@ 2026-02-11 16:34       ` Yazen Ghannam
  2026-02-12 15:36         ` William Roche
  0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2026-02-11 16:34 UTC (permalink / raw)
  To: William Roche
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On Wed, Feb 11, 2026 at 02:42:07AM +0100, William Roche wrote:
> On 2/9/26 22:18, Yazen Ghannam wrote:
> > On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
> > > On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
> > 
> > [...]
> > 
> > > > According to me, this small kernel fix relies too much on a Qemu AMD
> > > > specific implementation detail.
> > > > 
> > > > Would you have a more appropriate fix to suggest please ?
> > > > 
> > > > Thanks in advance for your feedback.
> > > > William.
> > > 
> > > Thanks William for the report and details.
> > > 
> > > Clearing "STATUS" registers is a normal part of MCA handling.
> > > 
> > > We seem to allow clearing the regular "MCi_STATUS" register. I assume
> > > this gets trapped/ignored by the hypervisor.
> > > 
> > > I expect we need to do the same behavior for the "MCA_DESTAT" register.
> > > 
> > > I'll do some research here, but please do share any pointers you may
> > > have.
> 
> Yazen, I'm simply trying to find an answer in the AMD64 Architecture
> Programmer's Manual, Volume 2: System Programming, 24593
> 
> This documents indicates (In chapter 9.3.3.4 MCA Deferred Error Status
> Register) that:
> "When the deferred error has been processed by the deferred error handler,
> MCA_DESTAT should be
> cleared. If MCA_STATUS also contains a deferred error, MCA_STATUS should be
> cleared."
> 
> So I would imagine that allowing the reset of MCA_DESTAT the same way as
> MCA_STATUS should be what the platform has to allow (or ignore).
> 

Yes, that's what I gathered too.

> > 
> > Sorry for the rapid reply, but I think this is where we need an update.
> > 
> > Linux:
> > arch/x86/kvm/x86.c : set_msr_mce()
> > 
> > Please note the comment:
> > "All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."
> > 
> > We should include the MCA_DESTAT register range here.
> > 
> > What do you think?
> 
> But before trying to update the set_msr_mce() function, I don't think
> that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
> I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
> locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
> banks like MCA_DESTAT MSRs.
> 
> So if we make kvm ignore this update instead of raising a #GP error,
> would it be a valid solution ?
> 

Yes, I think so. And the details depend on how much of the platform
needs to be emulated.

Some ideas in increasing order of complexity:

1) Ignore this register write.

2) Do a basic validity check.
   Allow "write 0 to MCA_DESTAT" and #GP for any other value.
   Don't need to save MCA_DESTAT values.

3) Replicate the full platform behavior akin to MCi_STATUS.
   Would need to save MCA_DESTAT values and do a "HWCR" bit check.

I really don't think we want #3. This would be useful for "register-based
error injection/simulation". But that use case wouldn't do much with the
MCA_DESTAT register without all the related Deferred error interrupt
infrastructure.

So I say the choice is between #1 and #2.
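
To make the difference between #1 and #2 concrete, here is a minimal
standalone sketch of the decision a hypervisor would take on a guest WRMSR
to MCA_DESTAT. It is not KVM code: the enum, the function name and the
boolean return convention are all hypothetical, and it only models the
"accept or #GP" outcome without storing any register value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy selector, for this sketch only. */
enum destat_policy {
	DESTAT_IGNORE_WRITES,	/* idea #1: accept and drop any value */
	DESTAT_ALLOW_ZERO_ONLY,	/* idea #2: accept 0, fault on anything else */
};

/*
 * Returns true if the guest write should complete silently, false if
 * the hypervisor should inject #GP. Nothing is saved in either case,
 * so a later read would still have to be handled separately.
 */
static bool destat_write_ok(enum destat_policy policy, uint64_t data)
{
	if (policy == DESTAT_IGNORE_WRITES)
		return true;

	return data == 0;
}
```

Both policies let the "write 0 to clear" done by the guest kernel succeed;
they only differ for non-zero values, which #2 rejects.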

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-11 16:34       ` Yazen Ghannam
@ 2026-02-12 15:36         ` William Roche
  2026-02-12 19:30           ` Yazen Ghannam
  0 siblings, 1 reply; 11+ messages in thread
From: William Roche @ 2026-02-12 15:36 UTC (permalink / raw)
  To: Yazen Ghannam
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On 2/11/26 17:34, Yazen Ghannam wrote:
> On Wed, Feb 11, 2026 at 02:42:07AM +0100, William Roche wrote:
>> On 2/9/26 22:18, Yazen Ghannam wrote:
>>> On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
>>> [...]
>>> Linux:
>>> arch/x86/kvm/x86.c : set_msr_mce()
>>>
>>> Please note the comment:
>>> "All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."
>>>
>>> We should include the MCA_DESTAT register range here.
>>>
>>> What do you think?
>>
>> But before trying to update the set_msr_mce() function, I don't think
>> that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
>> I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
>> locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
>> banks like MCA_DESTAT MSRs.
>>
>> So if we make kvm ignore this update instead of raising a #GP error,
>> would it be a valid solution ?
>>
> 
> Yes, I think so. And the details depend on how much of the platform
> needs to be emulated.
> 
> Some ideas in increasing order of complexity:
> 
> 1) Ignore this register write.
> 
> 2) Do a basic validity check.
>     Allow "write 0 to MCA_DESTAT" and #GP for any other value.
>     Don't need to save MCA_DESTAT values.
> 
> 3) Replicate the full platform behavior akin to MCi_STATUS.
>     Would need to save MCA_DESTAT values and do a "HWCR" bit check.
> 
> I really don't think we want #3. This would be useful for "register-based
> error injection/simulation". But that use case wouldn't do much with the
> MCA_DESTAT register without all the related Deferred error interrupt
> infrastructure.
> 
> So I say the choice is between #1 and #2.


Thinking more about the problem introduced by your commit, I realized
that only SMCA systems have MCA_DESTAT registers. So we should not
allow access to this register from a non-SMCA machine.
And a Qemu AMD VM is an example of a non-SMCA machine!

So in my opinion, modifying the hypervisor (kvm) to allow the access
to MCA_DESTAT is clearly not the right move.

We probably should implement an entire SMCA stack for Qemu, but this
is another topic...
For the moment, Borislav Petkov was right when he said that kvm works
as advertised. The problem that your fix introduced is that it tries to
access SMCA-only registers on a non-SMCA machine.

Do you agree on this aspect ?

If yes, then the correct change is to test if we are on an SMCA machine
before accessing this register, like:

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 3f1dda355307..8664ba048a62 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -875,14 +875,18 @@ void amd_clear_bank(struct mce *m)
  {
         amd_reset_thr_limit(m->bank);

-       /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
-       if (m->status & MCI_STATUS_DEFERRED)
-               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
-
-       /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
-       if (m->kflags & MCE_CHECK_DFR_REGS)
-               return;
+       if (mce_flags.smca) {
+               /*
+                * Clear MCA_DESTAT for all deferred errors even those
+                * logged in MCA_STATUS.
+                */
+               if (m->status & MCI_STATUS_DEFERRED)
+                       mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);

+               /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
+               if (m->kflags & MCE_CHECK_DFR_REGS)
+                       return;
+       }
         mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
  }


I haven't noticed any obvious other non SMCA limitation in the other
changes of this series, but if you think about any other case, we can
probably fix all of them together.

If you agree with this change I can submit it as a formal PATCH.

Thanks in advance for your feedback.
William.



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-12 15:36         ` William Roche
@ 2026-02-12 19:30           ` Yazen Ghannam
  2026-02-13 16:55             ` William Roche
  0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2026-02-12 19:30 UTC (permalink / raw)
  To: William Roche
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On Thu, Feb 12, 2026 at 04:36:47PM +0100, William Roche wrote:
> On 2/11/26 17:34, Yazen Ghannam wrote:
> > On Wed, Feb 11, 2026 at 02:42:07AM +0100, William Roche wrote:
> > > On 2/9/26 22:18, Yazen Ghannam wrote:
> > > > On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
> > > > [...]
> > > > Linux:
> > > > arch/x86/kvm/x86.c : set_msr_mce()
> > > > 
> > > > Please note the comment:
> > > > "All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."
> > > > 
> > > > We should include the MCA_DESTAT register range here.
> > > > 
> > > > What do you think?
> > > 
> > > But before trying to update the set_msr_mce() function, I don't think
> > > that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
> > > I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
> > > locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
> > > banks like MCA_DESTAT MSRs.
> > > 
> > > So if we make kvm ignore this update instead of raising a #GP error,
> > > would it be a valid solution ?
> > > 
> > 
> > Yes, I think so. And the details depend on how much of the platform
> > needs to be emulated.
> > 
> > Some ideas in increasing order of complexity:
> > 
> > 1) Ignore this register write.
> > 
> > 2) Do a basic validity check.
> >     Allow "write 0 to MCA_DESTAT" and #GP for any other value.
> >     Don't need to save MCA_DESTAT values.
> > 
> > 3) Replicate the full platform behavior akin to MCi_STATUS.
> >     Would need to save MCA_DESTAT values and do a "HWCR" bit check.
> > 
> > I really don't think we want #3. This would be useful for "register-based
> > error injection/simulation". But that use case wouldn't do much with the
> > MCA_DESTAT register without all the related Deferred error interrupt
> > infrastructure.
> > 
> > So I say the choice is between #1 and #2.
> 
> 
> Thinking more about the problem introduced by your commit, I realized
> that only SMCA systems have MCA_DESTAT registers. So we should not
> allow access to this register from a non SMCA machine.
>  And Qemu AMD VM is an example of a non SMCA machine !
> 

So the SMCA CPUID bit is not advertised in this model?

> So according to me, modifying the hypervisor kvm to allow the access
> to MCA_DESTAT is clearly not the right move.
> 
> We probably should implement an entire SMCA stack for Qemu, but this
> is another topic...
> For the moment, Borislav Petkov was right when he said that kvm works
> as advertised. The problem that your fix introduced is that it tries to
> access SMCA-only registers on a non-SMCA machine.
> 
> Do you agree on this aspect ?
> 

Yes, I agree.

AMD systems generally have a Read-as-Zero/Writes-Ignored behavior when
accessing unimplemented MCA registers. But this requires the system to
recognize the register space.

In this case, the register space is totally unknown to the system, so it
responds with a #GP.

> If yes, then the correct change is to test if we are on an SMCA machine
> before accessing this register, like:
> 
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 3f1dda355307..8664ba048a62 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -875,14 +875,18 @@ void amd_clear_bank(struct mce *m)
>  {
>         amd_reset_thr_limit(m->bank);
> 
> -       /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
> -       if (m->status & MCI_STATUS_DEFERRED)
> -               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
> -
> -       /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
> -       if (m->kflags & MCE_CHECK_DFR_REGS)
> -               return;
> +       if (mce_flags.smca) {
> +               /*
> +                * Clear MCA_DESTAT for all deferred errors even those
> +                * logged in MCA_STATUS.
> +                */
> +               if (m->status & MCI_STATUS_DEFERRED)
> +                       mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
> 
> +               /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
> +               if (m->kflags & MCE_CHECK_DFR_REGS)
> +                       return;
> +       }
>         mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
>  }
> 
> 
> I haven't noticed any obvious other non SMCA limitation in the other
> changes of this series, but if you think about any other case, we can
> probably fix all of them together.
> 
> If you agree with this change I can submit it as a formal PATCH.
> 

I think this change is fair. It could be minimized further by adding the
SMCA check to the status bit check for the WRMSR step.

	if (mce_flags.smca && (m->status & MCI_STATUS_DEFERRED))
		mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);

Thanks,
Yazen


* Re: [RFC] AMD VM crashing on deferred memory error injection
  2026-02-12 19:30           ` Yazen Ghannam
@ 2026-02-13 16:55             ` William Roche
  0 siblings, 0 replies; 11+ messages in thread
From: William Roche @ 2026-02-13 16:55 UTC (permalink / raw)
  To: Yazen Ghannam
  Cc: Tony Luck, bp, Thomas Gleixner, mingo, dave.hansen, x86, hpa,
	Allen, John, linux-edac, linux-kernel, Jane Chu

On 2/12/26 20:30, Yazen Ghannam wrote:
> On Thu, Feb 12, 2026 at 04:36:47PM +0100, William Roche wrote:
>> [...]
>> Thinking more about the problem introduced by your commit, I realized
>> that only SMCA systems have MCA_DESTAT registers. So we should not
>> allow access to this register from a non SMCA machine.
>>   And Qemu AMD VM is an example of a non SMCA machine !
>>
> 
> So the SMCA CPUID bit is not advertised in this model?

No it's not advertised -- only succor and overflow-recov.
As implemented by John Allen to support relaying deferred errors:
https://lore.kernel.org/qemu-devel/20240603193622.47156-1-john.allen@amd.com/


>> So according to me, modifying the hypervisor kvm to allow the access
>> to MCA_DESTAT is clearly not the right move.
>>
>> We probably should implement an entire SMCA stack for Qemu, but this
>> is another topic...
>> For the moment, Borislav Petkov was right when he said that kvm works
>> as advertised. The problem that your fix introduced is that it tries to
>> access SMCA-only registers on a non-SMCA machine.
>>
>> Do you agree on this aspect ?
>>
> 
> Yes, I agree.

Thanks


> AMD systems generally have a Read-as-Zero/Writes-Ignored behavior when
> accessing unimplemented MCA registers. But this requires the system to
> recognize the register space.
> 
> In this case, the register space is totally unknown to the system, so it
> responds with a #GP.

I understand what you mean about the platform permissiveness. This could
be a valid change to the kvm layer, so that kernel code mistakenly
accessing SMCA registers on a non-SMCA virtual machine does not panic.
Of course, if all AMD hardware works like that, I agree that the virtual
AMD platforms should be aligned -- but as you indicated, which register
space to treat as Read-as-Zero/Writes-Ignored is still unclear (to me).
Even if we manage to do that in an updated KVM layer, a non-SMCA VM
running on top of an older host kvm layer would still panic if its
kernel accesses SMCA registers. So I'm convinced that making sure the
kernel accesses SMCA registers only on SMCA machines is the best
approach.
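
The host/guest combinations behind this argument can be written down as a small truth table. This is only an illustrative userspace sketch: non_smca_guest_clear() and its two parameters are hypothetical names, and the model assumes a host either tolerates the MCA_DESTAT write or answers it with a #GP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome for a non-SMCA guest clearing a deferred error. */
enum guest_outcome { GUEST_OK, GUEST_GP };

/*
 * host_tolerates_destat: the host accepts (or ignores) the MCA_DESTAT write,
 *                        e.g. an updated kvm layer.
 * guest_checks_smca:     the guest kernel tests for SMCA before touching
 *                        SMCA-only registers.
 */
enum guest_outcome non_smca_guest_clear(bool host_tolerates_destat,
					bool guest_checks_smca)
{
	/* A guest that checks for SMCA never issues the write at all. */
	if (guest_checks_smca)
		return GUEST_OK;

	/* Otherwise the result depends entirely on the host it runs on. */
	return host_tolerates_destat ? GUEST_OK : GUEST_GP;
}

/* The guest-side check gives the same result on old and new hosts. */
void assert_guest_check_is_host_independent(void)
{
	assert(non_smca_guest_clear(false, true) == GUEST_OK);
	assert(non_smca_guest_clear(true, true) == GUEST_OK);
}
```

In this model only the guest-side check is independent of the host version: without it, the guest still takes the #GP whenever it runs on a host that does not tolerate the write.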

> 
>> If yes, then the correct change is to test if we are on an SMCA machine
>> before accessing this register, like:

I'm preparing a PATCH proposal to submit in a moment.

>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index 3f1dda355307..8664ba048a62 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -875,14 +875,18 @@ void amd_clear_bank(struct mce *m)
>>   {
>>          amd_reset_thr_limit(m->bank);
>>
>> -       /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
>> -       if (m->status & MCI_STATUS_DEFERRED)
>> -               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
>> -
>> -       /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
>> -       if (m->kflags & MCE_CHECK_DFR_REGS)
>> -               return;
>> +       if (mce_flags.smca) {
>> +               /*
>> +                * Clear MCA_DESTAT for all deferred errors even those
>> +                * logged in MCA_STATUS.
>> +                */
>> +               if (m->status & MCI_STATUS_DEFERRED)
>> +                       mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
>>
>> +               /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
>> +               if (m->kflags & MCE_CHECK_DFR_REGS)
>> +                       return;
>> +       }
>>          mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
>>   }
>>
>>
>> I haven't noticed any obvious other non SMCA limitation in the other
>> changes of this series, but if you think about any other case, we can
>> probably fix all of them together.
>>
>> If you agree with this change I can submit it as a formal PATCH.
>>
> 
> I think this change is fair. It could be minimized further by adding the
> SMCA check to the status bit check for the WRMSR step.
> 
> 	if (mce_flags.smca && (m->status & MCI_STATUS_DEFERRED))
> 		mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);


Yes this is the minimal fix in our case, but I think that more clearly
separating the SMCA operations better shows the differences between SMCA
and non-SMCA.
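
For comparison, the control flow of the two shapes discussed in this thread can be modeled in userspace as below. This is a sketch under stated assumptions, not the kernel code: struct mce_model, struct clear_result and the MODEL_* bit values are illustrative stand-ins, and the MSR writes are replaced by flags recording which register would have been written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the real definitions live in the kernel headers. */
#define MODEL_STATUS_DEFERRED	(1ULL << 44)
#define MODEL_CHECK_DFR_REGS	(1U << 0)

/* Hypothetical stand-in for the fields of struct mce used here. */
struct mce_model {
	uint64_t status;
	unsigned int kflags;
};

/* Records which registers a variant would have written. */
struct clear_result {
	bool wrote_destat;
	bool wrote_status;
};

/* Shape of the proposed diff: all SMCA-only steps under one check. */
struct clear_result clear_bank_grouped(bool smca, const struct mce_model *m)
{
	struct clear_result r = { false, false };

	if (smca) {
		if (m->status & MODEL_STATUS_DEFERRED)
			r.wrote_destat = true;
		if (m->kflags & MODEL_CHECK_DFR_REGS)
			return r;
	}
	r.wrote_status = true;
	return r;
}

/* Shape of the minimal variant: SMCA check added to the WRMSR step only. */
struct clear_result clear_bank_minimal(bool smca, const struct mce_model *m)
{
	struct clear_result r = { false, false };

	if (smca && (m->status & MODEL_STATUS_DEFERRED))
		r.wrote_destat = true;
	if (m->kflags & MODEL_CHECK_DFR_REGS)
		return r;
	r.wrote_status = true;
	return r;
}

/* True when both shapes write the same set of registers. */
bool variants_agree(bool smca, const struct mce_model *m)
{
	struct clear_result a = clear_bank_grouped(smca, m);
	struct clear_result b = clear_bank_minimal(smca, m);

	return a.wrote_destat == b.wrote_destat &&
	       a.wrote_status == b.wrote_status;
}

/* Neither shape touches MCA_DESTAT when SMCA is absent. */
void assert_no_destat_without_smca(const struct mce_model *m)
{
	assert(!clear_bank_grouped(false, m).wrote_destat);
	assert(!clear_bank_minimal(false, m).wrote_destat);
}
```

In this model both shapes avoid the MCA_DESTAT write on a non-SMCA machine. They differ only when the MCE_CHECK_DFR_REGS flag is set without SMCA: the grouped shape still clears MCA_STATUS, the minimal one returns early. Whether that combination can occur in the real kernel is not established here.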

I'll copy you on the PATCH proposal, maybe you can review it.

Thanks again for your detailed feedback.

Best regards,
William.


end of thread, other threads:[~2026-02-13 16:56 UTC | newest]

Thread overview: 11+ messages
2026-02-09 16:36 [RFC] AMD VM crashing on deferred memory error injection William Roche
2026-02-09 17:36 ` Borislav Petkov
2026-02-09 17:38   ` Luck, Tony
2026-02-09 17:53     ` Borislav Petkov
2026-02-09 21:08 ` Yazen Ghannam
2026-02-09 21:18   ` Yazen Ghannam
2026-02-11  1:42     ` William Roche
2026-02-11 16:34       ` Yazen Ghannam
2026-02-12 15:36         ` William Roche
2026-02-12 19:30           ` Yazen Ghannam
2026-02-13 16:55             ` William Roche
