* [PATCH v2 0/1] AMD VM crashing on deferred memory error injection
@ 2026-02-18 16:30 William Roche
  2026-02-18 16:30 ` [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling William Roche
  2026-03-12 14:23 ` [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche
  0 siblings, 2 replies; 12+ messages in thread
From: William Roche @ 2026-02-18 16:30 UTC (permalink / raw)
  To: yazen.ghannam, tony.luck, bp, tglx, mingo, dave.hansen, x86, hpa,
	linux-edac, linux-kernel
  Cc: John.Allen, jane.chu, william.roche

From: William Roche <william.roche@oracle.com>

Thank you very much Yazen for your review and all the suggestions!

v2 changes:
  - Commit title changed to:
    x86/mce/amd: Fix VM crash during deferred error handling
  - Commit message with capitalized QEMU and KVM as well as the imperative
    statement suggested by Yazen
  - "CC stable" tag placed after "Signed-off-by"
    (The documentation asks for "the sign-off area" without more details)
  - blank line added to separate the SMCA code block from the update of
    MCA_STATUS.

--
After the integration of the following commit:
7cb735d7c0cb x86/mce: Unify AMD DFR handler with MCA Polling

AMD QEMU VMs started to crash when dealing with deferred memory error
injection, with a stack trace like:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000)
at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

  amd_clear_bank+0x6e/0x70
  machine_check_poll+0x228/0x2e0
  ? __pfx_mce_timer_fn+0x10/0x10
  mce_timer_fn+0xb1/0x130
  ? __pfx_mce_timer_fn+0x10/0x10
  call_timer_fn+0x26/0x120
  __run_timers+0x202/0x290
  run_timer_softirq+0x49/0x100
  handle_softirqs+0xeb/0x2c0
  __irq_exit_rcu+0xda/0x100
  sysvec_apic_timer_interrupt+0x71/0x90
[...]
Kernel panic - not syncing: MCA architectural violation!

See the discussion at:
https://lore.kernel.org/all/48d8e1c8-1eb9-49cc-8de8-78077f29c203@oracle.com/

We identified a problem with SMCA specific registers access from
non-SMCA platforms like a QEMU/KVM machine.

This patch is checkpatch.pl clean.
Unit test of memory error injection works fine with it.

William Roche (1):
  x86/mce/amd: Fix VM crash during deferred error handling

 arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

-- 
2.47.3

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-02-18 16:30 [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche
@ 2026-02-18 16:30 ` William Roche
  2026-03-12 14:42   ` Borislav Petkov
  2026-03-12 14:23 ` [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche
  1 sibling, 1 reply; 12+ messages in thread
From: William Roche @ 2026-02-18 16:30 UTC (permalink / raw)
  To: yazen.ghannam, tony.luck, bp, tglx, mingo, dave.hansen, x86, hpa,
	linux-edac, linux-kernel
  Cc: John.Allen, jane.chu, william.roche

From: William Roche <william.roche@oracle.com>

A non Scalable MCA system may prevent access to SMCA specific registers
like MCA_DESTAT. This is the case of QEMU/KVM VMs, where the kernel
has to check for the SMCA feature before accessing MCA_DESTAT.

Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling")
Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 3f1dda355307..7b9932f13bca 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -875,13 +875,18 @@ void amd_clear_bank(struct mce *m)
 {
 	amd_reset_thr_limit(m->bank);
 
-	/* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
-	if (m->status & MCI_STATUS_DEFERRED)
-		mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
+	if (mce_flags.smca) {
+		/*
+		 * Clear MCA_DESTAT for all deferred errors even those
+		 * logged in MCA_STATUS.
+		 */
+		if (m->status & MCI_STATUS_DEFERRED)
+			mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
 
-	/* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
-	if (m->kflags & MCE_CHECK_DFR_REGS)
-		return;
+		/* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
+		if (m->kflags & MCE_CHECK_DFR_REGS)
+			return;
+	}
 
 	mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
 }
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling 2026-02-18 16:30 ` [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling “William Roche @ 2026-03-12 14:42 ` Borislav Petkov 2026-03-12 15:11 ` William Roche 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2026-03-12 14:42 UTC (permalink / raw) To: “William Roche Cc: yazen.ghannam, tony.luck, tglx, mingo, dave.hansen, x86, hpa, linux-edac, linux-kernel, John.Allen, jane.chu On Wed, Feb 18, 2026 at 04:30:25PM +0000, “William Roche wrote: > From: William Roche <william.roche@oracle.com> > > A non Scalable MCA system may prevent access to SMCA specific registers "may prevent"? Please explain in the commit message the whole scenario how you're triggering this in detail. > like MCA_DESTAT. This is the case of QEMU/KVM VMs, where the kernel > has to check for the SMCA feature before accessing MCA_DESTAT. > > Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling") > Signed-off-by: William Roche <william.roche@oracle.com> > Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> > Cc: stable@vger.kernel.org AFAIR, you're injecting errors. This is not really a critical fix that warrants this going to stable. > --- > arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------ > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c > index 3f1dda355307..7b9932f13bca 100644 > --- a/arch/x86/kernel/cpu/mce/amd.c > +++ b/arch/x86/kernel/cpu/mce/amd.c > @@ -875,13 +875,18 @@ void amd_clear_bank(struct mce *m) > { > amd_reset_thr_limit(m->bank); > > - /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */ > - if (m->status & MCI_STATUS_DEFERRED) > - mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0); > + if (mce_flags.smca) { All this code should not run in a VM. So why does it? What is the use case we're supposed to support here? -- Regards/Gruss, Boris. 
https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-03-12 14:42   ` Borislav Petkov
@ 2026-03-12 15:11     ` William Roche
  2026-03-12 16:04       ` Borislav Petkov
  0 siblings, 1 reply; 12+ messages in thread
From: William Roche @ 2026-03-12 15:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: yazen.ghannam, tony.luck, tglx, mingo, dave.hansen, x86, hpa,
	linux-edac, linux-kernel, John.Allen, jane.chu

Thank you for your reply,

On 3/12/26 15:42, Borislav Petkov wrote:
> On Wed, Feb 18, 2026 at 04:30:25PM +0000, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> A non Scalable MCA system may prevent access to SMCA specific registers
>
> "may prevent"?
>
> Please explain in the commit message the whole scenario how you're triggering
> this in detail.
>

From the kernel point of view (regardless of whether it is running on bare
metal or in a VM), access to these registers is provided by the platform:
either the hardware or the emulation framework.

Yazen indicated on Feb 12 that "AMD systems generally have a
Read-as-Zero/Writes-Ignored behavior when accessing unimplemented MCA
registers", but you rightly indicated on Feb 9 that "KVM works as
advertized" and so prevents access to unimplemented SMCA specific
registers.

That's the reason why I had to say "may". This access crashes on AMD VMs
and "may" work on AMD hardware according to Yazen.

>> like MCA_DESTAT. This is the case of QEMU/KVM VMs, where the kernel
>> has to check for the SMCA feature before accessing MCA_DESTAT.
>>
>> Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling")
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> Cc: stable@vger.kernel.org
>
> AFAIR, you're injecting errors. This is not really a critical fix that
> warrants this going to stable.

Errors are injected into VMs by the hypervisor when real memory hardware
errors occur on the system that impact the VM address space.

This is not only a test, this is a real-life mechanism. With commit
7cb735d7c0cb integrated, VM kernels running on AMD now crash on deferred
errors, where they used to be able to deal with them before this commit.

That's the reason why we need this additional fix.

>
>> ---
>>   arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
>>   1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index 3f1dda355307..7b9932f13bca 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -875,13 +875,18 @@ void amd_clear_bank(struct mce *m)
>>   {
>>   	amd_reset_thr_limit(m->bank);
>>
>> -	/* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
>> -	if (m->status & MCI_STATUS_DEFERRED)
>> -		mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
>> +	if (mce_flags.smca) {
>
> All this code should not run in a VM. So why does it?

Why do you say that this code should not run in a VM?
The error injection mechanism has been running for several years with
QEMU/KVM.
I must be missing something here. Please let me know.

>
> What is the use case we're supposed to support here?
>

Dealing with real-life deferred memory errors impacting a VM's address
space.

I hope this clarifies the need for this new kernel fix.

Thanks again,
William.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling 2026-03-12 15:11 ` William Roche @ 2026-03-12 16:04 ` Borislav Petkov 2026-03-12 22:44 ` William Roche 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2026-03-12 16:04 UTC (permalink / raw) To: William Roche Cc: yazen.ghannam, tony.luck, tglx, mingo, dave.hansen, x86, hpa, linux-edac, linux-kernel, John.Allen, jane.chu On Thu, Mar 12, 2026 at 04:11:10PM +0100, William Roche wrote: > From the kernel point of view (regardless if it is running on bare metal or > in a VM), access to these registers registers is provided by the platform: > either the Hardware or the emulation framework. Except the emulation doesn't emulate the platform properly. We test on real hw. If your hypervisor doesn't do that properly then that's not really upstream kernel's problem. > Errors are injected into VMs by the hypervisor when real memory hardware > errors occur on the system that impact the VM address space. And? Why? What's the recovery action scenario for having errors injected into guests? Where is that documented? Why does the upstream kernel need to care? Basically I'm asking you for the use case in order to determine whether that use case is valid for the *upstream* kernel to support. > This is not only a test, this is real life mechanism. With the fix > 7cb735d7c0cb that has been integrated, VMs kernel running on AMD now crashes > on Deferred errors, where it used to be able to deal with them before this > commit. Because we don't know of your use case. So when we do upstream development how can we test your case? Before that, is that case even worth testing? I hope I'm making sense here. The MCA and other low-level hw code works on baremetal as that's its main target. If it is supposed to work in VMs, then there better be a proper use case which we are willing to support and we can *actually* *test*. If not, you can keep this "fix" in your guest kernels and everyone's happy. Thx. 
-- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-03-12 16:04       ` Borislav Petkov
@ 2026-03-12 22:44         ` William Roche
  2026-03-13 20:10           ` Borislav Petkov
  2026-03-13 20:26           ` Yazen Ghannam
  0 siblings, 2 replies; 12+ messages in thread
From: William Roche @ 2026-03-12 22:44 UTC (permalink / raw)
  To: Borislav Petkov, yazen.ghannam
  Cc: tony.luck, tglx, mingo, dave.hansen, x86, hpa, linux-edac,
	linux-kernel, John.Allen, jane.chu

[-- Attachment #1: Type: text/plain, Size: 7495 bytes --]

Thank you for taking the time to explain your worries about the context
of this fix integration, and I do hope my feedback can help to convince
you.

On 3/12/26 17:04, Borislav Petkov wrote:
> On Thu, Mar 12, 2026 at 04:11:10PM +0100, William Roche wrote:
>>  From the kernel point of view (regardless if it is running on bare metal or
>> in a VM), access to these registers registers is provided by the platform:
>> either the Hardware or the emulation framework.
>
> Except the emulation doesn't emulate the platform properly. We test on real
> hw. If your hypervisor doesn't do that properly then that's not really
> upstream kernel's problem.

There are several aspects that are worth considering here:

First, I totally agree that the emulation has to emulate properly! :)

The problem we are facing is to consider how a non-SMCA platform reacts
to the update of an SMCA specific register. And is the QEMU/KVM VM
reaction as a non-SMCA machine a valid case?

In this VM case, the MSR handling emulation is done by KVM, which doesn't
implement a "permissive" access to unimplemented registers. I also agreed
with you when you said that it is working as advertised.

Now if emulating an AMD platform requires providing a "permissive" access
to a specific set of registers, the fix would not be absolutely
necessary. But I may have missed a specification about that. And if such
a thing exists, it would also be the responsibility of all kernels
(including upstream) to take that into account.

Yazen may help us on this aspect: Could you please let us know if there
is an AMD specification for accessing SMCA registers on non-SMCA
machines?

Now if we had a valid case of an existing non-SMCA AMD hardware that
could crash on updating an SMCA register, the fix would be needed not
only for the VM case.

Yazen, could you also please tell us if an existing non-SMCA AMD hardware
could crash on updating an SMCA register?

The commit 7cb735d7c0cb [x86/mce: Unify AMD DFR handler with MCA Polling]
written by Yazen introduced an upstream kernel problem on non-SMCA
platforms that has been revealed by the emulation framework on AMD.
That's the reason why I think it should be fixed in upstream too. And
Yazen himself agrees with that.

>
>> Errors are injected into VMs by the hypervisor when real memory hardware
>> errors occur on the system that impact the VM address space.
>
> And?

The injected error is received by the VM kernel to deal with it.

> Why?

The VM kernel executes the same mechanisms used on bare metal in that
case. As Tony said on Feb 9: The guest may be able to just kill a process
and keep running.

>
> What's the recovery action scenario for having errors injected into guests?

Just the same as running on real HW.

> Where is that documented? Why does the upstream kernel need to care?

Sorry, I don't have a kernel documentation pointer about that, but the
MCE relay mechanism sure is a hypervisor functionality.

>
> Basically I'm asking you for the use case in order to determine whether that
> use case is valid for the *upstream* kernel to support.

Yes, of course, see below.

>
>> This is not only a test, this is real life mechanism. With the fix
>> 7cb735d7c0cb that has been integrated, VMs kernel running on AMD now crashes
>> on Deferred errors, where it used to be able to deal with them before this
>> commit.
>
> Because we don't know of your use case. So when we do upstream development how
> can we test your case?
>

I have a procedure to verify the behavior: It consists of running the
upstream kernel in a VM (on an AMD platform) and injecting a memory error
from the hardware platform to this VM to mimic a real hardware error being
reported to the platform kernel.

To do so:
Run QEMU as root (to help with the address translation).
The VM runs the upstream kernel.
Run the small attached program in the VM as root, so that it gives a guest
physical address of one of its mapped memory pages.

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok

Data pages at 0xXXXXXXX physically 0xYYYYY000

-> DON'T Press enter ! (just leave the process wait here)

Ask the emulator (QEMU in this case) to give the host physical address of
the guest physical page:
(qemu) gpa2hpa 0xYYYYY000
Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000

From the host physical address get the pfn value (removing the last 3 zeros
of the address) to poison.

On the host, use hwpoison kernel module:
[root@host]# modprobe hwpoison_inject

and inject an error to the targeted pfn:
[root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn

Then wait until the asynchronous error generated reaches the VM (it can take
up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.

Without this suggested fix, the VM kernel panics, with the stack trace I
gave:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write
0x0000000000000000)
at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

  amd_clear_bank+0x6e/0x70
  machine_check_poll+0x228/0x2e0
  ? __pfx_mce_timer_fn+0x10/0x10
  mce_timer_fn+0xb1/0x130
  ? __pfx_mce_timer_fn+0x10/0x10
  call_timer_fn+0x26/0x120
  __run_timers+0x202/0x290
  run_timer_softirq+0x49/0x100
  handle_softirqs+0xeb/0x2c0
  __irq_exit_rcu+0xda/0x100
  sysvec_apic_timer_interrupt+0x71/0x90
[...]
Kernel panic - not syncing: MCA architectural violation!

With the fix the VM kernel deals with the error:

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok

Data pages at 0x7fa0f9b25000 physically 0x172929000

(qemu) gpa2hpa 0x172929000
Host physical address for 0x172929000 (pc.ram) is 0x237129000

-> Injecting the error with:
[root@host]# echo 0x237129 > /sys/kernel/debug/hwpoison/corrupt-pfn

-> The VM monitor indicates:
qemu-kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f3ae2729000 and GUEST addr 0x172929000 of type BUS_MCEERR_AO injected

-> A few minutes later, the VM console shows:
localhost login: [  332.973864] mce: [Hardware Error]: Machine check events logged
[  332.976795] Memory failure: 0x172929: Sending SIGBUS to mce_process_rea:5607 due to hardware memory corruption
[  332.977832] Memory failure: 0x172929: recovery action for dirty LRU page: Recovered
[  355.056785] MCE: Killing mce_process_rea:5607 due to hardware memory corruption fault at 0x7fa0f9b25000

-> The process shows:
Signal 7 received: BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
Signal 7 received: BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
Exit from the signal handler on BUS_MCEERR_AR

-> Works as expected: AO error is relayed by the VM kernel to the
application running.

> Before that, is that case even worth testing?

If we accept that relayed MCEs are supported by the upstream kernel running
in the VM, then yes.

>
> I hope I'm making sense here. The MCA and other low-level hw code works on
> baremetal as that's its main target. If it is supposed to work in VMs, then
> there better be a proper use case which we are willing to support and we can
> *actually* *test*.

The above detailed procedure can maybe help with this aspect, even if it
is virtualization oriented, as I do hope that the upstream kernel supports
memory error handling in a VM.
But Yazen's answers about non-SMCA hardware can also help to decide what
to do with this fix.

>
> If not, you can keep this "fix" in your guest kernels and everyone's happy.
>
> Thx.

I hope my explanations helped to better understand the context.

Thanks,
William.
[-- Attachment #2: mce_process_react.c --]
[-- Type: text/x-csrc, Size: 4517 bytes --]

#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>

#define PAGEMAP_ENTRY 8
#define GET_BIT(X,Y) (X & ((uint64_t)1<<Y)) >> Y
#define GET_PFN(X) X & 0x7FFFFFFFFFFFFF

const int __endian_bit = 1;
#define is_bigendian() ( (*(char*)&__endian_bit) == 0 )

static long pgsz;

/*
 * Set the early kill mode reaction state to MCE error.
 */
static void early_reaction() {
    printf("Setting Early kill... ");
    if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) == 0)
        printf("Ok\n");
    else
        printf("Failure !\n");
}

/*
 * Return the physical address associated to a given local virtual address,
 * or -1 in case of an error.
 */
static uint64_t physical_address(uint64_t virt_addr) {
    char path_buf [0x100];
    FILE * f;
    uint64_t read_val, file_offset, pfn = 0;
    unsigned char c_buf[PAGEMAP_ENTRY];
    pid_t my_pid = getpid();
    int status, i;

    sprintf(path_buf, "/proc/%u/pagemap", my_pid);

    f = fopen(path_buf, "rb");
    if(!f){
        printf("Error! Cannot open %s\n", path_buf);
        return (uint64_t)-1;
    }

    file_offset = virt_addr / (uint64_t)pgsz * PAGEMAP_ENTRY;
    status = fseek(f, (long)file_offset, SEEK_SET);
    if(status){
        perror("Failed to do fseek!");
        fclose(f);
        return (uint64_t)-1;
    }

    for(i=0; i < PAGEMAP_ENTRY; i++){
        int c = getc(f);
        if(c==EOF){
            fclose(f);
            return (uint64_t)-1;
        }
        if(is_bigendian())
            c_buf[i] = (unsigned char)c;
        else
            c_buf[PAGEMAP_ENTRY - i - 1] = (unsigned char)c;
    }
    fclose(f);

    read_val = 0;
    for(i=0; i < PAGEMAP_ENTRY; i++){
        read_val = (read_val << 8) + c_buf[i];
    }

    if(GET_BIT(read_val, 63)) {
        pfn = GET_PFN(read_val);
    } else {
        printf("Page not present !\n");
    }
    if(GET_BIT(read_val, 62))
        printf("Page swapped\n");

    if (pfn == 0)
        return (uint64_t)-1;

    return pfn * (uint64_t)pgsz;
}

/*
 * SIGBUS handler to display the given information.
 */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
    printf("Signal %d received: ", signum);
    printf("%s on vaddr: %p\n",
        (siginfo->si_code == 4? "BUS_MCEERR_AR":"BUS_MCEERR_AO"),
        siginfo->si_addr);

    if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
        fprintf(stderr, "Exit from the signal handler on BUS_MCEERR_AR\n");
        _exit(1);
    }
}

int main(int argc, char ** argv) {
    struct sigaction my_sigaction;
    uint64_t virt_addr = 0, phys_addr;
    void *local_pnt;

    // Need to have the CAP_SYS_ADMIN capability to get PFNs values in pagemap.
    if (getuid() != 0) {
        fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // attach our SIGBUS handler.
    memset(&my_sigaction, 0, sizeof(my_sigaction));
    my_sigaction.sa_sigaction = sigbus_action;
    my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
    sigemptyset(&my_sigaction.sa_mask);
    if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
        perror("Signal handler attach failed");
        exit(EXIT_FAILURE);
    }

    pgsz = sysconf(_SC_PAGESIZE);
    if (pgsz == -1) {
        perror("sysconf(_SC_PAGESIZE)");
        exit(EXIT_FAILURE);
    }

    early_reaction();

    // Allocate a private page.
    local_pnt = mmap(NULL, pgsz, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
    if (local_pnt == MAP_FAILED) {
        fprintf(stderr, "Memory Allocation failed !\n");
        exit(EXIT_FAILURE);
    }
    virt_addr = (uint64_t)local_pnt;

    // Dirty / map the page.
    sprintf((char *)local_pnt, "My page\n");

    phys_addr = physical_address(virt_addr);
    if (phys_addr == -1) {
        fprintf(stderr, "Virtual address translation 0x%llx failed\n",
            (unsigned long long)virt_addr);
        exit(EXIT_FAILURE);
    }
    printf("\nData pages at 0x%llx  physically 0x%llx\n",
        (unsigned long long)virt_addr, (unsigned long long)phys_addr);
    fflush(stdout);

    printf("\nPress ENTER to continue\n");
    fgetc(stdin);

    // read the string at the beginning of page.
    printf("%s", (char *)local_pnt);

    phys_addr = physical_address(virt_addr);
    if (phys_addr == -1) {
        fprintf(stderr, "Virtual address translation 0x%llx failed\n",
            (unsigned long long)virt_addr);
    } else {
        printf("\nData pages at 0x%llx  physically 0x%llx\n",
            (unsigned long long)virt_addr, (unsigned long long)phys_addr);
    }

    return 0;
}

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling 2026-03-12 22:44 ` William Roche @ 2026-03-13 20:10 ` Borislav Petkov 2026-03-16 15:27 ` William Roche 2026-03-13 20:26 ` Yazen Ghannam 1 sibling, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2026-03-13 20:10 UTC (permalink / raw) To: William Roche Cc: yazen.ghannam, tony.luck, tglx, mingo, dave.hansen, x86, hpa, linux-edac, linux-kernel, John.Allen, jane.chu On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote: > Yazen, could you also please tell us if an existing non-SMCA AMD hardware > could crash on updating an SMCA register ? So, the situation is this: if software needs to access a MCA_DESTATUS MSR - which is part of AMD's MCA extensions - then software needs to check the smca bit. So your patch is correct. The justification about it is not. It should talk about how software should touch that MSR *only* *after* having checked mce_flags.smca. Because, it doesn't matter what KVM does or whoever - we all adhere to the hw spec. Because technically speaking, this code should blow up on non-SMCA machines too because they do support deferred errors (Bulldozer for example) but they will #GP on access to the MCA_DESTATUS MSRs as those are reserved there. So please rewrite your commit message to state that. And then you can talk about what the real-life situation is which caught this. As to your use case - thanks for explaining it. If this is something which people run, then it would be wonderful if we had a simple test script in the kernel which verifies new changes don't break it and so that we can run it periodically as part of testing. HTH. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling 2026-03-13 20:10 ` Borislav Petkov @ 2026-03-16 15:27 ` William Roche 0 siblings, 0 replies; 12+ messages in thread From: William Roche @ 2026-03-16 15:27 UTC (permalink / raw) To: Borislav Petkov Cc: yazen.ghannam, tony.luck, tglx, mingo, dave.hansen, x86, hpa, linux-edac, linux-kernel, John.Allen, jane.chu On 3/13/26 21:10, Borislav Petkov wrote: > On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote: >> Yazen, could you also please tell us if an existing non-SMCA AMD hardware >> could crash on updating an SMCA register ? > > So, the situation is this: if software needs to access a MCA_DESTATUS MSR > - which is part of AMD's MCA extensions - then software needs to check the > smca bit. > > So your patch is correct. The justification about it is not. > > It should talk about how software should touch that MSR *only* *after* having > checked mce_flags.smca. > Ok, I understand your point. > Because, it doesn't matter what KVM does or whoever - we all adhere to the hw > spec. > > Because technically speaking, this code should blow up on non-SMCA machines > too because they do support deferred errors (Bulldozer for example) but they > will #GP on access to the MCA_DESTATUS MSRs as those are reserved there. This is a little more complicated as Yazen raised the situation in his answer. But I agree that SMCA specific registers are reserved and should not be accessed without checking that it is allowed to do so, first. > > So please rewrite your commit message to state that. And then you can talk > about what the real-life situation is which caught this. > Sure, I'm going to submit a new version of this patch using this new commit message: x86/mce/amd: Guard SMCA DESTAT access on non-SMCA machines Access to SMCA specific registers like MCA_DESTAT should only be done after having checked the smca bit. 
Avoiding a non-SMCA machine (like AMD QEMU/KVM VMs) crash during deferred error handling. Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling") Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: stable@vger.kernel.org > As to your use case - thanks for explaining it. If this is something which > people run, then it would be wonderful if we had a simple test script in the > kernel which verifies new changes don't break it and so that we can run it > periodically as part of testing. That would be great ! If there is a framework to create simple test script running the built kernel into a VM, I'd be happy to know about it and create the test we are talking about -- as a separate fix proposal. Thanks again for your feedback, William. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-03-12 22:44 ` William Roche
  2026-03-13 20:10 ` Borislav Petkov
@ 2026-03-13 20:26 ` Yazen Ghannam
  2026-03-16 15:26 ` William Roche
  1 sibling, 1 reply; 12+ messages in thread
From: Yazen Ghannam @ 2026-03-13 20:26 UTC (permalink / raw)
  To: William Roche
  Cc: Borislav Petkov, tony.luck, tglx, mingo, dave.hansen, x86, hpa,
      linux-edac, linux-kernel, John.Allen, jane.chu

On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote:

[...]

>
> Yazen may help us on this aspect: Could you please let us know if there is
> an AMD specification for accessing SMCA registers on non SMCA machines ?
>
>
> Now if we had a valid case of an existing non-SMCA AMD hardware that could
> crash on updating an SMCA register, the fix would be needed not only for the
> VM case.
>
> Yazen, could you also please tell us if an existing non-SMCA AMD hardware
> could crash on updating an SMCA register ?
>

All the systems I have access to are Zen systems, and all Zen systems
are SMCA systems. I'll try to find an older system to test (Bulldozer,
etc.).

[...]

>
> I have a procedure to verify the behavior: It consists of running the
> upstream kernel in a VM (on an AMD platform) and injecting a memory error
> from the hardware platform to this VM to mimic a real hardware error being
> reported to the platform Kernel.
>
> To do so:
> Run QEMU as root (to help with the address translation).
> The VM runs the upstream kernel.
> Run the small attached program in the VM as root, so that it gives a guest
> physical address of one of its mapped memory pages.
>
> [root@VM]# ./mce_process_react_x86
> Setting Early kill... Ok
>
> Data pages at 0xXXXXXXX physically 0xYYYYY000
>
> -> DON'T Press enter ! (just leave the process wait here)
>
> Ask the emulator (QEMU in this case) to give the host physical address of
> the guest physical page:
> (qemu) gpa2hpa 0xYYYYY000
> Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000
>
> From the host physical address get the pfn value (removing the last 3 zeros
> of the address) to poison.
>
> On the host, use hwpoison kernel module:
> [root@host]# modprobe hwpoison_inject
>
> and inject an error to the targeted pfn:
> [root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>
> Then wait until the Asynchronous error generated reaches the VM (it can take
> up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.

...hint for below question.

>
> Without this suggested fix, the VM kernel panics, with the stack trace I
> gave:
>
> mce: MSR access error: WRMSR to 0xc0002098 (tried to write
> 0x0000000000000000)
> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
>
> amd_clear_bank+0x6e/0x70
> machine_check_poll+0x228/0x2e0
> ? __pfx_mce_timer_fn+0x10/0x10
> mce_timer_fn+0xb1/0x130
> ? __pfx_mce_timer_fn+0x10/0x10
> call_timer_fn+0x26/0x120
> __run_timers+0x202/0x290
> run_timer_softirq+0x49/0x100
> handle_softirqs+0xeb/0x2c0
> __irq_exit_rcu+0xda/0x100
> sysvec_apic_timer_interrupt+0x71/0x90
> [...]
> Kernel panic - not syncing: MCA architectural violation!

The code flow indicates that a Deferred error was found by MCA polling.

I thought QEMU injects a #MC into the guest?

William, do you encounter the issue if you disable MCA polling in the
guest?

To my knowledge, Deferred errors are reported starting with Zen/SMCA
systems, even though the concept is found in older documentation. This
is another reason for the implicit handling.

I see in QEMU we set the DEFERRED status bit for BUS_MCEERR_AO errors. I
don't recall why we did that. I'll need to review the old threads.

I feel like the intent was to select bits to produce the desired outcome
rather than faithfully replicate hardware behavior. Specifically, the
DEFERRED status bit would prevent the CE filtering condition in
do_machine_check(). And it would trigger the AO flow in the guest rather
than the AR flow if we set the UC status bit.

Another example is we use the POISON status bit so the address is marked
as "usable". A real DEFERRED error would never have the POISON status
bit; they are mutually exclusive by definition.

But there may be another hidden issue: handling the error through
polling rather than #MC. I'm thinking this isn't intentional, and the
recent Linux changes exposed this behavior.

Thanks,
Yazen

^ permalink raw reply [flat|nested] 12+ messages in thread
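The bit selection described in the message above can be illustrated with a minimal sketch in C. It assumes the MCA_STATUS bit positions used by the Linux and QEMU headers (VAL = bit 63, DEFERRED = bit 44, POISON = bit 43). The helper names build_ao_status() and hw_plausible_deferred() are hypothetical and exist in neither code base; the other status bits QEMU sets are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, following the AMD MCA_STATUS layout. */
#define MCI_STATUS_VAL      (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* deferred error */
#define MCI_STATUS_POISON   (1ULL << 43)  /* poisoned data consumed */

/* The two bits are distinct, so both can be set in one status word. */
static_assert((MCI_STATUS_DEFERRED & MCI_STATUS_POISON) == 0,
              "DEFERRED and POISON must be separate bits");

/*
 * Hypothetical helper: mimic the emulator's choice of setting both
 * DEFERRED and POISON on top of whatever base bits the caller passes in.
 */
static uint64_t build_ao_status(uint64_t base)
{
    return base | MCI_STATUS_DEFERRED | MCI_STATUS_POISON;
}

/*
 * Hypothetical helper: true when the status looks like a deferred error
 * that real hardware could report, i.e. DEFERRED set and POISON clear.
 */
static bool hw_plausible_deferred(uint64_t status)
{
    return (status & MCI_STATUS_DEFERRED) && !(status & MCI_STATUS_POISON);
}
```

With these helpers, build_ao_status(MCI_STATUS_VAL) yields a word that hw_plausible_deferred() rejects, which matches the observation that the emulated combination is chosen for its effect on the guest rather than to replicate hardware.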
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-03-13 20:26 ` Yazen Ghannam
@ 2026-03-16 15:26 ` William Roche
  2026-03-19 14:25 ` Yazen Ghannam
  0 siblings, 1 reply; 12+ messages in thread
From: William Roche @ 2026-03-16 15:26 UTC (permalink / raw)
  To: Yazen Ghannam
  Cc: Borislav Petkov, tony.luck, tglx, mingo, dave.hansen, x86, hpa,
      linux-edac, linux-kernel, John.Allen, jane.chu

On 3/13/26 21:26, Yazen Ghannam wrote:
> On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote:
>
> [...]
>
>>
>> Yazen may help us on this aspect: Could you please let us know if there is
>> an AMD specification for accessing SMCA registers on non SMCA machines ?
>>
>>
>> Now if we had a valid case of an existing non-SMCA AMD hardware that could
>> crash on updating an SMCA register, the fix would be needed not only for the
>> VM case.
>>
>> Yazen, could you also please tell us if an existing non-SMCA AMD hardware
>> could crash on updating an SMCA register ?
>>
>
> All the systems I have access to are Zen systems, and all Zen systems
> are SMCA systems. I'll try to find an older system to test (Bulldozer,
> etc.).

I don't think that it is needed anymore, if bare metal doesn't show
this case of AO errors being dealt with the same way (as discussed below).
It looks to me like the QEMU/KVM VM case could be a specific case,
exposed with your new change.

>
> [...]
>
>>
>> I have a procedure to verify the behavior: It consists of running the
>> upstream kernel in a VM (on an AMD platform) and injecting a memory error
>> from the hardware platform to this VM to mimic a real hardware error being
>> reported to the platform Kernel.
>>
>> To do so:
>> Run QEMU as root (to help with the address translation).
>> The VM runs the upstream kernel.
>> Run the small attached program in the VM as root, so that it gives a guest
>> physical address of one of its mapped memory pages.
>>
>> [root@VM]# ./mce_process_react_x86
>> Setting Early kill... Ok
>>
>> Data pages at 0xXXXXXXX physically 0xYYYYY000
>>
>> -> DON'T Press enter ! (just leave the process wait here)
>>
>> Ask the emulator (QEMU in this case) to give the host physical address of
>> the guest physical page:
>> (qemu) gpa2hpa 0xYYYYY000
>> Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000
>>
>> From the host physical address get the pfn value (removing the last 3 zeros
>> of the address) to poison.
>>
>> On the host, use hwpoison kernel module:
>> [root@host]# modprobe hwpoison_inject
>>
>> and inject an error to the targeted pfn:
>> [root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>>
>> Then wait until the Asynchronous error generated reaches the VM (it can take
>> up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.
>
> ...hint for below question.
>
>>
>> Without this suggested fix, the VM kernel panics, with the stack trace I
>> gave:
>>
>> mce: MSR access error: WRMSR to 0xc0002098 (tried to write
>> 0x0000000000000000)
>> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
>>
>> amd_clear_bank+0x6e/0x70
>> machine_check_poll+0x228/0x2e0
>> ? __pfx_mce_timer_fn+0x10/0x10
>> mce_timer_fn+0xb1/0x130
>> ? __pfx_mce_timer_fn+0x10/0x10
>> call_timer_fn+0x26/0x120
>> __run_timers+0x202/0x290
>> run_timer_softirq+0x49/0x100
>> handle_softirqs+0xeb/0x2c0
>> __irq_exit_rcu+0xda/0x100
>> sysvec_apic_timer_interrupt+0x71/0x90
>> [...]
>> Kernel panic - not syncing: MCA architectural violation!
>
> The code flow indicates that a Deferred error was found by MCA polling.

This is right.

>
> I thought QEMU injects a #MC into the guest?

The way AO error handling has been integrated into QEMU/KVM for the AMD VM
case relies on machine_check_poll().

>
> William, do you encounter the issue if you disable MCA polling in the
> guest?

If I disable machine check polling (with the mce=ignore_ce kernel option
for example), the AO error is not seen in the VM anymore, and of course
we don't crash because of it.

>
> To my knowledge, Deferred errors are reported starting with Zen/SMCA
> systems, even though the concept is found in older documentation. This
> is another reason for the implicit handling.
>
> I see in QEMU we set the DEFERRED status bit for BUS_MCEERR_AO errors. I
> don't recall why we did that. I'll need to review the old threads.
>
> I feel like the intent was to select bits to produce the desired outcome
> rather than faithfully replicate hardware behavior. Specifically, the
> DEFERRED status bit would prevent CE filtering condition in
> do_machine_check(). And it would trigger the AO flow in the guest rather
> than the AR flow if we set the UC status bit.
>
> Another example is we use the POISON status bit so the address is marked
> as "usable". A real DEFERRED error would never have the POISON status
> bit; they are mutually exclusive by definition.

That's the QEMU/KVM choice that was made about 2 years ago, and
explained in the following comment of the *QEMU* fix:
4b77512b2782 i386: Fix MCE support for AMD hosts

target/i386/kvm/kvm.c function kvm_mce_inject():

        /* Setting the POISON bit for deferred errors indicates to the
         * guest kernel that the address provided by the MCE is valid
         * and usable which will ensure that the guest kernel will send
         * a SIGBUS_AO signal to the guest process. This allows for
         * more desirable behavior in the case that the guest process
         * with poisoned memory has set the MCE_KILL_EARLY prctl flag
         * which indicates that the process would prefer to handle or
         * shutdown due to the poisoned memory condition before the
         * memory has been accessed.
         *
         * While the POISON bit would not be set in a deferred error
         * sent from hardware, the bit is not meaningful for deferred
         * errors and can be reused in this scenario.
         */
        status |= MCI_STATUS_DEFERRED | MCI_STATUS_POISON;

>
> But there may be another hidden issue: handling the error through
> polling rather than #MC. I'm thinking this isn't intentional, and the
> recent Linux changes exposed this behavior.

You are right about "recent Linux changes exposed this behavior", but
handling AO this way was intentional.

With the suggested fix, we should cover this newly exposed failure case.

Now if we have a better way to deal with AO error handling on AMD VMs,
it could be the subject of a separate thread (probably a QEMU thread).
Our current suggested kernel fix would still be valid, even if the
code may not be exercised in the bare-metal case.

>
> Thanks,
> Yazen

Thank you very much Yazen for your help !

Cheers,
William.

^ permalink raw reply [flat|nested] 12+ messages in thread
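The "remove the last 3 zeros" step of the injection procedure quoted above amounts to dropping three hexadecimal digits, i.e. a right shift by 12 bits. The following is a minimal sketch in C, assuming 4 KiB pages; hpa_to_pfn(), format_pfn() and SKETCH_PAGE_SHIFT are hypothetical names used only for illustration, and the example addresses are made up.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4 KiB pages, so the page offset is the low 12 bits. */
#define SKETCH_PAGE_SHIFT 12

static_assert((1u << SKETCH_PAGE_SHIFT) == 4096, "sketch assumes 4 KiB pages");

/* Convert a host physical address (as printed by gpa2hpa) to a PFN. */
static uint64_t hpa_to_pfn(uint64_t hpa)
{
    return hpa >> SKETCH_PAGE_SHIFT;
}

/*
 * Format the PFN as a hex string of the kind written to
 * /sys/kernel/debug/hwpoison/corrupt-pfn in the procedure.
 * Returns the value returned by snprintf().
 */
static int format_pfn(char *buf, size_t len, uint64_t hpa)
{
    return snprintf(buf, len, "0x%" PRIx64, hpa_to_pfn(hpa));
}
```

For a hypothetical gpa2hpa result of 0x7f2a1c000, format_pfn() produces the string 0x7f2a1c, which is the value to echo into corrupt-pfn on the host.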
* Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
  2026-03-16 15:26 ` William Roche
@ 2026-03-19 14:25 ` Yazen Ghannam
  0 siblings, 0 replies; 12+ messages in thread
From: Yazen Ghannam @ 2026-03-19 14:25 UTC (permalink / raw)
  To: William Roche
  Cc: Borislav Petkov, tony.luck, tglx, mingo, dave.hansen, x86, hpa,
      linux-edac, linux-kernel, John.Allen, jane.chu

On Mon, Mar 16, 2026 at 04:26:11PM +0100, William Roche wrote:

[...]

>
> With the suggested fix, we should cover this new exposed failure case.
>
> Now if we have a better way to deal with AO error handling on AMD VMs, it
> could be the subject of a separate thread (probably a QEMU thread).
> Our current suggested kernel fix would still be valid, even if the code
> may not be exercised in the bare-metal case.
>

Yes, that's right. Enhancing the AO handling flow is a separate discussion.

Thanks,
Yazen

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/1] AMD VM crashing on deferred memory error injection
  2026-02-18 16:30 [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche
  2026-02-18 16:30 ` [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling William Roche
@ 2026-03-12 14:23 ` William Roche
  1 sibling, 0 replies; 12+ messages in thread
From: William Roche @ 2026-03-12 14:23 UTC (permalink / raw)
  To: yazen.ghannam, tony.luck, bp, tglx, mingo, dave.hansen, x86, hpa,
      linux-edac, linux-kernel
  Cc: John.Allen, jane.chu

On 2/18/26 17:30, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> Thank you very much Yazen for your review and all the suggestions!
>
> v2 changes:
> - Commit title changed to:
>   x86/mce/amd: Fix VM crash during deferred error handling
> - Commit message with capitalized QEMU and KVM as well as the imperative
>   statement suggested by Yazen
> - "CC stable" tag placed after "Signed-off-by"
>   (The documentation asks for "the sign-off area" without more details)
> - blank line added to separate SMCA code block and the update of
>   MCA_STATUS.
>
> --
>
> After the integration of the following commit:
> 7cb735d7c0cb x86/mce: Unify AMD DFR handler with MCA Polling
>
> AMD QEMU VM started to crash when dealing with deferred memory error
> injection with a stack trace like:
>
> mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000)
> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
>
> amd_clear_bank+0x6e/0x70
> machine_check_poll+0x228/0x2e0
> ? __pfx_mce_timer_fn+0x10/0x10
> mce_timer_fn+0xb1/0x130
> ? __pfx_mce_timer_fn+0x10/0x10
> call_timer_fn+0x26/0x120
> __run_timers+0x202/0x290
> run_timer_softirq+0x49/0x100
> handle_softirqs+0xeb/0x2c0
> __irq_exit_rcu+0xda/0x100
> sysvec_apic_timer_interrupt+0x71/0x90
> [...]
> Kernel panic - not syncing: MCA architectural violation!
>
> See the discussion at:
> https://lore.kernel.org/all/48d8e1c8-1eb9-49cc-8de8-78077f29c203@oracle.com/
>
> We identified a problem with SMCA specific registers access from
> non-SMCA platforms like a QEMU/KVM machine.
>
> This patch is checkpatch.pl clean.
> Unit test of memory error injection works fine with it.
>
>
> William Roche (1):
>   x86/mce/amd: Fix VM crash during deferred error handling
>
>  arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
>

Hello,

This fix has been reviewed by Yazen Ghannam.
The code was tested with QEMU/KVM virtual machines on AMD platforms.

The commit that is fixed here (7cb735d7c0cb) is present in the stable
branch linux-6.19.y.

Could you please let me know if anything is missing to integrate this
fix ?

Thanks in advance for your feedback,
William.

^ permalink raw reply [flat|nested] 12+ messages in thread
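The MSR number in the quoted trace, 0xc0002098, can be decoded by hand. The sketch below assumes the SMCA MSR layout used by the Linux kernel headers: per-bank registers start at 0xc0002000, each bank occupies 0x10 consecutive MSRs, and the deferred-error status register (MCA_DESTAT) sits at offset 0x8 within a bank. The struct and function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SMCA MSR layout (base, per-bank stride, DESTAT offset). */
#define SMCA_MSR_BASE    0xc0002000u
#define SMCA_BANK_STRIDE 0x10u
#define SMCA_OFF_DESTAT  0x8u

static_assert(SMCA_OFF_DESTAT < SMCA_BANK_STRIDE, "offset must fit in a bank");

/* Hypothetical result type: bank number and register offset in the bank. */
struct smca_msr {
    unsigned int bank;
    unsigned int reg;
};

/* Split an SMCA MSR number into its bank and in-bank register offset. */
static struct smca_msr smca_decode(uint32_t msr)
{
    uint32_t rel = msr - SMCA_MSR_BASE;
    struct smca_msr d;

    d.bank = rel / SMCA_BANK_STRIDE;
    d.reg = rel % SMCA_BANK_STRIDE;
    return d;
}
```

Under these assumptions, smca_decode(0xc0002098) gives bank 9 and offset 0x8, so the faulting WRMSR would be the write of zero to the deferred-error status register of bank 9, which is consistent with the SMCA-specific register access the cover letter identifies as the problem on a non-SMCA guest.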
end of thread, other threads:[~2026-03-19 14:27 UTC | newest]

Thread overview: 12+ messages
2026-02-18 16:30 [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche
2026-02-18 16:30 ` [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling William Roche
2026-03-12 14:42 ` Borislav Petkov
2026-03-12 15:11 ` William Roche
2026-03-12 16:04 ` Borislav Petkov
2026-03-12 22:44 ` William Roche
2026-03-13 20:10 ` Borislav Petkov
2026-03-16 15:27 ` William Roche
2026-03-13 20:26 ` Yazen Ghannam
2026-03-16 15:26 ` William Roche
2026-03-19 14:25 ` Yazen Ghannam
2026-03-12 14:23 ` [PATCH v2 0/1] AMD VM crashing on deferred memory error injection William Roche