From: William Roche <william.roche@oracle.com>
To: Borislav Petkov <bp@alien8.de>, yazen.ghannam@amd.com
Cc: tony.luck@intel.com, tglx@kernel.org, mingo@redhat.com,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
John.Allen@amd.com, jane.chu@oracle.com
Subject: Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
Date: Thu, 12 Mar 2026 23:44:04 +0100
Message-ID: <ff7008cd-db47-4a0a-a74f-dcf3e6ab1891@oracle.com>
In-Reply-To: <20260312160453.GDabLkJfhslCLXZntv@fat_crate.local>
[-- Attachment #1: Type: text/plain, Size: 7495 bytes --]
Thank you for taking the time to explain your concerns about the context
in which this fix would be integrated. I hope my feedback helps to
convince you.
On 3/12/26 17:04, Borislav Petkov wrote:
> On Thu, Mar 12, 2026 at 04:11:10PM +0100, William Roche wrote:
>> From the kernel point of view (regardless if it is running on bare metal or
>> in a VM), access to these registers is provided by the platform:
>> either the Hardware or the emulation framework.
>
> Except the emulation doesn't emulate the platform properly. We test on real
> hw. If your hypervisor doesn't do that properly then that's not really
> upstream kernel's problem.
There are several aspects that are worth considering here:
First, I totally agree that the emulation has to emulate properly! :)
The question we are facing is how a non-SMCA platform reacts to a write
to an SMCA-specific register.
And is a QEMU/KVM VM behaving as a non-SMCA machine a valid case?
In this VM case, the MSR handling emulation is done by KVM, which doesn't
implement "permissive" access to unimplemented registers. I also
agreed with you when you said that it is working as advertised.
Now, if emulating an AMD platform requires providing "permissive"
access to a specific set of registers, the fix would not be strictly
necessary. But I may have missed a specification about that. And if such
a thing exists, it would also be the responsibility of all kernels
(including upstream) to take it into account.
Yazen may be able to help us on this aspect: could you please let us know
if there is an AMD specification for accessing SMCA registers on non-SMCA
machines?
Now, if there were existing non-SMCA AMD hardware that could crash on
updating an SMCA register, the fix would be needed not only for the VM
case.
Yazen, could you also please tell us whether existing non-SMCA AMD
hardware could crash on updating an SMCA register?
Commit 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling"),
written by Yazen, introduced an upstream kernel problem on non-SMCA
platforms that has been revealed by the emulation framework on AMD.
That's the reason why I think it should be fixed upstream too. And Yazen
himself agrees with that.
>
>> Errors are injected into VMs by the hypervisor when real memory hardware
>> errors occur on the system that impact the VM address space.
>
> And?
The injected error is received by the VM kernel, which then has to deal
with it.
> Why?
The VM kernel executes the same mechanisms used on bare metal in that case.
As Tony said on Feb 9: The guest may be able to just kill a process and
keep running.
>
> What's the recovery action scenario for having errors injected into guests?
Just the same as running on real HW.
> Where is that documented? Why does the upstream kernel need to care?
Sorry, I don't have a kernel documentation pointer about that, but the
MCE relay mechanism sure is a hypervisor functionality.
>
> Basically I'm asking you for the use case in order to determine whether that
> use case is valid for the *upstream* kernel to support.
Yes, of course, see below.
>
>> This is not only a test, this is real life mechanism. With the fix
>> 7cb735d7c0cb that has been integrated, VMs kernel running on AMD now crashes
>> on Deferred errors, where it used to be able to deal with them before this
>> commit.
>
> Because we don't know of your use case. So when we do upstream development how
> can we test your case?
>
I have a procedure to verify the behavior: it consists of running the
upstream kernel in a VM (on an AMD platform) and injecting a memory
error from the hardware platform into this VM, to mimic a real hardware
error being reported to the platform kernel.
To do so:
Run QEMU as root (to help with the address translation).
The VM runs the upstream kernel.
Run the small attached program in the VM as root, so that it gives the
guest physical address of one of its mapped memory pages.
[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok
Data pages at 0xXXXXXXX physically 0xYYYYY000
-> DON'T Press enter ! (just leave the process wait here)
Ask the emulator (QEMU in this case) to give the host physical address
of the guest physical page:
(qemu) gpa2hpa 0xYYYYY000
Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000
From the host physical address, get the PFN to poison by dropping the
last three hexadecimal digits of the address (the page offset, with
4 KiB pages).
On the host, use hwpoison kernel module:
[root@host]# modprobe hwpoison_inject
and inject an error to the targeted pfn:
[root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn
Then wait until the generated asynchronous error reaches the VM (it can
take up to 5 minutes on AMD virtualization) to see the VM kernel deal
with it.
Without this suggested fix, the VM kernel panics, with the stack trace I
gave:
mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
amd_clear_bank+0x6e/0x70
machine_check_poll+0x228/0x2e0
? __pfx_mce_timer_fn+0x10/0x10
mce_timer_fn+0xb1/0x130
? __pfx_mce_timer_fn+0x10/0x10
call_timer_fn+0x26/0x120
__run_timers+0x202/0x290
run_timer_softirq+0x49/0x100
handle_softirqs+0xeb/0x2c0
__irq_exit_rcu+0xda/0x100
sysvec_apic_timer_interrupt+0x71/0x90
[...]
Kernel panic - not syncing: MCA architectural violation!
With the fix, the VM kernel deals with the error:
[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok
Data pages at 0x7fa0f9b25000 physically 0x172929000
(qemu) gpa2hpa 0x172929000
Host physical address for 0x172929000 (pc.ram) is 0x237129000
-> Injecting the error with:
[root@host]# echo 0x237129 > /sys/kernel/debug/hwpoison/corrupt-pfn
-> The VM monitor indicates:
qemu-kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f3ae2729000 and GUEST addr 0x172929000 of type BUS_MCEERR_AO injected
-> A few minutes later, the VM console shows:
localhost login: [ 332.973864] mce: [Hardware Error]: Machine check events logged
[ 332.976795] Memory failure: 0x172929: Sending SIGBUS to mce_process_rea:5607 due to hardware memory corruption
[ 332.977832] Memory failure: 0x172929: recovery action for dirty LRU page: Recovered
[ 355.056785] MCE: Killing mce_process_rea:5607 due to hardware memory corruption fault at 0x7fa0f9b25000
-> The process shows:
Signal 7 received: BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
Signal 7 received: BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
Exit from the signal handler on BUS_MCEERR_AR
-> Works as expected: the AO error is relayed by the VM kernel to the
running application.
> Before that, is that case even worth testing?
If we accept that relayed MCEs are supported by the upstream kernel
running in the VM, then yes.
>
> I hope I'm making sense here. The MCA and other low-level hw code works on
> baremetal as that's its main target. If it is supposed to work in VMs, then
> there better be a proper use case which we are willing to support and we can
> *actually* *test*.
The procedure detailed above may help with this aspect, even if it is
virtualization-oriented, as I do hope that the upstream kernel supports
memory error handling in a VM.
But Yazen's answers about non-SMCA hardware can also help to decide what
to do with this fix.
>
> If not, you can keep this "fix" in your guest kernels and everyone's happy.
>
> Thx.
I hope my explanations help to clarify the context.
Thanks,
William.
[-- Attachment #2: mce_process_react.c --]
[-- Type: text/x-csrc, Size: 4517 bytes --]
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>
#define PAGEMAP_ENTRY 8
#define GET_BIT(X,Y) (((X) & ((uint64_t)1 << (Y))) >> (Y))
#define GET_PFN(X) ((X) & 0x7FFFFFFFFFFFFFULL)
const int __endian_bit = 1;
#define is_bigendian() ( (*(char*)&__endian_bit) == 0 )
static long pgsz;
/*
 * Set the early kill mode reaction state to MCE error.
 */
static void early_reaction() {
    printf("Setting Early kill... ");
    if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) == 0)
        printf("Ok\n");
    else
        printf("Failure !\n");
}
/*
 * Return the physical address associated to a given local virtual address,
 * or -1 in case of an error.
 */
static uint64_t physical_address(uint64_t virt_addr) {
    char path_buf [0x100];
    FILE * f;
    uint64_t read_val, file_offset, pfn = 0;
    unsigned char c_buf[PAGEMAP_ENTRY];
    pid_t my_pid = getpid();
    int status, i;

    sprintf(path_buf, "/proc/%u/pagemap", my_pid);
    f = fopen(path_buf, "rb");
    if(!f){
        printf("Error! Cannot open %s\n", path_buf);
        return (uint64_t)-1;
    }

    file_offset = virt_addr / (uint64_t)pgsz * PAGEMAP_ENTRY;
    status = fseek(f, (long)file_offset, SEEK_SET);
    if(status){
        perror("Failed to do fseek!");
        fclose(f);
        return (uint64_t)-1;
    }

    for(i=0; i < PAGEMAP_ENTRY; i++){
        int c = getc(f);
        if(c==EOF){
            fclose(f);
            return (uint64_t)-1;
        }
        if(is_bigendian())
            c_buf[i] = (unsigned char)c;
        else
            c_buf[PAGEMAP_ENTRY - i - 1] = (unsigned char)c;
    }
    fclose(f);

    read_val = 0;
    for(i=0; i < PAGEMAP_ENTRY; i++){
        read_val = (read_val << 8) + c_buf[i];
    }

    if(GET_BIT(read_val, 63)) {
        pfn = GET_PFN(read_val);
    } else {
        printf("Page not present !\n");
    }
    if(GET_BIT(read_val, 62))
        printf("Page swapped\n");

    if (pfn == 0)
        return (uint64_t)-1;
    return pfn * (uint64_t)pgsz;
}
/*
 * SIGBUS handler to display the given information.
 */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
    printf("Signal %d received: ", signum);
    printf("%s on vaddr: %p\n",
           (siginfo->si_code == 4? "BUS_MCEERR_AR":"BUS_MCEERR_AO"),
           siginfo->si_addr);
    if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
        fprintf(stderr, "Exit from the signal handler on BUS_MCEERR_AR\n");
        _exit(1);
    }
}
int main(int argc, char ** argv) {
    struct sigaction my_sigaction;
    uint64_t virt_addr = 0, phys_addr;
    void *local_pnt;

    // Need to have the CAP_SYS_ADMIN capability to get PFNs values in pagemap.
    if (getuid() != 0) {
        fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // attach our SIGBUS handler.
    memset(&my_sigaction, 0, sizeof(my_sigaction));
    my_sigaction.sa_sigaction = sigbus_action;
    my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
    sigemptyset(&my_sigaction.sa_mask);
    if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
        perror("Signal handler attach failed");
        exit(EXIT_FAILURE);
    }

    pgsz = sysconf(_SC_PAGESIZE);
    if (pgsz == -1) {
        perror("sysconf(_SC_PAGESIZE)");
        exit(EXIT_FAILURE);
    }

    early_reaction();

    // Allocate a private page.
    local_pnt = mmap(NULL, pgsz, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
    if (local_pnt == MAP_FAILED) {
        fprintf(stderr, "Memory Allocation failed !\n");
        exit(EXIT_FAILURE);
    }
    virt_addr = (uint64_t)local_pnt;

    // Dirty / map the page.
    sprintf((char *)local_pnt, "My page\n");

    phys_addr = physical_address(virt_addr);
    if (phys_addr == -1) {
        fprintf(stderr, "Virtual address translation 0x%llx failed\n",
                (unsigned long long)virt_addr);
        exit(EXIT_FAILURE);
    }
    printf("\nData pages at 0x%llx physically 0x%llx\n",
           (unsigned long long)virt_addr, (unsigned long long)phys_addr);
    fflush(stdout);

    printf("\nPress ENTER to continue\n");
    fgetc(stdin);

    // read the string at the beginning of page.
    printf("%s", (char *)local_pnt);

    phys_addr = physical_address(virt_addr);
    if (phys_addr == -1) {
        fprintf(stderr, "Virtual address translation 0x%llx failed\n",
                (unsigned long long)virt_addr);
    } else {
        printf("\nData pages at 0x%llx physically 0x%llx\n",
               (unsigned long long)virt_addr, (unsigned long long)phys_addr);
    }
    return 0;
}