Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

public inbox for linux-edac@vger.kernel.org

From: William Roche <william.roche@oracle.com>
To: Borislav Petkov <bp@alien8.de>, yazen.ghannam@amd.com
Cc: tony.luck@intel.com, tglx@kernel.org, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	John.Allen@amd.com, jane.chu@oracle.com
Subject: Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
Date: Thu, 12 Mar 2026 23:44:04 +0100
Message-ID: <ff7008cd-db47-4a0a-a74f-dcf3e6ab1891@oracle.com>
In-Reply-To: <20260312160453.GDabLkJfhslCLXZntv@fat_crate.local>

[-- Attachment #1: Type: text/plain, Size: 7495 bytes --]

Thank you for taking the time to explain your concerns about the context
of this fix. I hope my feedback helps to convince you.

On 3/12/26 17:04, Borislav Petkov wrote:
> On Thu, Mar 12, 2026 at 04:11:10PM +0100, William Roche wrote:
>>  From the kernel point of view (regardless if it is running on bare metal or
>> in a VM), access to these registers registers is provided by the platform:
>> either the Hardware or the emulation framework.
> 
> Except the emulation doesn't emulate the platform properly. We test on real
> hw. If your hypervisor doesn't do that properly then that's not really
> upstream kernel's problem.

There are several aspects worth considering here.
First, I totally agree that the emulation has to emulate properly! :)

The question we are facing is how a non-SMCA platform reacts to a write
to an SMCA-specific register, and whether the reaction of a QEMU/KVM VM,
acting as a non-SMCA machine, is a valid case.

In the VM case, MSR emulation is done by KVM, which does not implement
"permissive" access to unimplemented registers. I also agreed with you
when you said that it is working as advertised.
Now, if emulating an AMD platform required providing "permissive" access
to a specific set of registers, the fix would not be strictly necessary.
But I may have missed a specification about that. And if such a
specification exists, it would also be the responsibility of all kernels
(including upstream) to take it into account.

Yazen may be able to help us on this aspect: could you please let us know
if there is an AMD specification for accessing SMCA registers on non-SMCA
machines?

Now, if there were a valid case of existing non-SMCA AMD hardware that
could crash when updating an SMCA register, the fix would be needed
beyond the VM case.

Yazen, could you also please tell us whether any existing non-SMCA AMD
hardware could crash when updating an SMCA register?

Commit 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling"),
written by Yazen, introduced an upstream kernel problem on non-SMCA
platforms that was revealed by the emulation framework on AMD. That is
why I think it should be fixed upstream too, and Yazen himself agrees
with that.

> 
>> Errors are injected into VMs by the hypervisor when real memory hardware
>> errors occur on the system that impact the VM address space.
> 
> And?

The injected error is received by the VM kernel, which has to deal with it.

> Why?

The VM kernel executes the same mechanisms used on bare metal in that case.
As Tony said on Feb 9: The guest may be able to just kill a process and 
keep running.

> 
> What's the recovery action scenario for having errors injected into guests?

Just the same as running on real HW.

> Where is that documented? Why does the upstream kernel need to care?

Sorry, I don't have a pointer to kernel documentation about that, but the
MCE relay mechanism is certainly a hypervisor feature.

> 
> Basically I'm asking you for the use case in order to determine whether that
> use case is valid for the *upstream* kernel to support.

Yes, of course, see below.

> 
>> This is not only a test, this is real life mechanism. With the fix
>> 7cb735d7c0cb that has been integrated, VMs kernel running on AMD now crashes
>> on Deferred errors, where it used to be able to deal with them before this
>> commit.
> 
> Because we don't know of your use case. So when we do upstream development how
> can we test your case?
> 

I have a procedure to verify the behavior: it consists of running the
upstream kernel in a VM (on an AMD platform) and injecting a memory
error from the hardware platform into this VM, to mimic a real hardware
error being reported to the platform kernel.

To do so:
Run QEMU as root (to help with the address translation).
The VM runs the upstream kernel.
Run the small attached program in the VM as root, so that it prints the
guest physical address of one of its mapped memory pages.

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok

Data pages at 0xXXXXXXX  physically 0xYYYYY000

-> DON'T press ENTER!   (just leave the process waiting here)

Ask the emulator (QEMU in this case) to give the host physical address 
of the guest physical page:
  (qemu) gpa2hpa 0xYYYYY000
  Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000

From the host physical address, get the PFN to poison by removing the
last 3 hex digits of the address (the page offset, which is zero here).
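
The address-to-PFN step can be sketched in C as follows. This is only an
illustration, assuming 4 KiB pages (a 12-bit page offset, i.e. the last
3 hex digits); the helper name hpa_to_pfn and the PAGE_SHIFT_4K constant
are hypothetical and are not part of the attached program.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, so the page offset is the low 12 bits. */
#define PAGE_SHIFT_4K 12

/* Hypothetical helper: host physical address -> page frame number. */
uint64_t hpa_to_pfn(uint64_t hpa)
{
    return hpa >> PAGE_SHIFT_4K;
}
```

With the values of the run shown further down, hpa_to_pfn(0x237129000)
gives 0x237129, which is the value written to corrupt-pfn.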

On the host, load the hwpoison_inject kernel module:
[root@host]# modprobe hwpoison_inject

and inject an error into the targeted PFN:
[root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn

Then wait until the generated asynchronous error reaches the VM (it can
take up to 5 minutes with AMD virtualization) and watch the VM kernel
deal with it.

Without this suggested fix, the VM kernel panics, with the stack trace I 
gave:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

    amd_clear_bank+0x6e/0x70
    machine_check_poll+0x228/0x2e0
    ? __pfx_mce_timer_fn+0x10/0x10
    mce_timer_fn+0xb1/0x130
    ? __pfx_mce_timer_fn+0x10/0x10
    call_timer_fn+0x26/0x120
    __run_timers+0x202/0x290
    run_timer_softirq+0x49/0x100
    handle_softirqs+0xeb/0x2c0
    __irq_exit_rcu+0xda/0x100
    sysvec_apic_timer_interrupt+0x71/0x90
[...]
   Kernel panic - not syncing: MCA architectural violation!
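
For readers decoding the faulting MSR address in the trace above, here
is a small sketch. It assumes the SMCA MSR layout given by the Linux
definitions MSR_AMD64_SMCA_MC0_CTL (0xc0002000) and
MSR_AMD64_SMCA_MCx_DESTAT(x) (0xc0002008 + 0x10 * x); the macro and
function names below are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/*
 * Assumed layout, taken from the Linux MSR_AMD64_SMCA_* definitions:
 * bank 0 starts at 0xc0002000 and each bank owns 0x10 consecutive MSRs.
 */
#define SMCA_MSR_BASE      0xc0002000u
#define SMCA_MSRS_PER_BANK 0x10u
#define SMCA_DESTAT_OFFSET 0x8u   /* MCA_DESTAT within a bank's block */

/* Hypothetical helper: bank number an SMCA MSR address belongs to. */
unsigned int smca_msr_bank(uint32_t msr)
{
    return (msr - SMCA_MSR_BASE) / SMCA_MSRS_PER_BANK;
}

/* Hypothetical helper: register offset inside that bank's block. */
unsigned int smca_msr_offset(uint32_t msr)
{
    return (msr - SMCA_MSR_BASE) % SMCA_MSRS_PER_BANK;
}
```

Under these assumptions, 0xc0002098 decodes to bank 9, offset 0x8, that
is the deferred error status register (MCA_DESTAT) of bank 9, which is
consistent with the write being issued from amd_clear_bank().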

With the fix, the VM kernel deals with the error:

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok
Data pages at 0x7fa0f9b25000 physically 0x172929000

(qemu) gpa2hpa 0x172929000
Host physical address for 0x172929000 (pc.ram) is 0x237129000

-> Injecting the error with:
[root@host]# echo 0x237129 >  /sys/kernel/debug/hwpoison/corrupt-pfn

-> The VM monitor indicates:
qemu-kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f3ae2729000 and GUEST addr 0x172929000 of type BUS_MCEERR_AO injected

-> A few minutes later, the VM console shows:
localhost login: [  332.973864] mce: [Hardware Error]: Machine check events logged
[  332.976795] Memory failure: 0x172929: Sending SIGBUS to mce_process_rea:5607 due to hardware memory corruption
[  332.977832] Memory failure: 0x172929: recovery action for dirty LRU page: Recovered
[  355.056785] MCE: Killing mce_process_rea:5607 due to hardware memory corruption fault at 0x7fa0f9b25000

-> The process shows:
Signal 7 received: BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
Signal 7 received: BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
Exit from the signal handler on BUS_MCEERR_AR

-> Works as expected: the AO error is relayed by the VM kernel to the
running application.

> Before that, is that case even worth testing?

If we accept that relayed MCEs are supported by the upstream kernel
running in the VM, then yes.

> 
> I hope I'm making sense here. The MCA and other low-level hw code works on
> baremetal as that's its main target. If it is supposed to work in VMs, then
> there better be a proper use case which we are willing to support and we can
> *actually* *test*.

The procedure detailed above may help with this aspect, even if it is
virtualization-oriented, as I do hope that the upstream kernel supports
memory error handling in a VM.

But Yazen's answers about non-SMCA hardware can also help to decide what 
to do with this fix.

> 
> If not, you can keep this "fix" in your guest kernels and everyone's happy.
> 
> Thx.

I hope my explanations help to clarify the context.

Thanks,
William.

[-- Attachment #2: mce_process_react.c --]
[-- Type: text/x-csrc, Size: 4517 bytes --]

#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>

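#define PAGEMAP_ENTRY 8   /* size in bytes of one pagemap entry */
/* Macro arguments and results are parenthesized to be safe in any expression. */
#define GET_BIT(X,Y) (((X) & ((uint64_t)1 << (Y))) >> (Y))
#define GET_PFN(X) ((X) & 0x7FFFFFFFFFFFFFULL)   /* bits 0-54 hold the PFN */

/* The first byte of the integer 1 is 0 only on a big-endian machine. */
static const int endian_probe = 1;
#define is_bigendian() ( (*(const char*)&endian_probe) == 0 )
static long pgsz;   /* system page size, set in main() */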

/*
 * Set the early kill mode reaction state to MCE error.
 */
static void early_reaction() {
   printf("Setting Early kill... ");
   if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) == 0)
      printf("Ok\n");
   else
      printf("Failure !\n");
}

/*
 * Return the physical address associated to a given local virtual address,
 * or -1 in case of an error.
 */
static uint64_t physical_address(uint64_t virt_addr) {
   char path_buf [0x100];
   FILE * f;
   uint64_t read_val, file_offset, pfn = 0;
   unsigned char c_buf[PAGEMAP_ENTRY];
   pid_t my_pid = getpid();
   int status, i;

   snprintf(path_buf, sizeof(path_buf), "/proc/%d/pagemap", (int)my_pid);

   f = fopen(path_buf, "rb");
   if(!f){
      printf("Error! Cannot open %s\n", path_buf);
      return (uint64_t)-1;
   }

   file_offset = virt_addr / (uint64_t)pgsz * PAGEMAP_ENTRY;
   status = fseek(f, (long)file_offset, SEEK_SET);
   if(status){
      perror("Failed to do fseek!");
      fclose(f);
      return (uint64_t)-1;
   }

   for(i=0; i < PAGEMAP_ENTRY; i++){
      int c = getc(f);
      if(c==EOF){
         fclose(f);
         return (uint64_t)-1;
      }
      if(is_bigendian())
           c_buf[i] = (unsigned char)c;
      else
           c_buf[PAGEMAP_ENTRY - i - 1] = (unsigned char)c;
   }
   fclose(f);

   read_val = 0;
   for(i=0; i < PAGEMAP_ENTRY; i++){
      read_val = (read_val << 8) + c_buf[i];
   }

   if(GET_BIT(read_val, 63)) {
      pfn = GET_PFN(read_val);
   } else {
      printf("Page not present !\n");
   }
   if(GET_BIT(read_val, 62))
      printf("Page swapped\n");

   if (pfn == 0)
      return (uint64_t)-1;

   return pfn * (uint64_t)pgsz;
}

/*
 * SIGBUS handler to display the given information.
 */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
   (void)ctx;   /* unused */

   /* printf() is not async-signal-safe; tolerated in this test program. */
   printf("Signal %d received: ", signum);
   printf("%s on vaddr: %p\n",
      (siginfo->si_code == BUS_MCEERR_AR ? "BUS_MCEERR_AR" : "BUS_MCEERR_AO"),
      siginfo->si_addr);

   if (siginfo->si_code == BUS_MCEERR_AR) {
      fprintf(stderr, "Exit from the signal handler on BUS_MCEERR_AR\n");
      _exit(1);
   }
}

int main(int argc, char ** argv) {
   struct sigaction my_sigaction;
   uint64_t virt_addr = 0, phys_addr;
   void *local_pnt;

   // Need to have the CAP_SYS_ADMIN capability to get PFNs values in pagemap.
   if (getuid() != 0) {
      fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
      exit(EXIT_FAILURE);
   }

   // attach our SIGBUS handler.
   memset(&my_sigaction, 0, sizeof(my_sigaction));
   my_sigaction.sa_sigaction = sigbus_action;
   my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
   sigemptyset(&my_sigaction.sa_mask);
   if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
      perror("Signal handler attach failed");
      exit(EXIT_FAILURE);
   }

   pgsz = sysconf(_SC_PAGESIZE);
   if (pgsz == -1) {
      perror("sysconf(_SC_PAGESIZE)");
      exit(EXIT_FAILURE);
   }
   early_reaction();

   // Allocate a private page.
   local_pnt = mmap(NULL, pgsz, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
   if (local_pnt == MAP_FAILED) {
      fprintf(stderr, "Memory Allocation failed !\n");
      exit(EXIT_FAILURE);
   }
   virt_addr = (uint64_t)local_pnt;

   // Dirty / map the page.
   sprintf((char *)local_pnt, "My page\n");

   phys_addr = physical_address(virt_addr);
   if (phys_addr == (uint64_t)-1) {
      fprintf(stderr, "Virtual address translation 0x%llx failed\n", 
         (unsigned long long)virt_addr);
      exit(EXIT_FAILURE);
   }
   printf("\nData pages at 0x%llx  physically 0x%llx\n",
      (unsigned long long)virt_addr, (unsigned long long)phys_addr);
   fflush(stdout);

   printf("\nPress ENTER to continue\n");
   fgetc(stdin);

   // read the string at the beginning of page.
   printf("%s", (char *)local_pnt);

   phys_addr = physical_address(virt_addr);
   if (phys_addr == (uint64_t)-1) {
      fprintf(stderr, "Virtual address translation 0x%llx failed\n", 
         (unsigned long long)virt_addr);
   } else {
      printf("\nData pages at 0x%llx  physically 0x%llx\n",
         (unsigned long long)virt_addr, (unsigned long long)phys_addr);
   }

   return 0;
}
