Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

From: William Roche <william.roche@oracle.com>
To: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Borislav Petkov <bp@alien8.de>,
	tony.luck@intel.com, tglx@kernel.org, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	John.Allen@amd.com, jane.chu@oracle.com
Subject: Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling
Date: Mon, 16 Mar 2026 16:26:11 +0100
Message-ID: <1221e4c3-31f9-49ca-b50f-e79d37448d4e@oracle.com>
In-Reply-To: <20260313202618.GA221731@yaz-khff2.amd.com>

On 3/13/26 21:26, Yazen Ghannam wrote:
> On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote:
> 
> [...]
> 
>>
>> Yazen may help us on this aspect: Could you please let us know if there is
>> an AMD specification for accessing SMCA registers on non SMCA machines ?
>>
>>
>> Now if we had a valid case of an existing non-SMCA AMD hardware that could
>> crash on updating an SMCA register, the fix would be needed not only for the
>> VM case.
>>
>> Yazen, could you also please tell us if an existing non-SMCA AMD hardware
>> could crash on updating an SMCA register ?
>>
> 
> All the systems I have access to are Zen systems, and all Zen systems
> are SMCA systems. I'll try to find an older system to test (Bulldozer,
> etc.).

I don't think that is needed anymore if bare metal never handles AO 
errors this way (as discussed below).
It looks to me like the QEMU/KVM VM case could be a specific case, 
exposed by your new change.

> 
> [...]
> 
>>
>> I have a procedure to verify the behavior: It consists of running the
>> upstream kernel in a VM (on an AMD platform) and injecting a memory error
>> from the hardware platform to this VM to mimic a real hardware error being
>> reported to the platform Kernel.
>>
>> To do so:
>> Run Qemu as root (to help with the address translation).
>> The VM runs the upstream kernel.
>> Run the small attached program in the VM as root, so that it gives a guest
>> physical address of one of its mapped memory pages.
>>
>> [root@VM]# ./mce_process_react_x86
>> Setting Early kill... Ok
>>
>> Data pages at 0xXXXXXXX  physically 0xYYYYY000
>>
>> -> DON'T Press enter !   (just leave the process wait here)
>>
>> Ask the emulator (QEMU in this case) to give the host physical address of
>> the guest physical page:
>>   (qemu) gpa2hpa 0xYYYYY000
>>   Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000
>>
>> From the host physical address, get the pfn value to poison (drop the
>> last 3 hex zeros of the address).
>>
>> On the host, use hwpoison kernel module:
>> [root@host]# modprobe hwpoison_inject
>>
>> and inject an error to the targeted pfn:
>> [root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>>
>> Then wait until the Asynchronous error generated reaches the VM (it can take
>> up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.
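
For illustration, the address-to-pfn step above can be sketched as a shift. This is a minimal sketch assuming 4 KiB base pages (so the three trailing hex zeros are the 12-bit page offset); the macro and the helper name are hypothetical and not part of the procedure's tooling.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, i.e. a 12-bit page offset (3 hex digits). */
#define EXAMPLE_PAGE_SHIFT 12u

/*
 * Hypothetical helper: turn the host physical address printed by
 * "gpa2hpa" into the pfn written to hwpoison's corrupt-pfn file.
 */
uint64_t hpa_to_pfn(uint64_t hpa)
{
    return hpa >> EXAMPLE_PAGE_SHIFT;
}
```

With this, a host physical address of the form 0xPFN000 yields 0xPFN, which matches removing the last 3 zeros by hand.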
> 
> ...hint for below question.
> 
>>
>> Without this suggested fix, the VM kernel panics, with the stack trace I
>> gave:
>>
>> mce: MSR access error: WRMSR to 0xc0002098 (tried to write
>> 0x0000000000000000)
>> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
>>
>>     amd_clear_bank+0x6e/0x70
>>     machine_check_poll+0x228/0x2e0
>>     ? __pfx_mce_timer_fn+0x10/0x10
>>     mce_timer_fn+0xb1/0x130
>>     ? __pfx_mce_timer_fn+0x10/0x10
>>     call_timer_fn+0x26/0x120
>>     __run_timers+0x202/0x290
>>     run_timer_softirq+0x49/0x100
>>     handle_softirqs+0xeb/0x2c0
>>     __irq_exit_rcu+0xda/0x100
>>     sysvec_apic_timer_interrupt+0x71/0x90
>> [...]
>>    Kernel panic - not syncing: MCA architectural violation!
> 
> The code flow indicates that a Deferred error was found by MCA polling.

This is right.
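
As a cross-check, the MSR in the trace can be decoded by hand. The sketch below assumes the SMCA register layout from Linux's msr-index.h as I understand it (bank 0 starting at 0xc0002000, 0x10 registers per bank, MCA_DESTAT at offset 0x8); the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout, following MSR_AMD64_SMCA_MC0_CTL and friends in Linux. */
#define EX_SMCA_MSR_BASE      0xc0002000u /* bank 0, MCA_CTL */
#define EX_SMCA_REGS_PER_BANK 0x10u       /* MSR stride between banks */
#define EX_SMCA_DESTAT_OFFSET 0x8u        /* MCA_DESTAT within a bank */

struct ex_smca_msr {
    uint32_t bank;   /* MCA bank number */
    uint32_t offset; /* register offset inside that bank */
};

/* Hypothetical helper: split an SMCA MSR address into bank and offset. */
struct ex_smca_msr ex_smca_decode(uint32_t msr)
{
    uint32_t rel = msr - EX_SMCA_MSR_BASE;
    struct ex_smca_msr r = {
        rel / EX_SMCA_REGS_PER_BANK,
        rel % EX_SMCA_REGS_PER_BANK,
    };
    return r;
}
```

Under these assumptions, 0xc0002098 decodes to bank 9, offset 0x8, that is the deferred-error status register of bank 9, which fits a deferred error being cleared from the polling path.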

> 
> I thought QEMU injects a #MC into the guest?

The way AO error handling has been integrated into QEMU/KVM for the AMD 
VM case relies on machine_check_poll().

> 
> William, do you encounter the issue if you disable MCA polling in the
> guest?

If I disable machine check polling (for example with the mce=ignore_ce 
kernel option), the AO error is no longer seen in the VM, and of course 
we don't crash because of it.

> 
> To my knowledge, Deferred errors are reported starting with Zen/SMCA
> systems, even though the concept is found in older documentation. This
> is another reason for the implicit handling.
> 
> I see in QEMU we set the DEFERRED status bit for BUS_MCEERR_AO errors. I
> don't recall why we did that. I'll need to review the old threads.
> 
> I feel like the intent was to select bits to produce the desired outcome
> rather than faithfully replicate hardware behavior. Specifically, the
> DEFERRED status bit would prevent CE filtering condition in
> do_machine_check(). And it would trigger the AO flow in the guest rather
> than the AR flow if we set the UC status bit.
> 
> Another example is we use the POISON status bit so the address is marked
> as "usable". A real DEFERRED error would never have the POISON status
> bit; they are mutually exclusive by definition.

That's the QEMU/KVM choice that was made about 2 years ago, and 
explained in the following comment of the *QEMU* fix:
     4b77512b2782 i386: Fix MCE support for AMD hosts
target/i386/kvm/kvm.c  function kvm_mce_inject():

      /* Setting the POISON bit for deferred errors indicates to the
       * guest kernel that the address provided by the MCE is valid
       * and usable which will ensure that the guest kernel will send
       * a SIGBUS_AO signal to the guest process. This allows for
       * more desirable behavior in the case that the guest process
       * with poisoned memory has set the MCE_KILL_EARLY prctl flag
       * which indicates that the process would prefer to handle or
       * shutdown due to the poisoned memory condition before the
       * memory has been accessed.
       *
       * While the POISON bit would not be set in a deferred error
       * sent from hardware, the bit is not meaningful for deferred
       * errors and can be reused in this scenario.
       */
       status |= MCI_STATUS_DEFERRED | MCI_STATUS_POISON;
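
To make the bit selection concrete, here is a minimal standalone sketch of what that line does to the status word. The bit positions (44 for DEFERRED, 43 for POISON) are what I believe both Linux and QEMU define; the EX_ prefixed macros and the helper names are hypothetical, and this is not the QEMU code itself.

```c
#include <stdint.h>

/* Assumed to mirror MCI_STATUS_DEFERRED and MCI_STATUS_POISON. */
#define EX_MCI_STATUS_DEFERRED (1ULL << 44)
#define EX_MCI_STATUS_POISON   (1ULL << 43)

/* Hypothetical helper: mark a status word the way the AO injection does. */
uint64_t ex_mark_ao_status(uint64_t status)
{
    return status | EX_MCI_STATUS_DEFERRED | EX_MCI_STATUS_POISON;
}

/* Hypothetical helper: true if both bits are set, a combination that
 * real hardware would not report according to the discussion above. */
int ex_is_deferred_and_poison(uint64_t status)
{
    uint64_t both = EX_MCI_STATUS_DEFERRED | EX_MCI_STATUS_POISON;
    return (status & both) == both;
}
```

A guest seeing both bits at once can therefore tell, under these assumptions, that the record follows the QEMU convention rather than a hardware-generated deferred error.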

> 
> But there may be another hidden issue: handling the error through
> polling rather than #MC. I'm thinking this isn't intentional, and the
> recent Linux changes exposed this behavior.

You are right about "recent Linux changes exposed this behavior", but 
handling AO this way was intentional.

With the suggested fix, we should cover this newly exposed failure case.

Now, if there is a better way to deal with AO error handling on AMD VMs, 
it could be the subject of a separate thread (probably a QEMU thread).
Our currently suggested kernel fix would still be valid, even if the 
code may not be exercised in the bare-metal case.

> 
> Thanks,
> Yazen


Thank you very much for your help, Yazen!

Cheers,
William.
