From: William Roche <william.roche@oracle.com>
To: qemu-devel@nongnu.org, John Allen <john.allen@amd.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Joao Martins <joao.m.martins@oracle.com>
Cc: michael.roth@amd.com, babu.moger@amd.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net
Subject: Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
Date: Fri, 13 Oct 2023 17:41:00 +0200
Message-ID: <d492f17f-87a9-4714-aeb9-690d6b972803@oracle.com>
In-Reply-To: <60d3f74a-a1a8-4fed-a102-9985c47c69c8@amd.com>
Just a note to inform you that I've submitted a new patch on a
separate thread -- dealing with VM live migration after receiving
memory errors:
https://lore.kernel.org/qemu-devel/20231013150839.867164-3-william.roche@oracle.com/
This patch belongs to a two-patch series that should fix migration
when memory errors have been received and handled by the VM before the
migration request.
For the moment this other patch only fixes the ARM case, which ignores
SIGBUS/BUS_MCEERR_AO errors, but the same mechanism should be used on
AMD, which ignores SIGBUS/BUS_MCEERR_AO too, by passing the same new
parameter to the kvm_hwpoison_page_add function in kvm_arch_on_sigbus_vcpu:
kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
Of course we'll have to wait for the above patch to be integrated first.
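To make the idea concrete, here is a minimal standalone sketch, not the
actual QEMU code. It assumes the new boolean argument records whether the
guest was told about the poisoned page (true for BUS_MCEERR_AR, false for an
ignored BUS_MCEERR_AO). Every sketch_* name, the fixed-size table and the
pre-migration check are hypothetical and only illustrate one way such a flag
could be used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the SIGBUS si_code values (hypothetical names). */
enum sketch_mce_code { SKETCH_MCEERR_AR, SKETCH_MCEERR_AO };

#define SKETCH_MAX_POISON 16

/* One poisoned page, plus whether the guest was notified about it. */
struct sketch_poison_page {
    uint64_t ram_addr;
    bool guest_notified;
};

static struct sketch_poison_page sketch_pages[SKETCH_MAX_POISON];
static size_t sketch_page_count;

/* Record a poisoned page; returns false only if the table is full. */
bool sketch_poison_page_add(uint64_t ram_addr, bool guest_notified)
{
    assert(sketch_page_count <= SKETCH_MAX_POISON);
    for (size_t i = 0; i < sketch_page_count; i++) {
        if (sketch_pages[i].ram_addr == ram_addr) {
            /* A later AR error on a page first seen via AO marks it known. */
            if (guest_notified) {
                sketch_pages[i].guest_notified = true;
            }
            return true;
        }
    }
    if (sketch_page_count == SKETCH_MAX_POISON) {
        return false;
    }
    sketch_pages[sketch_page_count].ram_addr = ram_addr;
    sketch_pages[sketch_page_count].guest_notified = guest_notified;
    sketch_page_count++;
    return true;
}

/* Mirrors the call quoted above: only an AR error reaches the guest. */
bool sketch_on_sigbus(enum sketch_mce_code code, uint64_t ram_addr)
{
    return sketch_poison_page_add(ram_addr, code == SKETCH_MCEERR_AR);
}

/* True if some poisoned page is still unknown to the guest. */
bool sketch_migration_unsafe(void)
{
    for (size_t i = 0; i < sketch_page_count; i++) {
        if (!sketch_pages[i].guest_notified) {
            return true;
        }
    }
    return false;
}
```

With this sketch, an ignored AO error leaves sketch_migration_unsafe()
returning true until an AR error on the same page has been delivered to the
guest.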
HTH,
William.
On 9/19/23 00:00, William Roche wrote:
> Hi John,
>
> I'd like to put the emphasis on the fact that ignoring the SRAO error
> for a VM is a real problem at least for a specific (rare) case I'm
> currently working on: The VM migration.
>
> Context:
>
> - In the case of a poisoned page in the VM address space, the migration
> can't read it and will skip this page, considering it as a zero-filled
> page. The VM kernel (that handled the vMCE) would have marked its
> associated page as poisoned, and if the VM touches the page, the VM
> kernel generates the associated MCE because it already knows about the
> poisoned page.
>
> - When we ignore the vMCE in the case of a SIGBUS/BUS_MCEERR_AO error
> (what this patch does), we entirely rely on the Hypervisor to send an
> SRAR error to qemu when the page is touched: The AMD VM kernel will
> receive the SIGBUS/BUS_MCEERR_AR and deal with it, thanks to your
> changes here.
>
> So it looks like the mechanism works fine... unless the VM has migrated
> between the SRAO error and the first time it really touches the poisoned
> page to get an SRAR error! In this case, its new address space
> (created on the migration destination) will have a zero-page where we
> had a poisoned page, and the AMD VM Kernel (that never dealt with the
> SRAO) doesn't know about the poisoned page and will access the page
> finding only zeros... We have a memory corruption!
>
> It is a very rare window, but in order to fix it the most reasonable
> course of action would be to make the AMD emulation deal with SRAO
> errors, instead of ignoring them.
>
> Do you agree with my analysis?
> Would an AMD platform send an SRAO signal to a process
> (SIGBUS/BUS_MCEERR_AO) in case of a real hardware error?
>
> Thanks,
> William.
Thread overview: 15+ messages
2023-09-12 21:18 [PATCH v4 0/3] Fix MCE handling on AMD hosts John Allen
2023-09-12 21:18 ` [PATCH v4 1/3] i386: Fix MCE support for " John Allen
2023-09-12 21:18 ` [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest John Allen
2023-09-13 3:22 ` Gupta, Pankaj
2023-09-18 22:00 ` William Roche
2023-09-20 11:13 ` Joao Martins
2023-09-21 17:41 ` Yazen Ghannam
2023-09-22 8:36 ` William Roche
2023-09-22 14:30 ` Yazen Ghannam
2023-09-22 16:18 ` William Roche
2023-10-13 15:41 ` William Roche [this message]
2023-09-12 21:18 ` [PATCH v4 3/3] i386: Add support for SUCCOR feature John Allen
2024-02-07 11:21 ` [PATCH v4 0/3] Fix MCE handling on AMD hosts Joao Martins
2024-02-20 17:27 ` John Allen
2024-02-21 11:42 ` Joao Martins