From: William Roche <william.roche@oracle.com>
To: Yazen Ghannam <yazen.ghannam@amd.com>,
Joao Martins <joao.m.martins@oracle.com>,
John Allen <john.allen@amd.com>,
qemu-devel@nongnu.org
Cc: michael.roth@amd.com, babu.moger@amd.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net
Subject: Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
Date: Fri, 22 Sep 2023 18:18:39 +0200 [thread overview]
Message-ID: <cec7bf1f-3c43-2e68-ceee-a196d10093cd@oracle.com> (raw)
In-Reply-To: <60d3f74a-a1a8-4fed-a102-9985c47c69c8@amd.com>
On 9/22/23 16:30, Yazen Ghannam wrote:
> On 9/22/23 4:36 AM, William Roche wrote:
>> On 9/21/23 19:41, Yazen Ghannam wrote:
>>> [...]
>>> Also, during page migration, does the data flow through the CPU core?
>>> Sorry for the basic question. I haven't done a lot with virtualization.
>>
Yes, in most cases (with the exception of RDMA) the data flows through
the CPU cores, because migration checks whether the area to transfer
contains empty pages.
>>
>
> If the CPU moves the memory, then the data will pass through the core/L1
> caches, correct? If so, then this will result in a MCE/poison
> consumption/AR event in that core.
That's the entire point of the other patch I was referring to:
"Qemu crashes on VM migration after an handled memory error"
Here is a direct link:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg990803.html
The idea is to skip the pages we know are poisoned -- so we have a
chance to complete the migration without getting AR events :)
>
> So it seems to me that migration will always cause an AR event, and the
> gap you describe will not occur. Does this make sense? Sorry if I
> misunderstood.
>
> In general, the hardware is designed to detect and mark poison, and to
> not let poison escape a system undetected. In the strictest case, the
> hardware will perform a system reset if poison is leaving the system. In
> a more graceful case, the hardware will continue to pass the poison
> marker with the data, so the destination hardware will receive it. In
> both cases, the goal is to avoid silent data corruption, and to do so in
> the hardware, i.e. without relying on firmware or software management.
> The hardware designers are very keen on this point.
For the moment virtualization needs *several* enhancements just to deal
with memory errors -- what we are currently trying to fix is a good
example of that!
>
> BTW, the RDMA case will need further discussion. I *think* this would
> fall under the "strictest" case. And likely, CPU-based migration will
> also. But I think we can test this and find out. :)
The test has been done, and it showed that RDMA migration fails
when poison exists.
But we are discussing aspects that are probably too far from our main
topic here.
>
>>>
>>> Please note that current AMD systems use an internal poison marker on
>>> memory. This cannot be cleared through normal memory operations. The
>>> only exception, I think, is to use the CLZERO instruction. This will
>>> completely wipe a cacheline including metadata like poison, etc.
>>>
>>> So the hardware should not (by design) lose track of poisoned data.
>>
>> That would be better, but virtualization migration currently loses
>> track of this.
>> This is not a problem for VMs whose kernel took note of the poison
>> and keeps track of it, because that kernel will handle the poison
>> locations it knows about, signaling whenever these poisoned locations
>> are touched.
>>
>
> Can you please elaborate on this? I would expect the host kernel to do
> all the physical, including poison, memory management.
Yes, the host kernel does that, and the VM kernel does the same for its
own address space.
>
> Or do you mean in the nested poison case like this?
> 1) The host detects an "AO/deferred" error.
The host kernel is notified by the hardware of an SRAO/deferred error.
> 2) The host can try to recover the memory, if clean, etc.
From my understanding, this is an uncorrectable error. In the standard
case the kernel can't "clean" the error, but it keeps track of it and
signals the user of the impacted memory page every time it's needed.
> 3) Otherwise, the host passes the error info, with "AO/deferred" severity
> to the guest.
Yes: in the case of an impacted guest VM, qemu asks to be informed of AO
events, so the host kernel signals them to qemu. Qemu then relays the
information (creating a virtual MCE event) that the VM kernel receives
and deals with.
> 4) The guest, in nested fashion, can try to recover the memory, if
> clean, etc. Or signal its own processes with the AO SIGBUS.
Here again there is no recovery: the VM kernel does the same thing as
the host kernel: memory management, possible signals, etc.
>>> An enhancement will be to take the MCA error information collected
>>> during the interrupt and extract useful data. For example, we'll need to
>>> translate the reported address to a system physical address that can be
>>> mapped to a page.
>>
>> This would be great, as it would mean that a kernel running in a VM can
>> get notified too.
>>
>
> Yes, I agree.
>
>>>
>>> Once we have the page, then we can decide how we want to signal the
>>> process(es). We could get a deferred/AO error in the host, and signal the
>>> guest with an AR. So the guest handling could be the same in both cases.
>>>
>>> Would this be okay? Or is it important that the guest can distinguish
>>> between the AO/AR cases?
>>
>>
>> SIGBUS/BUS_MCEERR_AO and BUS_MCEERR_AR are not interchangeable; it is
>> important to distinguish them.
>> AO is an asynchronous signal that is only generated when the process
>> asked for it -- indicating that an error has been detected in its
>> address space but the location hasn't been touched yet.
>> Most processes don't care about that (and don't get notified);
>> they just continue to run, and if the poisoned area is never touched,
>> great.
>> Otherwise a BUS_MCEERR_AR signal is generated when the area is touched,
>> indicating that the execution thread can't access the location.
>>
>
> Yes, understood.
>
>>
>>> IOW, will guests have their own policies on
>>> when to take action? Or is it more about allowing the guest to handle
>>> the error less urgently?
>>
>> Yes to both questions. Any process can indicate whether it wants to
>> be "early killed on MCE" or not. See the proc(5) man page about
>> /proc/sys/vm/memory_failure_early_kill, and prctl(2) about
>> PR_MCE_KILL/PR_MCE_KILL_GET. Such a process can take action before
>> it's too late and it actually needs the poisoned data.
>>
>
> Yes, agree. I think the "nested" case above would fall under this. Also,
> an application, or software stack, with complex memory management could
> benefit.
Sure -- some databases already take advantage of this mechanism,
for example ;)
>> In other words, having the AMD kernel generate SIGBUS/BUS_MCEERR_AO
>> signals and making qemu on AMD able to relay them to the VM kernel
>> would make things better for AMD platforms ;)
>>
>
> Yes, I agree. :)
So in my opinion, for the moment we should integrate the 3 proposed
patches, and continue working to make:
- the AMD kernel deal better with SRAO, on both the host
  and the VM sides,
- a related qemu enhancement relay the BUS_MCEERR_AO signal
  so that the VM kernel deals with it too.
The reason why I started this conversation was to find out whether there
is a simple way to already inform the VM kernel of an AO signal (without
crashing it), even if it is not yet able to relay the event to its own
processes. This would prepare qemu so that when the kernel is
enhanced, it may not be necessary to modify qemu again.
The patches we are currently focusing on ("Fix MCE handling on AMD hosts")
help to better deal with the BUS_MCEERR_AR signal instead of crashing --
this looks like a necessary step to me.
HTH,
William.