qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "“William Roche" <william.roche@oracle.com>
To: pbonzini@redhat.com, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, marcandre.lureau@redhat.com,
	berrange@redhat.com, thuth@redhat.com,
	richard.henderson@linaro.org, peter.maydell@linaro.org,
	mtosatti@redhat.com, qemu-devel@nongnu.org
Cc: kvm@vger.kernel.org, qemu-arm@nongnu.org,
	william.roche@oracle.com, joao.m.martins@oracle.com
Subject: [RFC RESEND 0/6] hugetlbfs largepage RAS project
Date: Tue, 10 Sep 2024 10:02:10 +0000	[thread overview]
Message-ID: <20240910100216.2744078-1-william.roche@oracle.com> (raw)
In-Reply-To: <20240910090747.2741475-1-william.roche@oracle.com>

From: William Roche <william.roche@oracle.com>


Apologies for the noise; resending as I missed CC'ing the maintainers of the
changed files


Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW error
is reported to Qemu, to transform a hugetlbfs large page into a set of
standard sized pages. The failed large page is unmapped and a set of
standard sized pages are mapped in place.
This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received
by qemu and the reported location corresponds to a large page.

This gives the possibility to:
- Take advantage of newer hypervisor kernel providing a way to retrieve
still valid data on the impacted hugetlbfs poisoned large page.
If the backend file is MAP_SHARED, we can copy the valid data into the
set of standard sized pages. But if an error is returned when accessing
a location we consider it poisoned and mark the corresponding standard sized
memory page as poisoned with a MADV_HWPOISON madvise call. Hence, the VM
can also continue to use the possible valid pieces of information retrieved.
- Adjust the poison address information. When accessing a poison location,
an older Kernel version may only provide the address of the beginning of
the poisoned large page in the associated SIGBUS siginfo data. Pointing to
a more accurate touched poison location allows the VM kernel to trigger
the right memory error reaction.

A warning is given for hugetlbfs backed memory-regions that are mapped
without the 'share=on' option.
(This warning is also given when using the deprecated "-mem-path" option)

The hugetlbfs memory mapping option should look like that
(with XXX replaced with the actual size):
  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the specific
code for this feature. It's only compiled on Linux versions.

Note that we have to be able to mark as "poison" a replacing valid standard
sized page. We currently do that calling madvise(..., MADV_HWPOISON).
But this requires qemu process to have CAP_SYS_ADMIN priviledge.
Using userfaultfd instead of madvise() to mark the pages as poison could
remove this constraint, and complicating the code adding thread(s) dealing
with the user page faults service.


It's also worth mentioning the IO memory, vfio configured memory buffers
case. The Qemu memory remapping (if it succeeds) will not reconfigure any
device IO buffers locations (no dma unmap/remap is performed) and if an
hardware IO is supposed to access (read or write) a poisoned hugetlbfs
page, I would expect it to fail the same way as before (as its location
hasn't been updated to take into account the new mapping).
But can someone confirm this possible behavior ? Or indicate me what should
be done to deal with this type of memory buffers ?

Details:
--------
The following problems had to be considered:

. kvm dealing with memory faults:
 - Address space mapping changes can't be handled in a signal handler (mmap
   is not async signal safe for example)
     We have a separate listener thread (only created when we use hugetlbfs)
     to deal with the mapping changes.
 - If a memory is not mapped when accessed, kvm fails with
   (exit_reason: KVM_EXIT_UNKNOWN)
     To avoid that, I needed to prevent the access to a changing memory
     region: pausing the VM is used to do so.
 - A fault on a poisoned hugetlbfs large page will report a hardcoded page
   size of 4k (See kernel kvm_send_hwpoison_signal() function)
     When a SIGBUS is received with a page size indication of 4k we have to
     verify if the impacted page is not a hugetlbfs page.
 - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page size,
   but the current Qemu version needs to take the information into account.

. system/physmem needed fixes:
 - When recreating the memory mapping on VM reset, we have to consider the
   memory size impacted.
 - In the case of a mapped file, punching a hole is necessary to clean the
   poison.

. Implementation details:
 - SIGBUS signal received for a large page will trigger the page modification,
   but in order to pause the VM, the signal handers have to terminate.
     So we return from the SIGBUS signal handler(s) when a VM has to be stopped.
     A memory access that generated a SIGBUS/BUS_MCEERR_AR signals before the
     VM pause, will be repeated when the VM resumes. If the memory is still
     not accessible (poisoned) the signal will be generated again by the
     hypervisor kernel.
     In the case of an asyncrounous SIGBUS/BUS_MCEERR_AO signal, the signal is
     not repeated by the kernel and will be recorded by qemu in order to be
     replayed when the VM resumes.
 - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when
   called. The listener thread taking care of the memory modification needs
   to deal with this case. To do so, it sets a thread specific variable
   that is recognized by the sigbus handler.


Some questions:
---------------
. Should we take extra care for IO memory, vfio configured memory buffers ?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is only
  compiled on linux versions
  Should we have a configure option to prevent the introduction of this
  feature in the code (turning off CONFIG_HUGETLBFS_RAS) ?

. Should I include the content of my system/hugetlbfs_ras.[ch] files into
  another existing file ?

. Should we force 'sharing' when using "-mem-path" option, instead of the
  -object memory-backend-file,share=on,... ?


This prototype is scripts/checkpatch.pl clean (except for the MAINTAINERS
update for the 2 added files).
'make check' runs fine on both x86 and ARM
Units tests have been done on Intel, AMD and ARM platforms.



William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h

-- 
2.43.5


  parent reply	other threads:[~2024-09-10 10:03 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-10  9:07 [RFC 0/6] hugetlbfs largepage RAS project “William Roche
2024-09-10  9:07 ` [RFC 1/6] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-09-10  9:07 ` [RFC 2/6] accel/kvm: Keep track of the HWPoisonPage sizes “William Roche
2024-09-10  9:07 ` [RFC 3/6] system/physmem: Remap memory pages on reset based on the page size “William Roche
2024-09-10  9:07 ` [RFC 4/6] system: Introducing hugetlbfs largepage RAS feature “William Roche
2024-09-10  9:07 ` [RFC 5/6] system/hugetlb_ras: Handle madvise SIGBUS signal on listener “William Roche
2024-09-10  9:07 ` [RFC 6/6] system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume “William Roche
2024-09-10 10:02 ` “William Roche [this message]
2024-09-10 10:02   ` [RFC RESEND 1/6] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-09-10 10:02   ` [RFC RESEND 2/6] accel/kvm: Keep track of the HWPoisonPage sizes “William Roche
2024-09-10 10:02   ` [RFC RESEND 3/6] system/physmem: Remap memory pages on reset based on the page size “William Roche
2024-09-10 10:02   ` [RFC RESEND 4/6] system: Introducing hugetlbfs largepage RAS feature “William Roche
2024-09-10 10:02   ` [RFC RESEND 5/6] system/hugetlb_ras: Handle madvise SIGBUS signal on listener “William Roche
2024-09-10 10:02   ` [RFC RESEND 6/6] system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume “William Roche
2024-09-10 11:36   ` [RFC RESEND 0/6] hugetlbfs largepage RAS project David Hildenbrand
2024-09-10 16:24     ` William Roche
2024-09-11 22:07       ` David Hildenbrand
2024-09-12 17:07         ` William Roche
2024-09-19 16:52           ` William Roche
2024-10-09 15:45             ` Peter Xu
2024-10-10 20:35               ` William Roche
2024-10-22 21:34               ` [PATCH v1 0/4] hugetlbfs memory HW error fixes “William Roche
2024-10-22 21:35                 ` [PATCH v1 1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-10-22 21:35                 ` [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size “William Roche
2024-10-23  7:28                   ` David Hildenbrand
2024-10-25 23:27                     ` William Roche
2024-10-28 16:42                       ` David Hildenbrand
2024-10-30  1:56                         ` William Roche
2024-11-04 14:10                           ` David Hildenbrand
2024-10-25 23:30                     ` William Roche
2024-10-22 21:35                 ` [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages “William Roche
2024-10-23  7:30                   ` David Hildenbrand
2024-10-25 23:27                     ` William Roche
2024-10-28 17:01                       ` David Hildenbrand
2024-10-30  1:56                         ` William Roche
2024-11-04 13:30                           ` David Hildenbrand
2024-11-07 10:21                             ` [PATCH v2 0/7] hugetlbfs memory HW error fixes “William Roche
2024-11-07 10:21                               ` [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size “William Roche
2024-11-12 10:30                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 21:35                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2024-11-12 11:07                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:06                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-11-12 11:13                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:22                                     ` David Hildenbrand
2024-11-15 21:03                                       ` William Roche
2024-11-18  9:45                                         ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-11-07 10:21                               ` [PATCH v2 5/7] hostmem: Factor out applying settings “William Roche
2024-11-07 10:21                               ` [PATCH v2 6/7] hostmem: Handle remapping of RAM “William Roche
2024-11-12 13:45                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:24                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2024-10-22 21:35                 ` [PATCH v1 4/4] accel/kvm: Report the loss of a large memory page “William Roche
2024-10-28 16:32             ` [RFC RESEND 0/6] hugetlbfs largepage RAS project David Hildenbrand
2024-11-25 14:27         ` [PATCH v3 0/7] hugetlbfs memory HW error fixes “William Roche
2024-11-25 14:27           ` [PATCH v3 1/7] hwpoison_page_list and qemu_ram_remap are based of pages “William Roche
2024-11-25 14:27           ` [PATCH v3 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2024-11-25 14:27           ` [PATCH v3 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-11-25 14:27           ` [PATCH v3 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-11-25 14:27           ` [PATCH v3 5/7] hostmem: Factor out applying settings “William Roche
2024-11-25 14:27           ` [PATCH v3 6/7] hostmem: Handle remapping of RAM “William Roche
2024-11-25 14:27           ` [PATCH v3 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2024-12-02 15:41           ` [PATCH v3 0/7] hugetlbfs memory HW error fixes William Roche
2024-12-02 16:00             ` David Hildenbrand
2024-12-03  0:15               ` William Roche
2024-12-03 14:08                 ` David Hildenbrand
2024-12-03 14:39                   ` William Roche
2024-12-03 15:00                     ` David Hildenbrand
2024-12-06 18:26                       ` William Roche
2024-12-09 21:25                         ` David Hildenbrand
2024-12-14 13:45         ` [PATCH v4 0/7] Poisoned memory recovery on reboot “William Roche
2024-12-14 13:45           ` [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages “William Roche
2025-01-08 21:34             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2025-01-14 13:56                 ` David Hildenbrand
2024-12-14 13:45           ` [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2025-01-08 21:44             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2025-01-14 14:00                 ` David Hildenbrand
2025-01-27 21:15                   ` William Roche
2024-12-14 13:45           ` [PATCH v4 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-12-14 13:45           ` [PATCH v4 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-12-14 13:45           ` [PATCH v4 5/7] hostmem: Factor out applying settings “William Roche
2025-01-08 21:58             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2024-12-14 13:45           ` [PATCH v4 6/7] hostmem: Handle remapping of RAM “William Roche
2025-01-08 21:51             ` [PATCH v4 6/7] c David Hildenbrand
2025-01-10 20:57               ` [PATCH v4 6/7] hostmem: Handle remapping of RAM William Roche
2024-12-14 13:45           ` [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2025-01-08 21:53             ` David Hildenbrand
2025-01-10 20:57               ` William Roche
2025-01-14 14:01                 ` David Hildenbrand
2025-01-08 21:22           ` [PATCH v4 0/7] Poisoned memory recovery on reboot David Hildenbrand
2025-01-10 20:55             ` William Roche
2025-01-10 21:13         ` [PATCH v5 0/6] " “William Roche
2025-01-10 21:14           ` [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
2025-01-14 14:02             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-28 18:41                 ` David Hildenbrand
2025-01-10 21:14           ` [PATCH v5 2/6] system/physmem: poisoned memory discard on reboot “William Roche
2025-01-14 14:07             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-10 21:14           ` [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page “William Roche
2025-01-14 14:09             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-28 18:45                 ` David Hildenbrand
2025-01-10 21:14           ` [PATCH v5 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
2025-01-10 21:14           ` [PATCH v5 5/6] hostmem: Factor out applying settings “William Roche
2025-01-10 21:14           ` [PATCH v5 6/6] hostmem: Handle remapping of RAM “William Roche
2025-01-14 14:11             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-14 14:12           ` [PATCH v5 0/6] Poisoned memory recovery on reboot David Hildenbrand
2025-01-27 21:16             ` William Roche

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240910100216.2744078-1-william.roche@oracle.com \
    --to=william.roche@oracle.com \
    --cc=berrange@redhat.com \
    --cc=david@redhat.com \
    --cc=joao.m.martins@oracle.com \
    --cc=kvm@vger.kernel.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).