qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: "“William Roche" <william.roche@oracle.com>,
	pbonzini@redhat.com, peterx@redhat.com, philmd@linaro.org,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	thuth@redhat.com, richard.henderson@linaro.org,
	peter.maydell@linaro.org, mtosatti@redhat.com,
	qemu-devel@nongnu.org
Cc: kvm@vger.kernel.org, qemu-arm@nongnu.org, joao.m.martins@oracle.com
Subject: Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project
Date: Tue, 10 Sep 2024 13:36:35 +0200	[thread overview]
Message-ID: <ec3337f7-3906-4a1b-b153-e3d5b16685b6@redhat.com> (raw)
In-Reply-To: <20240910100216.2744078-1-william.roche@oracle.com>

On 10.09.24 12:02, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 

Hi,

> 
> Apologies for the noise; resending as I missed CC'ing the maintainers of the
> changed files
> 
> 
> Hello,
> 
> This is a Qemu RFC to introduce the possibility to deal with hardware
> memory errors impacting hugetlbfs memory backed VMs. When using
> hugetlbfs large pages, any large page location being impacted by an
> HW memory error results in poisoning the entire page, suddenly making
> a large chunk of the VM memory unusable.
> 
> The implemented proposal is simply a memory mapping change when an HW error
> is reported to Qemu, to transform a hugetlbfs large page into a set of
> standard sized pages. The failed large page is unmapped and a set of
> standard sized pages are mapped in place.
> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received
> by qemu and the reported location corresponds to a large page.
> 
> This gives the possibility to:
> - Take advantage of newer hypervisor kernel providing a way to retrieve
> still valid data on the impacted hugetlbfs poisoned large page.
> If the backend file is MAP_SHARED, we can copy the valid data into the

How are you dealing with other consumers of the shared memory, such as 
vhost-user processes, vm migration whereby RAM is migrated using file 
content, vfio that might have these pages pinned?

In general, you cannot simply replace pages by private copies when 
somebody else might be relying on these pages to go to actual guest RAM.

It sounds very hacky and incomplete at first.


-- 
Cheers,

David / dhildenb



  parent reply	other threads:[~2024-09-10 11:37 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-10  9:07 [RFC 0/6] hugetlbfs largepage RAS project “William Roche
2024-09-10  9:07 ` [RFC 1/6] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-09-10  9:07 ` [RFC 2/6] accel/kvm: Keep track of the HWPoisonPage sizes “William Roche
2024-09-10  9:07 ` [RFC 3/6] system/physmem: Remap memory pages on reset based on the page size “William Roche
2024-09-10  9:07 ` [RFC 4/6] system: Introducing hugetlbfs largepage RAS feature “William Roche
2024-09-10  9:07 ` [RFC 5/6] system/hugetlb_ras: Handle madvise SIGBUS signal on listener “William Roche
2024-09-10  9:07 ` [RFC 6/6] system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume “William Roche
2024-09-10 10:02 ` [RFC RESEND 0/6] hugetlbfs largepage RAS project “William Roche
2024-09-10 10:02   ` [RFC RESEND 1/6] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-09-10 10:02   ` [RFC RESEND 2/6] accel/kvm: Keep track of the HWPoisonPage sizes “William Roche
2024-09-10 10:02   ` [RFC RESEND 3/6] system/physmem: Remap memory pages on reset based on the page size “William Roche
2024-09-10 10:02   ` [RFC RESEND 4/6] system: Introducing hugetlbfs largepage RAS feature “William Roche
2024-09-10 10:02   ` [RFC RESEND 5/6] system/hugetlb_ras: Handle madvise SIGBUS signal on listener “William Roche
2024-09-10 10:02   ` [RFC RESEND 6/6] system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume “William Roche
2024-09-10 11:36   ` David Hildenbrand [this message]
2024-09-10 16:24     ` [RFC RESEND 0/6] hugetlbfs largepage RAS project William Roche
2024-09-11 22:07       ` David Hildenbrand
2024-09-12 17:07         ` William Roche
2024-09-19 16:52           ` William Roche
2024-10-09 15:45             ` Peter Xu
2024-10-10 20:35               ` William Roche
2024-10-22 21:34               ` [PATCH v1 0/4] hugetlbfs memory HW error fixes “William Roche
2024-10-22 21:35                 ` [PATCH v1 1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb “William Roche
2024-10-22 21:35                 ` [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size “William Roche
2024-10-23  7:28                   ` David Hildenbrand
2024-10-25 23:27                     ` William Roche
2024-10-28 16:42                       ` David Hildenbrand
2024-10-30  1:56                         ` William Roche
2024-11-04 14:10                           ` David Hildenbrand
2024-10-25 23:30                     ` William Roche
2024-10-22 21:35                 ` [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages “William Roche
2024-10-23  7:30                   ` David Hildenbrand
2024-10-25 23:27                     ` William Roche
2024-10-28 17:01                       ` David Hildenbrand
2024-10-30  1:56                         ` William Roche
2024-11-04 13:30                           ` David Hildenbrand
2024-11-07 10:21                             ` [PATCH v2 0/7] hugetlbfs memory HW error fixes “William Roche
2024-11-07 10:21                               ` [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size “William Roche
2024-11-12 10:30                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 21:35                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2024-11-12 11:07                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:06                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-11-12 11:13                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:22                                     ` David Hildenbrand
2024-11-15 21:03                                       ` William Roche
2024-11-18  9:45                                         ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-11-07 10:21                               ` [PATCH v2 5/7] hostmem: Factor out applying settings “William Roche
2024-11-07 10:21                               ` [PATCH v2 6/7] hostmem: Handle remapping of RAM “William Roche
2024-11-12 13:45                                 ` David Hildenbrand
2024-11-12 18:17                                   ` William Roche
2024-11-12 22:24                                     ` David Hildenbrand
2024-11-07 10:21                               ` [PATCH v2 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2024-10-22 21:35                 ` [PATCH v1 4/4] accel/kvm: Report the loss of a large memory page “William Roche
2024-10-28 16:32             ` [RFC RESEND 0/6] hugetlbfs largepage RAS project David Hildenbrand
2024-11-25 14:27         ` [PATCH v3 0/7] hugetlbfs memory HW error fixes “William Roche
2024-11-25 14:27           ` [PATCH v3 1/7] hwpoison_page_list and qemu_ram_remap are based of pages “William Roche
2024-11-25 14:27           ` [PATCH v3 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2024-11-25 14:27           ` [PATCH v3 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-11-25 14:27           ` [PATCH v3 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-11-25 14:27           ` [PATCH v3 5/7] hostmem: Factor out applying settings “William Roche
2024-11-25 14:27           ` [PATCH v3 6/7] hostmem: Handle remapping of RAM “William Roche
2024-11-25 14:27           ` [PATCH v3 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2024-12-02 15:41           ` [PATCH v3 0/7] hugetlbfs memory HW error fixes William Roche
2024-12-02 16:00             ` David Hildenbrand
2024-12-03  0:15               ` William Roche
2024-12-03 14:08                 ` David Hildenbrand
2024-12-03 14:39                   ` William Roche
2024-12-03 15:00                     ` David Hildenbrand
2024-12-06 18:26                       ` William Roche
2024-12-09 21:25                         ` David Hildenbrand
2024-12-14 13:45         ` [PATCH v4 0/7] Poisoned memory recovery on reboot “William Roche
2024-12-14 13:45           ` [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages “William Roche
2025-01-08 21:34             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2025-01-14 13:56                 ` David Hildenbrand
2024-12-14 13:45           ` [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot “William Roche
2025-01-08 21:44             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2025-01-14 14:00                 ` David Hildenbrand
2025-01-27 21:15                   ` William Roche
2024-12-14 13:45           ` [PATCH v4 3/7] accel/kvm: Report the loss of a large memory page “William Roche
2024-12-14 13:45           ` [PATCH v4 4/7] numa: Introduce and use ram_block_notify_remap() “William Roche
2024-12-14 13:45           ` [PATCH v4 5/7] hostmem: Factor out applying settings “William Roche
2025-01-08 21:58             ` David Hildenbrand
2025-01-10 20:56               ` William Roche
2024-12-14 13:45           ` [PATCH v4 6/7] hostmem: Handle remapping of RAM “William Roche
2025-01-08 21:51             ` [PATCH v4 6/7] c David Hildenbrand
2025-01-10 20:57               ` [PATCH v4 6/7] hostmem: Handle remapping of RAM William Roche
2024-12-14 13:45           ` [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification “William Roche
2025-01-08 21:53             ` David Hildenbrand
2025-01-10 20:57               ` William Roche
2025-01-14 14:01                 ` David Hildenbrand
2025-01-08 21:22           ` [PATCH v4 0/7] Poisoned memory recovery on reboot David Hildenbrand
2025-01-10 20:55             ` William Roche
2025-01-10 21:13         ` [PATCH v5 0/6] " “William Roche
2025-01-10 21:14           ` [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
2025-01-14 14:02             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-28 18:41                 ` David Hildenbrand
2025-01-10 21:14           ` [PATCH v5 2/6] system/physmem: poisoned memory discard on reboot “William Roche
2025-01-14 14:07             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-10 21:14           ` [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page “William Roche
2025-01-14 14:09             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-28 18:45                 ` David Hildenbrand
2025-01-10 21:14           ` [PATCH v5 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
2025-01-10 21:14           ` [PATCH v5 5/6] hostmem: Factor out applying settings “William Roche
2025-01-10 21:14           ` [PATCH v5 6/6] hostmem: Handle remapping of RAM “William Roche
2025-01-14 14:11             ` David Hildenbrand
2025-01-27 21:16               ` William Roche
2025-01-14 14:12           ` [PATCH v5 0/6] Poisoned memory recovery on reboot David Hildenbrand
2025-01-27 21:16             ` William Roche

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ec3337f7-3906-4a1b-b153-e3d5b16685b6@redhat.com \
    --to=david@redhat.com \
    --cc=berrange@redhat.com \
    --cc=joao.m.martins@oracle.com \
    --cc=kvm@vger.kernel.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=thuth@redhat.com \
    --cc=william.roche@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).