From: David Hildenbrand <david@redhat.com>
To: William Roche <william.roche@oracle.com>,
pbonzini@redhat.com, peterx@redhat.com, philmd@linaro.org,
marcandre.lureau@redhat.com, berrange@redhat.com,
thuth@redhat.com, richard.henderson@linaro.org,
peter.maydell@linaro.org, mtosatti@redhat.com,
qemu-devel@nongnu.org
Cc: kvm@vger.kernel.org, qemu-arm@nongnu.org, joao.m.martins@oracle.com
Subject: Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project
Date: Thu, 12 Sep 2024 00:07:51 +0200
Message-ID: <cf587c8b-3894-4589-bfea-be5db70e81f3@redhat.com>
In-Reply-To: <9f9a975e-3a04-4923-b8a5-f1edbed945e6@oracle.com>
Hi again,
>>> This is a Qemu RFC to introduce the possibility to deal with hardware
>>> memory errors impacting hugetlbfs memory backed VMs. When using
>>> hugetlbfs large pages, any large page location being impacted by an
>>> HW memory error results in poisoning the entire page, suddenly making
>>> a large chunk of the VM memory unusable.
>>>
>>> The implemented proposal is simply a memory mapping change when an HW
>>> error is reported to Qemu, to transform a hugetlbfs large page into a
>>> set of standard sized pages. The failed large page is unmapped and a
>>> set of standard sized pages are mapped in place.
>>> This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is
>>> received by qemu and the reported location corresponds to a large page.
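For reference, a minimal sketch of how a SIGBUS handler could decide
which range a reported location covers. It assumes the handler reads the
fault address from si_addr and the extent from si_addr_lsb (both are
real siginfo_t fields for BUS_MCEERR_AO/AR); the struct and function
names are hypothetical and are not taken from the patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the memory range hit by a HW error. */
struct poison_range {
    uintptr_t start;  /* first byte of the affected range */
    size_t len;       /* size of the affected range in bytes */
};

/*
 * Derive the affected range from the values a SIGBUS handler would
 * read from info->si_addr and info->si_addr_lsb. si_addr_lsb is the
 * log2 of the affected size: 12 for a 4 KiB page, 21 for a 2 MiB
 * large page, 30 for a 1 GiB large page.
 */
static struct poison_range poison_range_from_report(uintptr_t addr,
                                                    unsigned lsb)
{
    size_t len = (size_t)1 << lsb;
    struct poison_range r;

    /* Round the address down to the start of the affected page. */
    r.start = addr & ~((uintptr_t)len - 1);
    r.len = len;
    return r;
}
```

A range larger than the standard page size is what identifies the
report as hitting a large page.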
One clarifying question: you simply replace the hugetlb page by multiple
small pages using mmap(MAP_FIXED). So you
(a) are not able to recover any memory of the original page (as of now)
(b) no longer have a hugetlb page and, therefore, possibly a performance
degradation, relevant in low-latency applications that really care
about the usage of hugetlb pages.
(c) run into the described inconsistency issues
Why is what you propose beneficial over just using fallocate(PUNCH_HOLE)
on the full page and getting a fresh, non-poisoned page instead?
Sure, you have to reserve some pages in case that ever happens, but what
is the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it
and it was spelled out)
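A minimal sketch of the PUNCH_HOLE step, assuming a file-backed
MAP_SHARED mapping. The helper name is hypothetical. For hugetlbfs the
offset and length would have to be multiples of the large page size,
and the next access only succeeds if a free large page is available in
the pool.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: drop the backing page(s) covering
 * [offset, offset + len) of the file behind a shared mapping.
 * The file size is kept, and the next access to that range faults
 * in a fresh, zero-filled page instead of the old one.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

The old content of the punched range is lost, which matches point (a)
above: nothing is recovered from the original page either way.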
>>>
>>> This gives the possibility to:
>>> - Take advantage of newer hypervisor kernel providing a way to retrieve
>>> still valid data on the impacted hugetlbfs poisoned large page.
Reading that again, that shouldn't have to be hypervisor-specific.
Really, if someone were to extract data from a poisoned hugetlb folio,
it shouldn't be hypervisor-specific. The kernel should be able to know
which regions are accessible and could allow ways for reading these, one
way or the other.
It could just be a fairly hugetlb-specific feature that would replace the
poisoned page by a fresh hugetlb page where as much page content as
possible has been recovered from the old one.
>>> If the backend file is MAP_SHARED, we can copy the valid data into the
>
>
> Thank you David for this first reaction on this proposal.
>
>
>> How are you dealing with other consumers of the shared memory,
>> such as vhost-user processes,
>
>
> In the current proposal, I don't deal with this aspect.
> In fact, any other process sharing the changed memory will
> continue to map the poisoned large page. So any access to
> this page will generate a SIGBUS to this other process.
>
> In this situation vhost-user processes should continue to receive
> SIGBUS signals (and probably continue to die because of that).
That's ... suboptimal. :)
Assume you have a 1 GiB page. The guest OS can happily allocate buffers
in there so they can end up in vhost-user and crash that process.
Without any warning.
>
> So I do see a real problem if 2 qemu processes are sharing the
> same hugetlbfs segment -- in this case, error recovery should not
> occur on this piece of the memory. Maybe dealing with this situation
> with "ivshmem" options is doable (marking the shared segment
> "not eligible" to hugetlbfs recovery, just like not "share=on"
> hugetlbfs entries are not eligible)
> -- I need to think about this specific case.
>
> Please let me know if there is a better way to deal with this
> shared memory aspect and have a better system reaction.
Not creating the inconsistency in the first place :)
>> vm migration whereby RAM is migrated using file content,
>
>
> Migration doesn't currently work with memory poisoning.
> You can give a look at the already integrated following commit:
>
> 06152b89db64 migration: prevent migration when VM has poisoned memory
>
> This proposal doesn't change anything on this side.
That commit is fairly fresh and likely missed the option to *not*
migrate RAM by reading it, but instead by migrating it through a shared
file. For example, VM live-upgrade (CPR) wants to use that (or is
already using that), to avoid RAM migration completely.
>
>> vfio that might have these pages pinned?
>
> AFAIK even pinned memory can be impacted by memory error and poisoned
> by the kernel. Now as I said in the cover letter, I'd like to know if
> we should take extra care for IO memory, vfio configured memory buffers...
Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make
the guest RAM point at anything other than what VFIO is aware of, we end
up with the same problem we had when we learned about having to disable
balloon inflation (MADV_DONTNEED) as soon as VFIO pinned pages.
We'd have to inform VFIO that the mapping is now different. Otherwise
it's really better to crash the VM than to have your GPU read/write
different data than your CPU reads/writes.
>
>
>> In general, you cannot simply replace pages by private copies
>> when somebody else might be relying on these pages to go to
>> actual guest RAM.
>
> This is correct, but the current proposal is dealing with a specific
> shared memory type: poisoned large pages. So any other process mapping
> this type of page can't access it without generating a SIGBUS.
Right, and that's the issue. Because, for example, how should the VM be
aware that this memory is now special and must not be used for some
purposes without leading to problems elsewhere?
>
>
>> It sounds very hacky and incomplete at first.
>
> As you can see, RAS features need to be completed.
> And if this proposal is incomplete, what other changes should be
> done to complete it ?
>
> I do hope we can discuss this RFC to adapt what is incorrect, or
> find a better way to address this situation.
One long-term goal people are working on is to allow remapping the
hugetlb folios in smaller granularity, such that only a single affected
PTE can be marked as poisoned. (used to be called high-granularity-mapping)
However, at the same time, the focus seems to shift towards using
guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared
memory. It will likely be easier to support mapping 1 GiB pages using
PTEs that way, and there are ongoing discussions how that can be
achieved more easily.
There are also discussions [1] about not poisoning the mappings at all
and handling it differently. But I haven't yet digested what exactly that
could look like in reality.
[1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com
--
Cheers,
David / dhildenb