From: David Hildenbrand <david@redhat.com>
To: William Roche <william.roche@oracle.com>,
	kvm@vger.kernel.org, qemu-devel@nongnu.org, qemu-arm@nongnu.org
Cc: peterx@redhat.com, pbonzini@redhat.com,
	richard.henderson@linaro.org, philmd@linaro.org,
	peter.maydell@linaro.org, mtosatti@redhat.com,
	joao.m.martins@oracle.com
Subject: Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages
Date: Mon, 28 Oct 2024 18:01:29 +0100
Message-ID: <6cc00e04-6e38-4970-9d6b-52b56ee20a64@redhat.com>
In-Reply-To: <e9f8e404-50db-4e0f-a5e1-749acad49325@oracle.com>

On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:30, David Hildenbrand wrote:
> 
>> On 22.10.24 23:35, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When the VM reboots, a memory reset is performed calling
>>> qemu_ram_remap() on all hwpoisoned pages.
>>> While we take into account the recorded page sizes to repair the
>>> memory locations, a large page also needs to punch a hole in the
>>> backend file to regenerate a usable memory, cleaning the HW
>>> poisoned section. This is mandatory for hugetlbfs case for example.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    system/physmem.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index 3757428336..3f6024a92d 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
>>> ram_addr_t length)
>>>                    prot = PROT_READ;
>>>                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>>                    if (block->fd >= 0) {
>>> +                    if (length > TARGET_PAGE_SIZE &&
>>> fallocate(block->fd,
>>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>>> +                        offset + block->fd_offset, length) != 0) {
>>> +                        error_report("Could not recreate the file
>>> hole for "
>>> +                                     "addr: " RAM_ADDR_FMT "@"
>>> RAM_ADDR_FMT "",
>>> +                                     length, addr);
>>> +                        exit(1);
>>> +                    }
>>>                        area = mmap(vaddr, length, prot, flags, block->fd,
>>>                                    offset + block->fd_offset);
>>>                    } else {
>>
>> Ah! Just what I commented to patch #3; we should be using
>> ram_discard_range(). It might be better to avoid the mmap() completely
>> if ram_discard_range() worked.
> 

Hi!

> 
> I think you are referring to ram_block_discard_range() here, as
> ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Please take a look at the users of ram_block_discard_range(), including 
virtio-balloon to completely zap guest memory, so we will get fresh 
memory on next access. It takes care of process-private and file-backed 
(shared) memory.
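
To illustrate what "zapping" means at the syscall level, here is a
minimal standalone sketch. It is not the QEMU implementation:
zap_range() and its parameters are hypothetical names, and the real
ram_block_discard_range() handles more cases (alignment checks,
discard being disabled, shared anonymous memory, ...).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not QEMU code): drop the pages backing
 * [vaddr, vaddr + len) so that the next access faults in fresh memory.
 *
 * fd < 0 means anonymous memory. Otherwise the range maps file "fd"
 * starting at file offset "fd_off".
 * Returns 0 on success, -1 with errno set on failure.
 */
static int zap_range(void *vaddr, size_t len, int fd, off_t fd_off)
{
    assert(len > 0);

    if (fd >= 0) {
        /* File-backed: free the backing pages in the file itself. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      fd_off, (off_t)len) != 0) {
            return -1;
        }
    }

    /*
     * Drop the process page table entries, and any private copies of
     * the pages. Note: on hugetlb this needs a kernel that supports
     * MADV_DONTNEED for hugetlb mappings.
     */
    return madvise(vaddr, len, MADV_DONTNEED);
}
```

For private anonymous memory, a byte written before zap_range() reads
back as zero afterwards, because the next access gets a fresh page.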

> 
> Remapping the page is needed to get rid of the poison. So if we want to
> avoid the mmap(), we have to shrink the memory address space -- which
> can be a real problem if we imagine a VM with 1G large pages for
> example. qemu_ram_remap() is used to regenerate the lost memory and the
> mmap() call looks mandatory on the reset phase.

Why can't we use ram_block_discard_range() to zap the poisoned page 
(unmap from page tables + conditionally drop from the page cache)? Is 
there anything important I am missing?

> 
> 
>>
>> And as raised, there is the problem with memory preallocation (where
>> we should fail if it doesn't work) and ram discards being disabled
>> because something relies on long-term page pinning ...
> 
> 
> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
> remapped area in case of a backend->prealloc being true ?

Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at 
runtime and SIGBUS QEMU :(

> 
> Or as we are running on posix machines for this piece of code (ifndef
> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
> done in qemu_ram_remap() in the case where the backend requires a
> 'prealloc' ?  Can you confirm if this flag could be used on all systems
> running this code ?

Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's 
really weird :/ mmap() might succeed even though MAP_POPULATE didn't 
work ... and it's problematic with NUMA policies because we essentially 
lose (overwrite) them.
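
The point is that population should be a separate step with its own
error return. As a rough standalone sketch (populate_range() is a
hypothetical name, and this is not how qemu_prealloc_mem() is
implemented), assuming a Linux kernel and headers that provide
MADV_POPULATE_WRITE:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not QEMU code): populate [addr, addr + len)
 * with writable pages and report failure to the caller, instead of
 * relying on MAP_POPULATE, where mmap() can succeed even though the
 * population did not.
 *
 * Returns 0 on success, -1 with errno set on failure. errno is ENOSYS
 * if the headers lack MADV_POPULATE_WRITE; a kernel that does not
 * know the advice fails the madvise() call with EINVAL.
 */
static int populate_range(void *addr, size_t len)
{
    assert(len > 0);
#ifdef MADV_POPULATE_WRITE
    return madvise(addr, len, MADV_POPULATE_WRITE);
#else
    (void)addr;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller that requires preallocation can then treat a non-zero return
as fatal (or fall back to another population method) rather than
finding out via SIGBUS at runtime.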

And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't 
reset the memory policy we apply in 
host_memory_backend_memory_complete() ... that code really needs a 
rewrite to do it properly.


Ideally, we'd do something high-level like


if (ram_block_discard_is_disabled()) {
	/*
	 * We cannot safely discard RAM,  ... for example we might have
	 * to remap all guest RAM into vfio after discarding the
	 * problematic pages ... TODO.
	 */
	exit(1);
}

/* Throw away the problematic (poisoned) page. */
if (ram_block_discard_range()) {
	/* Conditionally fall back to MAP_FIXED workaround */
	...
}

/* If preallocation was requested, we really must re-preallocate. */
if (prealloc && qemu_prealloc_mem()) {
	/* Preallocation failed .... */
	exit(1);
}

As you note, the last part is tricky. See below.

> 
> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
> a valid way to retrieve the Backend object and its 'prealloc' property
> here. Could you please give me a direction here ?

We could add a RAM_PREALLOC flag to hint that this memory has "prealloc" 
semantics.

I once had an alternative approach: Similar to ram_block_notify_resize() 
we would implement ram_block_notify_remap().

That's where the backend could register and re-apply mmap properties 
like NUMA policies (in case we have to fall back to MAP_FIXED) and handle 
the preallocation.

So one would implement a ram_block_notify_remap() and maybe indicate if 
we had to do MAP_FIXED or if we only discarded the page.

I once had a prototype for that, let me dig ...
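
In the meantime, to illustrate the shape of the idea: the following is
a standalone sketch, not that prototype and not existing QEMU API.
RemapNotifier, remap_notifier_add(), notify_remap() and DemoBackend
are all hypothetical names, and a real version would presumably hook
into the existing RAMBlockNotifier infrastructure instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical notifier: called after a RAM range was zapped/remapped. */
typedef struct RemapNotifier RemapNotifier;
struct RemapNotifier {
    void (*ram_block_remapped)(RemapNotifier *n, void *host,
                               size_t offset, size_t size, bool map_fixed);
    RemapNotifier *next;
};

static RemapNotifier *remap_notifiers;

/* Register a notifier; the caller keeps ownership of the structure. */
static void remap_notifier_add(RemapNotifier *n)
{
    assert(n && n->ram_block_remapped);
    n->next = remap_notifiers;
    remap_notifiers = n;
}

/*
 * Tell all registered notifiers about a remapped range. "map_fixed"
 * indicates that we had to use the MAP_FIXED workaround, as opposed to
 * only discarding the page.
 */
static void notify_remap(void *host, size_t offset, size_t size,
                         bool map_fixed)
{
    for (RemapNotifier *n = remap_notifiers; n; n = n->next) {
        n->ram_block_remapped(n, host, offset, size, map_fixed);
    }
}

/* Hypothetical backend that only counts what it would have to redo. */
typedef struct {
    RemapNotifier notifier;     /* must stay the first member */
    bool prealloc;
    unsigned policy_reapplied;
    unsigned prealloc_redone;
} DemoBackend;

static void demo_backend_remapped(RemapNotifier *n, void *host,
                                  size_t offset, size_t size, bool map_fixed)
{
    DemoBackend *b = (DemoBackend *)n;

    (void)host; (void)offset; (void)size;
    if (map_fixed) {
        /* A fresh mapping lost the mmap properties, e.g. NUMA policy. */
        b->policy_reapplied++;
    }
    if (b->prealloc) {
        /* Fresh memory must be preallocated again in either case. */
        b->prealloc_redone++;
    }
}
```

The map_fixed parameter lets the backend skip re-applying mmap
properties when the page was only discarded, while still redoing the
preallocation in both cases.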

> 
> I can send a new version using ram_block_discard_range() as you
> suggested to replace the direct call to fallocate(), if you think it
> would be better.
> Please let me know what other enhancement(s) you'd like to see in this
> code change.

Something along the lines above. Please let me know if you see problems 
with that approach that I am missing.

-- 
Cheers,

David / dhildenb



