From: David Hildenbrand <david@redhat.com>
To: William Roche <william.roche@oracle.com>,
kvm@vger.kernel.org, qemu-devel@nongnu.org, qemu-arm@nongnu.org
Cc: peterx@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, philmd@linaro.org,
peter.maydell@linaro.org, mtosatti@redhat.com,
joao.m.martins@oracle.com
Subject: Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages
Date: Mon, 4 Nov 2024 14:30:27 +0100
Message-ID: <e2ac7ad0-aa26-4af2-8bb3-825cba4ffca0@redhat.com>
In-Reply-To: <416a47ff-3324-444b-a2e2-9ea775e61244@oracle.com>
>>>
>>> Remapping the page is needed to get rid of the poison. So if we want to
>>> avoid the mmap(), we have to shrink the memory address space -- which
>>> can be a real problem if we imagine a VM with 1G large pages for
>>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>>> mmap() call looks mandatory on the reset phase.
>>
>> Why can't we use ram_block_discard_range() to zap the poisoned page
>> (unmap from page tables + conditionally drop from the page cache)? Is
>> there anything important I am missing?
>
> Or maybe _I'm_ missing something important, but what I understand is that:
> need_madvise = (rb->page_size == qemu_real_host_page_size());
>
> ensures that the madvise call on ram_block_discard_range() is not done
> in the case off hugepages.
> In this case, we need to call mmap the remap the hugetlbfs large page.
Right, madvise(DONTNEED) works ever since commit 90e7e7f5ef3f ("mm: enable
MADV_DONTNEED for hugetlb mappings").
But as you note, QEMU has so far never called madvise(DONTNEED) for
hugetlb. Note, however, that we always have an "fd" with hugetlb, because
we never use mmap(MAP_ANON|MAP_PRIVATE|MAP_HUGETLB) in QEMU.
The weird thing is that if you have a mmap(fd, MAP_PRIVATE) hugetlb
mapping, fallocate(fd, FALLOC_FL_PUNCH_HOLE) will *also* zap any private
pages. So in contrast to "ordinary" memory, the madvise(DONTNEED) is not
required.
(yes, it's very weird)
So the fallocate(fd, FALLOC_FL_PUNCH_HOLE) will zap the hugetlb page and
you will get a fresh one on next fault.
For all the glorious details, see:
https://lore.kernel.org/linux-mm/2ddd0a26-33fd-9cde-3501-f0584bbffefc@redhat.com/
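To make the difference concrete, here is a minimal sketch, not QEMU code. The helper names discard_private_page() and demo_discard_ordinary() are made up for illustration. It punches a hole in the backing fd and, only for "ordinary" (non-hugetlb) memory, additionally zaps the private copy with madvise(MADV_DONTNEED). The self-check uses a plain memfd so that it runs without reserved hugetlb pages; the hugetlb path is therefore not exercised here and relies on the behavior described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: discard one range of a MAP_PRIVATE, fd-backed
 * mapping so that the next access faults in a fresh page.
 */
static int discard_private_page(int fd, void *addr, off_t offset,
                                size_t len, bool is_hugetlb)
{
    /* Drop the range from the file (page cache). For hugetlb, this is
     * described above as also zapping private pages. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, (off_t)len) != 0) {
        return -1;
    }
    /* Ordinary memory: the private copy survives the hole punch, so it
     * has to be zapped explicitly. */
    if (!is_hugetlb && madvise(addr, len, MADV_DONTNEED) != 0) {
        return -1;
    }
    return 0;
}

/*
 * Self-check on an ordinary memfd (assumption: memfd_create() is usable).
 * Returns 0 if the page reads back as zero after the discard, 1 if stale
 * data is still visible, -1 on setup errors.
 */
static int demo_discard_ordinary(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int ret;
    int fd = memfd_create("discard-demo", 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(p, 0xAA, len);               /* creates a private (COW) copy */
    ret = discard_private_page(fd, p, 0, len, false);
    if (ret == 0 && p[0] != 0) {
        ret = 1;                        /* old contents still mapped */
    }
    munmap(p, len);
    close(fd);
    return ret;
}
```

For a hugetlb-backed RAMBlock, the same helper would be called with is_hugetlb = true and a len that is a multiple of the huge page size.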
>
> As I said in the previous email, recent kernels start to implement these
> calls for hugetlbfs, but I'm not sure that changing the mechanism of
> this ram_block_discard_range() function now is appropriate.
> Do you agree with that ?
The key point is that it works for hugetlb without madvise(DONTNEED),
which is weird :)
That is also why the kernel change introducing MADV_DONTNEED support for
hugetlb noted: "Do note that there is no compelling use case for adding
this support.
This was discussed in the RFC [1]. However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory."
[...]
>>
>> So one would implement a ram_block_notify_remap() and maybe indicate if
>> we had to do MAP_FIXED or if we only discarded the page.
>>
>> I once had a prototype for that, let me dig ...
>
> That would be great ! Thanks.
Found them:
https://gitlab.com/virtio-mem/qemu/-/commit/f528c861897d1086ae84ea1bcd6a0be43e8fea7d
https://gitlab.com/virtio-mem/qemu/-/commit/c5b0328654def8f168497715409d6364096eb63f
https://gitlab.com/virtio-mem/qemu/-/commit/15e9737907835105c132091ad10f9d0c9c68ea64
But note that I didn't realize back then that mmap(MAP_FIXED) is the
wrong way to do it, and that we actually have to DONTNEED/PUNCH_HOLE to
do it properly. The prototype should still be valuable for getting the
preallocation performed by the backend, though.
Note that I wonder if we can get rid of the mmap(MAP_FIXED) handling
completely: likely we only support Linux with MCE recovery, and
ram_block_discard_range() should do what we need under Linux.
That would make it a lot simpler.
>
>>
>>>
>>> I can send a new version using ram_block_discard_range() as you
>>> suggested to replace the direct call to fallocate(), if you think it
>>> would be better.
>>> Please let me know what other enhancement(s) you'd like to see in this
>>> code change.
>>
>> Something along the lines above. Please let me know if you see problems
>> with that approach that I am missing.
>
>
> Let me check the madvise use on hugetlbfs and if it works as expected,
> I'll try to implement a V2 version of the fix proposal integrating a
> modified ram_block_discard_range() function.
As discussed, it might all be working. If not, we would have to fix
ram_block_discard_range().
>
> I'll also remove the page size information from the signal handlers
> and only keep it in the kvm_hwpoison_page_add() function.
That's good. Especially because there was talk in the last bi-weekly MM
sync [1] about possibly indicating only the actually failed cachelines
in the future, not necessarily the full page.
So relying on that interface to return the actual pagesize would not be
future-proof.
That session was in general very interesting and very relevant for your
work; did you by any chance attend it? If not, we should find you the
recordings, because the idea is to be able to configure to
not-unmap-during-mce, and instead only inform the guest OS about the MCE
(forward it). Which avoids any HGM (high-granularity mapping) issues
completely.
Only during reboot of the VM will we have to do exactly what is being
done in this series: zap the whole *page* so our fresh OS will see "all
non-faulty" memory.
[1]
https://lkml.kernel.org/r/9242f7cc-6b9d-b807-9079-db0ca81f3c6d@google.com
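As a rough illustration of why the reported granularity and the backing page size are better kept separate, here is a minimal sketch, not QEMU code. The struct and function names are hypothetical. It assumes the SIGBUS handler gets the failing address from si_addr and the least significant valid address bit from si_addr_lsb (as siginfo_t provides for BUS_MCEERR_AO/AR), and that the backing page size comes from somewhere else, e.g. the RAMBlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a contiguous range of addresses. */
struct mce_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Range that the kernel reported as damaged: 2^lsb bytes, aligned down.
 * Assumes lsb < 64. This may be a cacheline, a base page or a huge page.
 */
static struct mce_range mce_range_from_lsb(uint64_t addr, unsigned int lsb)
{
    struct mce_range r;

    r.len = (uint64_t)1 << lsb;
    r.start = addr & ~(r.len - 1);
    return r;
}

/*
 * Range that has to be zapped on reboot: the reported range expanded to
 * whole backing pages. page_size must be a power of two and is NOT taken
 * from the signal information.
 */
static struct mce_range mce_range_to_pages(struct mce_range r,
                                           uint64_t page_size)
{
    struct mce_range out;
    uint64_t mask = page_size - 1;
    uint64_t end = (r.start + r.len + mask) & ~mask;

    out.start = r.start & ~mask;
    out.len = end - out.start;
    return out;
}
```

With a 64-byte report (lsb = 6) inside a 2 MiB hugetlb page, the first helper yields the cacheline, and the second yields the full 2 MiB page that would be discarded on reset.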
>
> I'll investigate how to keep track of the 'prealloc' attribute to
> optionally use when remapping the hugepages (on older kernels).
> And if you find the prototype code you talked about that would
> definitely help :)
Right, the above should help getting that sorted out (but the code is 4
years old, so it won't "just apply").
--
Cheers,
David / dhildenb