[PATCH v7 0/6] Poisoned memory recovery on reboot

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 0/6] Poisoned memory recovery on reboot
@ 2025-02-01  9:57 “William Roche
  2025-02-01  9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: William Roche <william.roche@oracle.com>

Hello David,

Here is the version with the small nits corrected.
And the 'Acked-by' entries you gave me for patch 1 and 2.

 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs and the generic memory recovery
on VM reset.
When using hugetlbfs large pages, any large page location being impacted
by an HW memory error results in poisoning the entire page, suddenly
making a large chunk of the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, resulting in the impacted
memory to be silently unusable even after a VM reboot.

In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.

Using the page size information we also try to regenerate the memory
calling ram_block_discard_range() on VM reset when running
qemu_ram_remap(). So that a poisoned memory backed by a hugetlbfs
file is regenerated with a hole punched in this file. A new page is
loaded when the location is first touched.

In case of a discard failure we fall back to remapping the memory
location. We also have to reset the memory settings and honor the
'prealloc' attribute.

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also enrich the messages used to report a memory error relayed to
the VM, providing an identification of memory page and its size in
case of a large page impacted.
 ----

v1 -> v2:
. I removed the kernel SIGBUS siginfo provided lsb size information
  tracking. Only relying on the RAMBlock page_size instead.
. I adapted the 3 patches you indicated me to implement the
  notification mechanism on remap.  Thank you for this code!
  I left them as Authored by you.
  But I haven't tested if the policy setting works as expected on VM
  reset, only that the replacement of physical memory works.
. I also removed the old memory setting that was kept in qemu_ram_remap()
  but this small last fix could probably be merged with your last commit.

v2 -> v3:
. dropped the size parameter from qemu_ram_remap() and determine the page
  size when adding it to the poison list, aligning the offset down to the
  pagesize. Multiple sub-pages poisoned on a large page lead to a single
  poison entry.
. introduction of a helper function for the mmap code
. adding "on lost large page <size>@<ram_addr>" to the error injection
  msg (notation used in qemu_ram_remap() too ).
  So only in the case of a large page, it looks like:
Guest MCE Memory Error at QEMU addr 0x7fc1f5dd6000 and GUEST addr 0x19fd6000 on lost large page 200000@19e00000 of type BUS_MCEERR_AR injected
. as we need the page_size value for the above message, I retrieve the
  value in kvm_arch_on_sigbus_vcpu() to pass the appropriate pointer
  to kvm_hwpoison_page_add() that doesn't need to align it anymore.
. added a similar message for the ARM platform (removing the MCE
  keyword)
. I also introduced a "fail hard" in the remap notification:
  host_memory_backend_ram_remapped()

v3 -> v4:
. Fixed some commit messages typos
. Enhanced some code comments
. Changed the discard fall back conditions to consider only anonymous
  memory
. Fixed missing some variable name changes in intermediary patches.
. Modify the error message given when an error is injected to report
  the case of a large page
. use snprintf() to generate this message
. Adding this same type of message in the ARM case too

v4->v5:
. Updated commit messages (for patches 1, 5 and 6)
. Fixed comment typo of patch 2
. Changed the fall back function parameters to match the
  ram_block_discard_range() function.
. Removed the unused case of remapping a file in this function
. add the assert(block->fd < 0) in this function too
. I merged my patch 7 with your patch 6 (we only have 6 patches now)

v5->v6:
. don't align down ram_addr on kvm_hwpoison_page_add() but create
  a new entry for each subpage reported as poisoned
. introduce similar messages about memory error as discard_range()
. introduce a function to retrieve more information about a RAMBlock
  experiencing an error than just its associated page size
. file offset as an uint64_t instead of a ram_addr_t
. changed ownership of patch 6/6

v6->v7:
. change the block location information collection function name to
  qemu_ram_block_info_from_addr()
. display the fd_offset value only when dealing with a file backend
  in kvm_hwpoison_page_add() and qemu_ram_remap()
. better placed offset alignment computation
. two empty separation lines missing

This code is scripts/checkpatch.pl clean
'make check' runs clean on both x86 and ARM.

David Hildenbrand (2):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings

William Roche (4):
  system/physmem: handle hugetlb correctly in qemu_ram_remap()
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  hostmem: Handle remapping of RAM

 accel/kvm/kvm-all.c       |  20 +++-
 backends/hostmem.c        | 189 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |  12 ++-
 include/exec/ramlist.h    |   3 +
 include/system/hostmem.h  |   1 +
 system/physmem.c          | 107 +++++++++++++++------
 target/arm/kvm.c          |   3 +
 8 files changed, 244 insertions(+), 102 deletions(-)

-- 
2.43.5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-04 17:09   ` Peter Xu
  2025-02-01  9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot “William Roche
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend real page size.
To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
hugetlb page; hugetlb pages cannot be partially mapped.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 accel/kvm/kvm-all.c       |  2 +-
 include/exec/cpu-common.h |  2 +-
 system/physmem.c          | 38 +++++++++++++++++++++++++++++---------
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790433..f89568bfa3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1288,7 +1288,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6985..3771b2130c 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
 
 /* memory API */
 
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index c76503aea8..3dd2adde73 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,17 +2167,35 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+/*
+ * qemu_ram_remap - remap a single RAM page
+ *
+ * @addr: address in ram_addr_t address space.
+ *
+ * This function will try remapping a single page of guest RAM identified by
+ * @addr, essentially discarding memory to recover from previously poisoned
+ * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
+ * does not have to point at the start of the page.
+ *
+ * This function is only to be used during system resets; it will kill the
+ * VM if remapping failed.
+ */
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
-    ram_addr_t offset;
+    uint64_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2191,21 +2209,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
                     flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                 }
                 if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                    error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
+                                 " +%zx", block->idstr, offset,
+                                 block->fd_offset, page_size);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
             }
+
+            break;
         }
     }
 }
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()
  2025-02-01  9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
@ 2025-02-04 17:09   ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:09 UTC (permalink / raw)
  To: “William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Sat, Feb 01, 2025 at 09:57:21AM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> The list of hwpoison pages used to remap the memory on reset
> is based on the backend real page size.
> To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
> hugetlb page; hugetlb pages cannot be partially mapped.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
  2025-02-01  9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-04 17:09   ` Peter Xu
  2025-02-01  9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page “William Roche
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: William Roche <william.roche@oracle.com>

Repair poisoned memory location(s), calling ram_block_discard_range():
punching a hole in the backend file when necessary and regenerating
a usable memory.
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).

Signed-off-by: William Roche <william.roche@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 3dd2adde73..e8ff930bc9 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,6 +2167,23 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
+/* Simply remap the given VM memory location from start to start+length */
+static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
+{
+    int flags, prot;
+    void *area;
+    void *host_startaddr = block->host + start;
+
+    assert(block->fd < 0);
+    flags = MAP_FIXED | MAP_ANONYMOUS;
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    area = mmap(host_startaddr, length, prot, flags, -1, 0);
+    return area != host_startaddr ? -errno : 0;
+}
+
 /*
  * qemu_ram_remap - remap a single RAM page
  *
@@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     uint64_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
                 ;
             } else if (xen_enabled()) {
                 abort();
-            } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, page_size, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
-                                 " +%zx", block->idstr, offset,
-                                 block->fd_offset, page_size);
-                    exit(1);
+                if (ram_block_discard_range(block, offset, page_size) != 0) {
+                    /*
+                     * Fall back to using mmap() only for anonymous mapping,
+                     * as if a backing file is associated we may not be able
+                     * to recover the memory in all cases.
+                     * So don't take the risk of using only mmap and fail now.
+                     */
+                    if (block->fd >= 0) {
+                        error_report("Could not remap RAM %s:%" PRIx64 "+%"
+                                     PRIx64 " +%zx", block->idstr, offset,
+                                     block->fd_offset, page_size);
+                        exit(1);
+                    }
+                    if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
+                        error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
+                                     block->idstr, offset, page_size);
+                        exit(1);
+                    }
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
  2025-02-01  9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot “William Roche
@ 2025-02-04 17:09   ` Peter Xu
  2025-02-05 16:27     ` William Roche
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:09 UTC (permalink / raw)
  To: “William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Sat, Feb 01, 2025 at 09:57:22AM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Repair poisoned memory location(s), calling ram_block_discard_range():
> punching a hole in the backend file when necessary and regenerating
> a usable memory.
> If the kernel doesn't support the madvise calls used by this function
> and we are dealing with anonymous memory, fall back to remapping the
> location(s).
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 3dd2adde73..e8ff930bc9 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2167,6 +2167,23 @@ void qemu_ram_free(RAMBlock *block)
>  }
>  
>  #ifndef _WIN32
> +/* Simply remap the given VM memory location from start to start+length */
> +static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
> +{
> +    int flags, prot;
> +    void *area;
> +    void *host_startaddr = block->host + start;
> +
> +    assert(block->fd < 0);
> +    flags = MAP_FIXED | MAP_ANONYMOUS;
> +    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
> +    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +    prot = PROT_READ;
> +    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +    area = mmap(host_startaddr, length, prot, flags, -1, 0);
> +    return area != host_startaddr ? -errno : 0;
> +}
> +
>  /*
>   * qemu_ram_remap - remap a single RAM page
>   *
> @@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
>  {
>      RAMBlock *block;
>      uint64_t offset;
> -    int flags;
> -    void *area, *vaddr;
> -    int prot;
> +    void *vaddr;
>      size_t page_size;
>  
>      RAMBLOCK_FOREACH(block) {
> @@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
>                  ;
>              } else if (xen_enabled()) {
>                  abort();
> -            } else {

Do we need to keep this line?  Otherwise it looks to me the new code won't
be executed at all in !xen..

> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, page_size, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
> -                                 " +%zx", block->idstr, offset,
> -                                 block->fd_offset, page_size);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset, page_size) != 0) {
> +                    /*
> +                     * Fall back to using mmap() only for anonymous mapping,
> +                     * as if a backing file is associated we may not be able
> +                     * to recover the memory in all cases.
> +                     * So don't take the risk of using only mmap and fail now.
> +                     */
> +                    if (block->fd >= 0) {
> +                        error_report("Could not remap RAM %s:%" PRIx64 "+%"
> +                                     PRIx64 " +%zx", block->idstr, offset,
> +                                     block->fd_offset, page_size);
> +                        exit(1);
> +                    }
> +                    if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
> +                        error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
> +                                     block->idstr, offset, page_size);
> +                        exit(1);
> +                    }
>                  }
>                  memory_try_enable_merging(vaddr, page_size);
>                  qemu_ram_setup_dump(vaddr, page_size);
> -- 
> 2.43.5
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
  2025-02-04 17:09   ` Peter Xu
@ 2025-02-05 16:27     ` William Roche
  0 siblings, 0 replies; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On 2/4/25 18:09, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:22AM +0000, “William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Repair poisoned memory location(s), calling ram_block_discard_range():
>> punching a hole in the backend file when necessary and regenerating
>> a usable memory.
>> If the kernel doesn't support the madvise calls used by this function
>> and we are dealing with anonymous memory, fall back to remapping the
>> location(s).
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> ---
>>   system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
>>   1 file changed, 36 insertions(+), 22 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 3dd2adde73..e8ff930bc9 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> ...
>>   /*
>>    * qemu_ram_remap - remap a single RAM page
>>    *
>> @@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
>>   {
>>       RAMBlock *block;
>>       uint64_t offset;
>> -    int flags;
>> -    void *area, *vaddr;
>> -    int prot;
>> +    void *vaddr;
>>       size_t page_size;
>>   
>>       RAMBLOCK_FOREACH(block) {
>> @@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
>>                   ;
>>               } else if (xen_enabled()) {
>>                   abort();
>> -            } else {
> 
> Do we need to keep this line?  Otherwise it looks to me the new code won't
> be executed at all in !xen..


Very true, this is a mistake I made in a very recent version - sorry!
I'll fix it with my next version.
Thanks.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
  2025-02-01  9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
  2025-02-01  9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-04 17:01   ` Peter Xu
  2025-02-01  9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: William Roche <william.roche@oracle.com>

In case of a large page impacted by a memory error, provide an
information about the impacted large page before the memory
error injection message.

This message would also appear on ras enabled ARM platforms, with
the introduction of an x86 similar error injection message.

In the case of a large page impacted, we now report:
Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>

The +<fd_offset> information is only provided with a file backend.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       | 18 ++++++++++++++++++
 include/exec/cpu-common.h | 10 ++++++++++
 system/physmem.c          | 22 ++++++++++++++++++++++
 target/arm/kvm.c          |  3 +++
 4 files changed, 53 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f89568bfa3..9a0d970ce1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1296,6 +1296,24 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    struct RAMBlockInfo rb_info;
+
+    if (qemu_ram_block_info_from_addr(ram_addr, &rb_info)) {
+        size_t ps = rb_info.page_size;
+
+        if (ps > TARGET_PAGE_SIZE) {
+            uint64_t offset = QEMU_ALIGN_DOWN(ram_addr - rb_info.offset, ps);
+
+            if (rb_info.fd >= 0) {
+                error_report("Memory Error on large page from %s:%" PRIx64
+                             "+%" PRIx64 " +%zx", rb_info.idstr, offset,
+                             rb_info.fd_offset, ps);
+            } else {
+                error_report("Memory Error on large page from %s:%" PRIx64
+                            " +%zx", rb_info.idstr, offset, ps);
+            }
+        }
+    }
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 3771b2130c..190bd4f34a 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -110,6 +110,16 @@ int qemu_ram_get_fd(RAMBlock *rb);
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
 
+struct RAMBlockInfo {
+    char idstr[256];
+    ram_addr_t offset;
+    int fd;
+    uint64_t fd_offset;
+    size_t page_size;
+};
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+                                   struct RAMBlockInfo *block);
+
 /**
  * cpu_address_space_init:
  * @cpu: CPU to add this address space to
diff --git a/system/physmem.c b/system/physmem.c
index e8ff930bc9..686f569270 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1678,6 +1678,28 @@ size_t qemu_ram_pagesize_largest(void)
     return largest;
 }
 
+/* Copy RAMBlock information associated to the given ram_addr location */
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+                                   struct RAMBlockInfo *b_info)
+{
+    RAMBlock *rb;
+
+    assert(b_info);
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(ram_addr);
+    if (!rb) {
+        return false;
+    }
+
+    pstrcat(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
+    b_info->offset = rb->offset;
+    b_info->fd = rb->fd;
+    b_info->fd_offset = rb->fd_offset;
+    b_info->page_size = rb->page_size;
+    return true;
+}
+
 static int memory_try_enable_merging(void *addr, size_t len)
 {
     if (!machine_mem_merge(current_machine)) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index da30bdbb23..d9dedc6d74 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2389,6 +2389,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                     kvm_inject_arm_sea(c);
+                    error_report("Guest Memory Error at QEMU addr %p and "
+                        "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
+                        addr, paddr, "BUS_MCEERR_AR");
                 } else {
                     error_report("failed to record the error");
                     abort();
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-01  9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page “William Roche
@ 2025-02-04 17:01   ` Peter Xu
  2025-02-05 16:27     ` William Roche
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:01 UTC (permalink / raw)
  To: “William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Sat, Feb 01, 2025 at 09:57:23AM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> In case of a large page impacted by a memory error, provide an
> information about the impacted large page before the memory
> error injection message.
> 
> This message would also appear on ras enabled ARM platforms, with
> the introduction of an x86 similar error injection message.
> 
> In the case of a large page impacted, we now report:
> Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
> 
> The +<fd_offset> information is only provided with a file backend.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>

This is still pretty kvm / arch relevant patch that needs some reviews.

I wonder do we really need this - we could fetch ramblock mapping
(e.g. hwaddr -> HVA) via HMP "info ramblock", and also dmesg shows process
ID + VA.  IIUC we have all below info already as long as we do some math
based on above.  Would that work too?

> ---
>  accel/kvm/kvm-all.c       | 18 ++++++++++++++++++
>  include/exec/cpu-common.h | 10 ++++++++++
>  system/physmem.c          | 22 ++++++++++++++++++++++
>  target/arm/kvm.c          |  3 +++
>  4 files changed, 53 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index f89568bfa3..9a0d970ce1 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1296,6 +1296,24 @@ static void kvm_unpoison_all(void *param)
>  void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>  {
>      HWPoisonPage *page;
> +    struct RAMBlockInfo rb_info;
> +
> +    if (qemu_ram_block_info_from_addr(ram_addr, &rb_info)) {
> +        size_t ps = rb_info.page_size;
> +
> +        if (ps > TARGET_PAGE_SIZE) {
> +            uint64_t offset = QEMU_ALIGN_DOWN(ram_addr - rb_info.offset, ps);
> +
> +            if (rb_info.fd >= 0) {
> +                error_report("Memory Error on large page from %s:%" PRIx64
> +                             "+%" PRIx64 " +%zx", rb_info.idstr, offset,
> +                             rb_info.fd_offset, ps);
> +            } else {
> +                error_report("Memory Error on large page from %s:%" PRIx64
> +                            " +%zx", rb_info.idstr, offset, ps);
> +            }
> +        }
> +    }
>  
>      QLIST_FOREACH(page, &hwpoison_page_list, list) {
>          if (page->ram_addr == ram_addr) {
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 3771b2130c..190bd4f34a 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -110,6 +110,16 @@ int qemu_ram_get_fd(RAMBlock *rb);
>  size_t qemu_ram_pagesize(RAMBlock *block);
>  size_t qemu_ram_pagesize_largest(void);
>  
> +struct RAMBlockInfo {
> +    char idstr[256];
> +    ram_addr_t offset;
> +    int fd;
> +    uint64_t fd_offset;
> +    size_t page_size;
> +};
> +bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
> +                                   struct RAMBlockInfo *block);
> +
>  /**
>   * cpu_address_space_init:
>   * @cpu: CPU to add this address space to
> diff --git a/system/physmem.c b/system/physmem.c
> index e8ff930bc9..686f569270 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1678,6 +1678,28 @@ size_t qemu_ram_pagesize_largest(void)
>      return largest;
>  }
>  
> +/* Copy RAMBlock information associated to the given ram_addr location */
> +bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
> +                                   struct RAMBlockInfo *b_info)
> +{
> +    RAMBlock *rb;
> +
> +    assert(b_info);
> +
> +    RCU_READ_LOCK_GUARD();
> +    rb =  qemu_get_ram_block(ram_addr);
> +    if (!rb) {
> +        return false;
> +    }
> +
> +    pstrcat(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
> +    b_info->offset = rb->offset;
> +    b_info->fd = rb->fd;
> +    b_info->fd_offset = rb->fd_offset;
> +    b_info->page_size = rb->page_size;
> +    return true;
> +}
> +
>  static int memory_try_enable_merging(void *addr, size_t len)
>  {
>      if (!machine_mem_merge(current_machine)) {
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index da30bdbb23..d9dedc6d74 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -2389,6 +2389,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>                  kvm_cpu_synchronize_state(c);
>                  if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
>                      kvm_inject_arm_sea(c);
> +                    error_report("Guest Memory Error at QEMU addr %p and "
> +                        "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
> +                        addr, paddr, "BUS_MCEERR_AR");
>                  } else {
>                      error_report("failed to record the error");
>                      abort();
> -- 
> 2.43.5
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-04 17:01   ` Peter Xu
@ 2025-02-05 16:27     ` William Roche
  2025-02-05 17:07       ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On 2/4/25 18:01, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:23AM +0000, “William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> In case of a large page impacted by a memory error, provide an
>> information about the impacted large page before the memory
>> error injection message.
>>
>> This message would also appear on ras enabled ARM platforms, with
>> the introduction of an x86 similar error injection message.
>>
>> In the case of a large page impacted, we now report:
>> Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
>>
>> The +<fd_offset> information is only provided with a file backend.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
> 
> This is still pretty kvm / arch relevant patch that needs some reviews.
> 
> I wonder do we really need this - we could fetch ramblock mapping
> (e.g. hwaddr -> HVA) via HMP "info ramblock", and also dmesg shows process
> ID + VA.  IIUC we have all below info already as long as we do some math
> based on above.  Would that work too?

The HMP command "info ramblock" is implemented with the 
ram_block_format() function which returns a message buffer built with a 
string for each ramblock (protected by the RCU_READ_LOCK_GUARD). Our new 
function copies a struct with the necessary information.

Relaying on the buffer format to retrieve the information doesn't seem 
reasonable, and more importantly, this buffer doesn't provide all the 
needed data, like fd and fd_offset.

I would say that ram_block_format() and qemu_ram_block_info_from_addr() 
serve 2 different goals.

(a reimplementation of ram_block_format() with an adapted version of 
qemu_ram_block_info_from_addr() taking the extra information needed 
could be doable for example, but may not be worth doing for now)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-05 16:27     ` William Roche
@ 2025-02-05 17:07       ` Peter Xu
  2025-02-07 18:02         ` William Roche
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-05 17:07 UTC (permalink / raw)
  To: William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
> On 2/4/25 18:01, Peter Xu wrote:
> > On Sat, Feb 01, 2025 at 09:57:23AM +0000, “William Roche wrote:
> > > From: William Roche <william.roche@oracle.com>
> > > 
> > > In case of a large page impacted by a memory error, provide an
> > > information about the impacted large page before the memory
> > > error injection message.
> > > 
> > > This message would also appear on ras enabled ARM platforms, with
> > > the introduction of an x86 similar error injection message.
> > > 
> > > In the case of a large page impacted, we now report:
> > > Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
> > > 
> > > The +<fd_offset> information is only provided with a file backend.
> > > 
> > > Signed-off-by: William Roche <william.roche@oracle.com>
> > 
> > This is still pretty kvm / arch relevant patch that needs some reviews.
> > 
> > I wonder do we really need this - we could fetch ramblock mapping
> > (e.g. hwaddr -> HVA) via HMP "info ramblock", and also dmesg shows process
> > ID + VA.  IIUC we have all below info already as long as we do some math
> > based on above.  Would that work too?
> 
> The HMP command "info ramblock" is implemented with the ram_block_format()
> function which returns a message buffer built with a string for each
> ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
> struct with the necessary information.
> 
> Relaying on the buffer format to retrieve the information doesn't seem
> reasonable, and more importantly, this buffer doesn't provide all the needed
> data, like fd and fd_offset.
> 
> I would say that ram_block_format() and qemu_ram_block_info_from_addr()
> serve 2 different goals.
> 
> (a reimplementation of ram_block_format() with an adapted version of
> qemu_ram_block_info_from_addr() taking the extra information needed could be
> doable for example, but may not be worth doing for now)

IIUC admin should be aware of fd_offset because the admin should be fully
aware of the start offset of FDs to specify in qemu cmdlines, or in
Libvirt. But yes, we can always add fd_offset into ram_block_format() if
it's helpful.

Besides, the existing issues on this patch:

  - From outcome of this patch, it introduces one ramblock API (which is ok
    to me, so far), to do some error_report()s.  It looks pretty much for
    debugging rather than something serious (e.g. report via QMP queries,
    QMP events etc.).  From debug POV, I still don't see why this is
    needed.. per discussed above.

  - From merge POV, this patch isn't a pure memory change, so I'll need to
    get ack from other maintainers, at least that should be how it works..

I feel like when hwpoison becomes a serious topic, we need some more
serious reporting facility than error reports.  So that we could have this
as separate topic to be revisited.  It might speed up your prior patches
from not being blocked on this.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-05 17:07       ` Peter Xu
@ 2025-02-07 18:02         ` William Roche
  2025-02-10 16:48           ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-07 18:02 UTC (permalink / raw)
  To: Peter Xu, david
  Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

On 2/5/25 18:07, Peter Xu wrote:
> On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
>> [...]
>> The HMP command "info ramblock" is implemented with the ram_block_format()
>> function which returns a message buffer built with a string for each
>> ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
>> struct with the necessary information.
>>
>> Relaying on the buffer format to retrieve the information doesn't seem
>> reasonable, and more importantly, this buffer doesn't provide all the needed
>> data, like fd and fd_offset.
>>
>> I would say that ram_block_format() and qemu_ram_block_info_from_addr()
>> serve 2 different goals.
>>
>> (a reimplementation of ram_block_format() with an adapted version of
>> qemu_ram_block_info_from_addr() taking the extra information needed could be
>> doable for example, but may not be worth doing for now)
> 
> IIUC admin should be aware of fd_offset because the admin should be fully
> aware of the start offset of FDs to specify in qemu cmdlines, or in
> Libvirt. But yes, we can always add fd_offset into ram_block_format() if
> it's helpful.
> 
> Besides, the existing issues on this patch:
> 
>    - From outcome of this patch, it introduces one ramblock API (which is ok
>      to me, so far), to do some error_report()s.  It looks pretty much for
>      debugging rather than something serious (e.g. report via QMP queries,
>      QMP events etc.).  From debug POV, I still don't see why this is
>      needed.. per discussed above.

The reason why I want to inform the user of a large memory failure more 
specifically than a standard sized page loss is because of the 
significant behavior difference: Our current implementation can 
transparently handle many situations without necessarily leading the VM 
to a crash. But when it comes to large pages, there is no mechanism to 
inform the VM of a large memory loss, and usually this situation leads 
the VM to crash, and can also generate some weird situations like qemu 
itself crashing or a loop of errors, for example.

So having a message informing of such a memory loss can help to 
understand a more radical VM or qemu behavior -- it increases the 
diagnosability of our code.

To verify that a SIGBUS appeared because of a large page loss, we 
currently need to verify the targeted memory block backend page_size.
We should usually get this information from the SIGBUS siginfo data 
(with a si_addr_lsb field giving an indication of the page size) but a 
KVM weakness with a hardcoded si_addr_lsb=PAGE_SHIFT value in the SIGBUS 
siginfo returned from the kernel prevents that: See 
kvm_send_hwpoison_signal() function.

So I first wrote a small API addition called 
qemu_ram_pagesize_from_addr() to retrieve only this page_size value from 
the impacted address; and later on, this function turned into the richer 
qemu_ram_block_info_from_addr() function to have the generated messages 
match the existing memory messages as rightly requested by David.

So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), 
and the second reason is to have richer error messages.

>    - From merge POV, this patch isn't a pure memory change, so I'll need to
>      get ack from other maintainers, at least that should be how it works..

I agree :)

> 
> I feel like when hwpoison becomes a serious topic, we need some more
> serious reporting facility than error reports.  So that we could have this
> as separate topic to be revisited.  It might speed up your prior patches
> from not being blocked on this.

I explained why I think that error messages are important, but I don't 
want to get blocked on fixing the hugepage memory recovery because of that.

If you think that not displaying a specific message for large page loss 
can help to get the recovery fixed, than I can change my proposal to do so.

Early next week, I'll send a simplified version of my first 3 patches 
without this specific messages and without the preallocation handling in 
all remap cases, so you can evaluate this possibility.

Thank again for your feedback
William.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-07 18:02         ` William Roche
@ 2025-02-10 16:48           ` Peter Xu
  2025-02-11 21:22             ` William Roche
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-10 16:48 UTC (permalink / raw)
  To: William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
> On 2/5/25 18:07, Peter Xu wrote:
> > On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
> > > [...]
> > > The HMP command "info ramblock" is implemented with the ram_block_format()
> > > function which returns a message buffer built with a string for each
> > > ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
> > > struct with the necessary information.
> > > 
> > > Relaying on the buffer format to retrieve the information doesn't seem
> > > reasonable, and more importantly, this buffer doesn't provide all the needed
> > > data, like fd and fd_offset.
> > > 
> > > I would say that ram_block_format() and qemu_ram_block_info_from_addr()
> > > serve 2 different goals.
> > > 
> > > (a reimplementation of ram_block_format() with an adapted version of
> > > qemu_ram_block_info_from_addr() taking the extra information needed could be
> > > doable for example, but may not be worth doing for now)
> > 
> > IIUC admin should be aware of fd_offset because the admin should be fully
> > aware of the start offset of FDs to specify in qemu cmdlines, or in
> > Libvirt. But yes, we can always add fd_offset into ram_block_format() if
> > it's helpful.
> > 
> > Besides, the existing issues on this patch:
> > 
> >    - From outcome of this patch, it introduces one ramblock API (which is ok
> >      to me, so far), to do some error_report()s.  It looks pretty much for
> >      debugging rather than something serious (e.g. report via QMP queries,
> >      QMP events etc.).  From debug POV, I still don't see why this is
> >      needed.. per discussed above.
> 
> The reason why I want to inform the user of a large memory failure more
> specifically than a standard sized page loss is because of the significant
> behavior difference: Our current implementation can transparently handle
> many situations without necessarily leading the VM to a crash. But when it
> comes to large pages, there is no mechanism to inform the VM of a large
> memory loss, and usually this situation leads the VM to crash, and can also
> generate some weird situations like qemu itself crashing or a loop of
> errors, for example.
> 
> So having a message informing of such a memory loss can help to understand a
> more radical VM or qemu behavior -- it increases the diagnosability of our
> code.
> 
> To verify that a SIGBUS appeared because of a large page loss, we currently
> need to verify the targeted memory block backend page_size.
> We should usually get this information from the SIGBUS siginfo data (with a
> si_addr_lsb field giving an indication of the page size) but a KVM weakness
> with a hardcoded si_addr_lsb=PAGE_SHIFT value in the SIGBUS siginfo returned
> from the kernel prevents that: See kvm_send_hwpoison_signal() function.
> 
> So I first wrote a small API addition called qemu_ram_pagesize_from_addr()
> to retrieve only this page_size value from the impacted address; and later
> on, this function turned into the richer qemu_ram_block_info_from_addr()
> function to have the generated messages match the existing memory messages
> as rightly requested by David.
> 
> So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
> the second reason is to have richer error messages.

This seems true, and I also remember something when I looked at this
previously but maybe nobody tried to fix it.  ARM seems to be correct on
that field, otoh.

Is it possible we fix KVM on x86?

kvm_handle_error_pfn() has the fault context, so IIUC it should be able to
figure that out too like what ARM does (with get_vma_page_shift()).

> 
> 
> 
> >    - From merge POV, this patch isn't a pure memory change, so I'll need to
> >      get ack from other maintainers, at least that should be how it works..
> 
> I agree :)
> 
> > 
> > I feel like when hwpoison becomes a serious topic, we need some more
> > serious reporting facility than error reports.  So that we could have this
> > as separate topic to be revisited.  It might speed up your prior patches
> > from not being blocked on this.
> 
> I explained why I think that error messages are important, but I don't want
> to get blocked on fixing the hugepage memory recovery because of that.

What is the major benefit of reporting in QEMU's stderr in this case?

For example, how should we consume the error reports that this patch
introduces?  Is it still for debugging purpose?

I agree it's always better to dump something in QEMU when such happened,
but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
to deduce the page size of affected vaddr, especially if it's for debugging
purpose.  However I could possibly have missed the goal here..

> 
> If you think that not displaying a specific message for large page loss can
> help to get the recovery fixed, than I can change my proposal to do so.
> 
> Early next week, I'll send a simplified version of my first 3 patches
> without this specific messages and without the preallocation handling in all
> remap cases, so you can evaluate this possibility.

Yes IMHO it'll always be helpful to separate it if possible.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-10 16:48           ` Peter Xu
@ 2025-02-11 21:22             ` William Roche
  2025-02-11 21:45               ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-11 21:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On 2/10/25 17:48, Peter Xu wrote:
> On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
>> [...]
>> So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
>> the second reason is to have richer error messages.
> 
> This seems true, and I also remember something when I looked at this
> previously but maybe nobody tried to fix it.  ARM seems to be correct on
> that field, otoh.
> 
> Is it possible we fix KVM on x86?

Yes, very probably, and it would be a kernel fix.
This kernel modification would be needed to run on the hypervisor first 
to influence a new code in qemu able to use the SIGBUS siginfo 
information and identify the size of the page impacted (instead of using 
an internal addition to kvm API).
But this mechanism could help to generate a large page memory error 
specific message on SIGBUS receiving.

>>>
>>> I feel like when hwpoison becomes a serious topic, we need some more
>>> serious reporting facility than error reports.  So that we could have this
>>> as separate topic to be revisited.  It might speed up your prior patches
>>> from not being blocked on this.
>>
>> I explained why I think that error messages are important, but I don't want
>> to get blocked on fixing the hugepage memory recovery because of that.
> 
> What is the major benefit of reporting in QEMU's stderr in this case?

Such messages can be collected into VM specific log file, as any other 
error_report() message, like the existing x86 error injection messages 
reported by Qemu.
This messages should help the administrator to better understand the 
behavior of the VM.

> For example, how should we consume the error reports that this patch
> introduces?  Is it still for debugging purpose?

Its not only debugging, but it's a trace of a significant event that can 
have major consequences on the VM.

> 
> I agree it's always better to dump something in QEMU when such happened,
> but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
> monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
> to deduce the page size of affected vaddr, especially if it's for debugging
> purpose.  However I could possibly have missed the goal here..

You're right that knowing the address, the administrator can deduce what 
memory area was impacted and the associated page size. But the goal of 
these large page specific messages was to give details on the event type 
and immediately qualify the consequences.
Using large pages can also have drawbacks, and a large page specific 
message on memory error makes that more obvious !  Not only a debug msg, 
but an indication that the VM lost an unusually large amount of its memory.

>>
>> If you think that not displaying a specific message for large page loss can
>> help to get the recovery fixed, than I can change my proposal to do so.
>>
>> Early next week, I'll send a simplified version of my first 3 patches
>> without this specific messages and without the preallocation handling in all
>> remap cases, so you can evaluate this possibility.
> 
> Yes IMHO it'll always be helpful to separate it if possible.

I'm sending now a v8 version, without the specific messages and the 
remap notification. It should fix the main recovery bug we currently 
have. More messages and a notification dealing with pre-allocation can 
be added in a second step.

Please let me know if this v8 version can be integrated without the 
prealloc and specific messages ?

Thanks,
William.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
  2025-02-11 21:22             ` William Roche
@ 2025-02-11 21:45               ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-11 21:45 UTC (permalink / raw)
  To: William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Tue, Feb 11, 2025 at 10:22:38PM +0100, William Roche wrote:
> On 2/10/25 17:48, Peter Xu wrote:
> > On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
> > > [...]
> > > So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
> > > the second reason is to have richer error messages.
> > 
> > This seems true, and I also remember something when I looked at this
> > previously but maybe nobody tried to fix it.  ARM seems to be correct on
> > that field, otoh.
> > 
> > Is it possible we fix KVM on x86?
> 
> Yes, very probably, and it would be a kernel fix.
> This kernel modification would be needed to run on the hypervisor first to
> influence a new code in qemu able to use the SIGBUS siginfo information and
> identify the size of the page impacted (instead of using an internal
> addition to kvm API).
> But this mechanism could help to generate a large page memory error specific
> message on SIGBUS receiving.

Yes, QEMU should probably better be able to work on both old/new kernels,
even if this will be fixed.

> 
> 
> > > > 
> > > > I feel like when hwpoison becomes a serious topic, we need some more
> > > > serious reporting facility than error reports.  So that we could have this
> > > > as separate topic to be revisited.  It might speed up your prior patches
> > > > from not being blocked on this.
> > > 
> > > I explained why I think that error messages are important, but I don't want
> > > to get blocked on fixing the hugepage memory recovery because of that.
> > 
> > What is the major benefit of reporting in QEMU's stderr in this case?
> 
> Such messages can be collected into VM specific log file, as any other
> error_report() message, like the existing x86 error injection messages
> reported by Qemu.
> This messages should help the administrator to better understand the
> behavior of the VM.

I'll still put "better understand the behavior of VM" into debugging
category. :)

But I agree such can be important information.  That's also why I was
curious whether it should be something like a QMP event instead.  That's a
much formal way of sending important messages.

> 
> 
> > For example, how should we consume the error reports that this patch
> > introduces?  Is it still for debugging purpose?
> 
> Its not only debugging, but it's a trace of a significant event that can
> have major consequences on the VM.
> 
> > 
> > I agree it's always better to dump something in QEMU when such happened,
> > but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
> > monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
> > to deduce the page size of affected vaddr, especially if it's for debugging
> > purpose.  However I could possibly have missed the goal here..
> 
> You're right that knowing the address, the administrator can deduce what
> memory area was impacted and the associated page size. But the goal of these
> large page specific messages was to give details on the event type and
> immediately qualify the consequences.
> Using large pages can also have drawbacks, and a large page specific message
> on memory error makes that more obvious !  Not only a debug msg, but an
> indication that the VM lost an unusually large amount of its memory.
> 
> > > 
> > > If you think that not displaying a specific message for large page loss can
> > > help to get the recovery fixed, than I can change my proposal to do so.
> > > 
> > > Early next week, I'll send a simplified version of my first 3 patches
> > > without this specific messages and without the preallocation handling in all
> > > remap cases, so you can evaluate this possibility.
> > 
> > Yes IMHO it'll always be helpful to separate it if possible.
> 
> I'm sending now a v8 version, without the specific messages and the remap
> notification. It should fix the main recovery bug we currently have. More
> messages and a notification dealing with pre-allocation can be added in a
> second step.
> 
> Please let me know if this v8 version can be integrated without the prealloc
> and specific messages ?

IMHO fixing hugetlb page is still a progress on its own, even without any
added error message, or proactive allocation during reset.

One issue is the v8 still contains patch 3 which is for ARM kvm.. You may
need to post it separately for ARM maintainers to review & collect.  I'll
be able to queue patch 1-2.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
                   ` (2 preceding siblings ...)
  2025-02-01  9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-04 17:17   ` Peter Xu
  2025-02-01  9:57 ` [PATCH v7 5/6] hostmem: Factor out applying settings “William Roche
  2025-02-01  9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM “William Roche
  5 siblings, 1 reply; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 218576f745..003bcd8a66 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index 686f569270..561b2c38c0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2259,6 +2259,7 @@ void qemu_ram_remap(ram_addr_t addr)
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
+                ram_block_notify_remap(block->host, offset, page_size);
             }
 
             break;
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
  2025-02-01  9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
@ 2025-02-04 17:17   ` Peter Xu
  2025-02-04 17:42     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:17 UTC (permalink / raw)
  To: “William Roche
  Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
	philmd, peter.maydell, mtosatti, imammedo, eduardo,
	marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Sat, Feb 01, 2025 at 09:57:24AM +0000, “William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> Notify registered listeners about the remap at the end of
> qemu_ram_remap() so e.g., a memory backend can re-apply its
> settings correctly.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>

IIUC logically speaking we don't need a global remap notifier - here a
per-ramblock notifier looks more reasonable, like RAMBlock.resized().
It'll change the notify path from O(N**2) to O(N).  After all, backend1's
notifier won't care other ramblock's remap() events but only itself's.

It's not a huge deal as I expect we don't have a huge amount of ramblocks,
but looks like this series will miss the recent pull anyway..  so let me
comment as so on this one for consideration when respin.

We could also merge partial of the series to fix hugetlb poisoning first,
as this one looks like can be separately done too.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
  2025-02-04 17:17   ` Peter Xu
@ 2025-02-04 17:42     ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 17:42 UTC (permalink / raw)
  To: Peter Xu, “William Roche
  Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

On 04.02.25 18:17, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:24AM +0000, “William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Notify registered listeners about the remap at the end of
>> qemu_ram_remap() so e.g., a memory backend can re-apply its
>> settings correctly.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
> 
> IIUC logically speaking we don't need a global remap notifier - here a
> per-ramblock notifier looks more reasonable, like RAMBlock.resized().

Right. Note that qemu_ram_resize() also triggers global notifiers.

> It'll change the notify path from O(N**2) to O(N).  After all, backend1's
> notifier won't care other ramblock's remap() events but only itself's.
> 
> It's not a huge deal as I expect we don't have a huge amount of ramblocks,
> but looks like this series will miss the recent pull anyway..  so let me
> comment as so on this one for consideration when respin.

... and ram remap during reboot is not particularly the fast path we 
care about (or should be caring about).

> 
> We could also merge partial of the series to fix hugetlb poisoning first,
> as this one looks like can be separately done too.

hugetlb frequently uses preallocation, and the remaining patches in this 
series make sure preallocation after remapping happens.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 5/6] hostmem: Factor out applying settings
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
                   ` (3 preceding siblings ...)
  2025-02-01  9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-01  9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM “William Roche
  5 siblings, 0 replies; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bceca1a8d9..46d80f98b4 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
                   ` (4 preceding siblings ...)
  2025-02-01  9:57 ` [PATCH v7 5/6] hostmem: Factor out applying settings “William Roche
@ 2025-02-01  9:57 ` “William Roche
  2025-02-04 17:50   ` David Hildenbrand
  5 siblings, 1 reply; 25+ messages in thread
From: “William Roche @ 2025-02-01  9:57 UTC (permalink / raw)
  To: david, kvm, qemu-devel, qemu-arm
  Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

From: William Roche <william.roche@oracle.com>

Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Exit if something goes wrong.

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
 include/system/hostmem.h |  1 +
 system/physmem.c         |  4 ----
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 46d80f98b4..4589467c77 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        /*
+         * If memory settings can't be successfully applied on remap,
+         * don't take the risk to continue without them.
+         */
+        error_report_err(err);
+        exit(1);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +405,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/system/hostmem.h b/include/system/hostmem.h
index 5c21ca55c0..170849e8a4 100644
--- a/include/system/hostmem.h
+++ b/include/system/hostmem.h
@@ -83,6 +83,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
diff --git a/system/physmem.c b/system/physmem.c
index 561b2c38c0..a047fd680e 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2223,7 +2223,6 @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     uint64_t offset;
-    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2233,7 +2232,6 @@ void qemu_ram_remap(ram_addr_t addr)
             page_size = qemu_ram_pagesize(block);
             offset = QEMU_ALIGN_DOWN(offset, page_size);
 
-            vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
             } else if (xen_enabled()) {
@@ -2257,8 +2255,6 @@ void qemu_ram_remap(ram_addr_t addr)
                         exit(1);
                     }
                 }
-                memory_try_enable_merging(vaddr, page_size);
-                qemu_ram_setup_dump(vaddr, page_size);
                 ram_block_notify_remap(block->host, offset, page_size);
             }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-01  9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM “William Roche
@ 2025-02-04 17:50   ` David Hildenbrand
  2025-02-04 17:58     ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 17:50 UTC (permalink / raw)
  To: “William Roche, kvm, qemu-devel, qemu-arm
  Cc: peterx, pbonzini, richard.henderson, philmd, peter.maydell,
	mtosatti, imammedo, eduardo, marcel.apfelbaum, wangyanan55,
	zhao1.liu, joao.m.martins

>       /*
> @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
>       .instance_size = sizeof(HostMemoryBackend),
>       .instance_init = host_memory_backend_init,
>       .instance_post_init = host_memory_backend_post_init,
> +    .instance_finalize = host_memory_backend_finalize,
>       .interfaces = (InterfaceInfo[]) {
>           { TYPE_USER_CREATABLE },
>           { }
> diff --git a/include/system/hostmem.h b/include/system/hostmem.h
> index 5c21ca55c0..170849e8a4 100644
> --- a/include/system/hostmem.h
> +++ b/include/system/hostmem.h
> @@ -83,6 +83,7 @@ struct HostMemoryBackend {
>       HostMemPolicy policy;
>   
>       MemoryRegion mr;
> +    RAMBlockNotifier ram_notifier;
>   };

Thinking about Peters comment, it would be a nice improvement to have a 
single global memory-backend notifier that looks up the fitting memory 
backend, instead of having one per memory backend.

A per-ramblock notifier might also be possible, but that's a bit 
harder/ackward to configure: e.g., the resize callback is passed to 
memory_region_init_resizeable_ram() right now.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-04 17:50   ` David Hildenbrand
@ 2025-02-04 17:58     ` Peter Xu
  2025-02-04 18:55       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: “William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
	richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
	eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Tue, Feb 04, 2025 at 06:50:17PM +0100, David Hildenbrand wrote:
> >       /*
> > @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
> >       .instance_size = sizeof(HostMemoryBackend),
> >       .instance_init = host_memory_backend_init,
> >       .instance_post_init = host_memory_backend_post_init,
> > +    .instance_finalize = host_memory_backend_finalize,
> >       .interfaces = (InterfaceInfo[]) {
> >           { TYPE_USER_CREATABLE },
> >           { }
> > diff --git a/include/system/hostmem.h b/include/system/hostmem.h
> > index 5c21ca55c0..170849e8a4 100644
> > --- a/include/system/hostmem.h
> > +++ b/include/system/hostmem.h
> > @@ -83,6 +83,7 @@ struct HostMemoryBackend {
> >       HostMemPolicy policy;
> >       MemoryRegion mr;
> > +    RAMBlockNotifier ram_notifier;
> >   };
> 
> Thinking about Peters comment, it would be a nice improvement to have a
> single global memory-backend notifier that looks up the fitting memory
> backend, instead of having one per memory backend.

Yes, this could also avoid O(N**2).

> 
> A per-ramblock notifier might also be possible, but that's a bit
> harder/ackward to configure: e.g., the resize callback is passed to
> memory_region_init_resizeable_ram() right now.

Yes, that can be some fuss on code to be touched up.  We could avoid
passing that in when create the ramblock, instead we could allow ramblocks
to opt-in on hooks after ramblock created.  Maybe we could move resize()
out too like that.

Either way looks good.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-04 17:58     ` Peter Xu
@ 2025-02-04 18:55       ` David Hildenbrand
  2025-02-04 20:16         ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 18:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: “William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
	richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
	eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On 04.02.25 18:58, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 06:50:17PM +0100, David Hildenbrand wrote:
>>>        /*
>>> @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
>>>        .instance_size = sizeof(HostMemoryBackend),
>>>        .instance_init = host_memory_backend_init,
>>>        .instance_post_init = host_memory_backend_post_init,
>>> +    .instance_finalize = host_memory_backend_finalize,
>>>        .interfaces = (InterfaceInfo[]) {
>>>            { TYPE_USER_CREATABLE },
>>>            { }
>>> diff --git a/include/system/hostmem.h b/include/system/hostmem.h
>>> index 5c21ca55c0..170849e8a4 100644
>>> --- a/include/system/hostmem.h
>>> +++ b/include/system/hostmem.h
>>> @@ -83,6 +83,7 @@ struct HostMemoryBackend {
>>>        HostMemPolicy policy;
>>>        MemoryRegion mr;
>>> +    RAMBlockNotifier ram_notifier;
>>>    };
>>
>> Thinking about Peters comment, it would be a nice improvement to have a
>> single global memory-backend notifier that looks up the fitting memory
>> backend, instead of having one per memory backend.
> 
> Yes, this could also avoid O(N**2).

Ah, and now I remember where these 3 patches originate from: virtio-mem 
handling.

For virtio-mem I want to register also a remap handler, for example, to 
perform the custom preallocation handling.

So there will be at least two instances getting notified (memory 
backend, virtio-mem), and the per-ramblock one would have only allowed 
to trigger one (at least with a simple callback as we have today for 
->resize).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-04 18:55       ` David Hildenbrand
@ 2025-02-04 20:16         ` Peter Xu
  2025-02-05 16:27           ` William Roche
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 20:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: “William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
	richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
	eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
> Ah, and now I remember where these 3 patches originate from: virtio-mem
> handling.
> 
> For virtio-mem I want to register also a remap handler, for example, to
> perform the custom preallocation handling.
> 
> So there will be at least two instances getting notified (memory backend,
> virtio-mem), and the per-ramblock one would have only allowed to trigger one
> (at least with a simple callback as we have today for ->resize).

I see, we can put something into commit log with such on decisions, then
we'll remember.

Said that, this still sounds like a per-ramblock thing, so instead of one
hook function we can also have per-ramblock notifier lists.

But I agree the perf issue isn't some immediate concern, so I'll leave that
to you and William.  If so I think we should discuss that in the commit log
too, so we decide to not care about perf until necessary (or we just make
it per-ramblock..).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-04 20:16         ` Peter Xu
@ 2025-02-05 16:27           ` William Roche
  2025-02-05 17:58             ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
	peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
	wangyanan55, zhao1.liu, joao.m.martins

On 2/4/25 21:16, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
>> Ah, and now I remember where these 3 patches originate from: virtio-mem
>> handling.
>>
>> For virtio-mem I want to register also a remap handler, for example, to
>> perform the custom preallocation handling.
>>
>> So there will be at least two instances getting notified (memory backend,
>> virtio-mem), and the per-ramblock one would have only allowed to trigger one
>> (at least with a simple callback as we have today for ->resize).
> 
> I see, we can put something into commit log with such on decisions, then
> we'll remember.
> 
> Said that, this still sounds like a per-ramblock thing, so instead of one
> hook function we can also have per-ramblock notifier lists.
> 
> But I agree the perf issue isn't some immediate concern, so I'll leave that
> to you and William.  If so I think we should discuss that in the commit log
> too, so we decide to not care about perf until necessary (or we just make
> it per-ramblock..).
> 
> Thanks,
> 

I agree that we could split this fix in 2 parts: The one fixing the 
hugetlbfs (ignoring the preallocation setting for the moment), and the 
notification mechanism as a second set of patches.

The first part would be the 3 first patches (including a corrected 
version of patch 2)  and the second part could be an adaptation of the 
next 3 patches, with their notification implementation dealing with 
merging, dump *and* preallocation setup.

But I'd be happy to help with the implementation of this 2nd aspect too:

In order to apply settings like preallocation to a RAMBLock we need to 
find its associated HostMemoryBackend (where we have the 'prealloc' flag).
To do so, we record a RAMBlockNotifier in the HostMemoryBackend struct, 
so that the notification triggered by the remap action:
    ram_block_notify_remap(block->host, offset, page_size);
will go through the list of notifiers ram_list.ramblock_notifiers to run 
the not NULL ram_block_remapped entries on all of them.

For each of them, we know the associated HostMemoryBackend (as it 
contains the RAMBlockNotifier), and we verify which one corresponds to 
the host address given, so that we can apply the appropriate settings.

IIUC, my proposal (with David's code) currently has a 
per-HostMemoryBackend notification.

Now if I want to implement a per-RAMBlock notification, would you 
suggest to consider that the 'mr' attibute of a RAMBlock always points 
to a HostMemoryBackend.mr, so that we could get the HostMemoryBackend 
associated to the block from a
     container_of(block->mr, HostMemoryBackend, mr) ?

If this is valid, than we could apply the appropriate settings from 
there, but can't we have RAMBlocks not pointing to a HostMemoryBackend.mr ?

I'm probably confused about what you are referring to.
So how would you suggest that I make the notification per-ramblock ?
Thanks in advance for your feedback.

I'll send a corrected version of the first 3 patches, unless you want to 
go with the current version of the patches 4/6, 5/6 and 6/6, so that we 
can deal with preallocation.

Please let me know.

Thanks,
William.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
  2025-02-05 16:27           ` William Roche
@ 2025-02-05 17:58             ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-05 17:58 UTC (permalink / raw)
  To: William Roche
  Cc: David Hildenbrand, kvm, qemu-devel, qemu-arm, pbonzini,
	richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
	eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins

On Wed, Feb 05, 2025 at 05:27:50PM +0100, William Roche wrote:
> On 2/4/25 21:16, Peter Xu wrote:
> > On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
> > > Ah, and now I remember where these 3 patches originate from: virtio-mem
> > > handling.
> > > 
> > > For virtio-mem I want to register also a remap handler, for example, to
> > > perform the custom preallocation handling.
> > > 
> > > So there will be at least two instances getting notified (memory backend,
> > > virtio-mem), and the per-ramblock one would have only allowed to trigger one
> > > (at least with a simple callback as we have today for ->resize).
> > 
> > I see, we can put something into commit log with such on decisions, then
> > we'll remember.
> > 
> > Said that, this still sounds like a per-ramblock thing, so instead of one
> > hook function we can also have per-ramblock notifier lists.
> > 
> > But I agree the perf issue isn't some immediate concern, so I'll leave that
> > to you and William.  If so I think we should discuss that in the commit log
> > too, so we decide to not care about perf until necessary (or we just make
> > it per-ramblock..).
> > 
> > Thanks,
> > 
> 
> 
> I agree that we could split this fix in 2 parts: The one fixing the
> hugetlbfs (ignoring the preallocation setting for the moment), and the
> notification mechanism as a second set of patches.
> 
> The first part would be the 3 first patches (including a corrected version
> of patch 2)  and the second part could be an adaptation of the next 3
> patches, with their notification implementation dealing with merging, dump
> *and* preallocation setup.
> 
> 
> But I'd be happy to help with the implementation of this 2nd aspect too:
> 
> In order to apply settings like preallocation to a RAMBLock we need to find
> its associated HostMemoryBackend (where we have the 'prealloc' flag).
> To do so, we record a RAMBlockNotifier in the HostMemoryBackend struct, so
> that the notification triggered by the remap action:
>    ram_block_notify_remap(block->host, offset, page_size);
> will go through the list of notifiers ram_list.ramblock_notifiers to run the
> not NULL ram_block_remapped entries on all of them.
> 
> For each of them, we know the associated HostMemoryBackend (as it contains
> the RAMBlockNotifier), and we verify which one corresponds to the host
> address given, so that we can apply the appropriate settings.
> 
> IIUC, my proposal (with David's code) currently has a per-HostMemoryBackend
> notification.
> 
> Now if I want to implement a per-RAMBlock notification, would you suggest to
> consider that the 'mr' attibute of a RAMBlock always points to a
> HostMemoryBackend.mr, so that we could get the HostMemoryBackend associated
> to the block from a
>     container_of(block->mr, HostMemoryBackend, mr) ?
> 
> If this is valid, than we could apply the appropriate settings from there,
> but can't we have RAMBlocks not pointing to a HostMemoryBackend.mr ?

Yes, QEMU definitely has ramblocks that are not backed by memory backends.
However each memory backend must have its ramblock.

IIUC what we need to do is let host_memory_backend_memory_complete()
register a per-ramblock notifier on top of its ramblock, which can be
referenced by backend->mr.ramblock.

> 
> 
> I'm probably confused about what you are referring to.
> So how would you suggest that I make the notification per-ramblock ?
> Thanks in advance for your feedback.
> 
> 
> I'll send a corrected version of the first 3 patches, unless you want to go
> with the current version of the patches 4/6, 5/6 and 6/6, so that we can
> deal with preallocation.

I don't feel strongly, but I can explain how the per-ramblock can be done.

One thing interesting I found is we actually have such notifier list
already in ramblocks.. see:

struct RAMBlock {
    ...
    QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
    ...
}

I guess that's some leftover from the global ramblock notifier.. e.g. I
tried remove that line and qemu compiles all fine.

Then instead of removing it, we could make that the per-ramblock list.

One way to do this is:

  - Patch 1: refactor current code, let RAMBlock.resized() to be a notifier
    instead of a fn() pointer passed over from
    memory_region_init_resizeable_ram().  It means we can remove
    RAMBlock.resized() but make fw_cfg_resized() becomes a notifier, taking
    RAM_BLOCK_RESIZED event instead.

  - Patch 2: introduce another RAM_BLOCK_REMAPPED event, then host backends
    (probably, host_memory_backend_memory_complete() after alloc() done so
    that the ramblock will always be available..) can register a notifier
    only looking for REMAPPED.

Then in the future virtio-mem can register similarly to specific ramblock
on REMAPPED only.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-02-11 21:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-01  9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot “William Roche
2025-02-01  9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() “William Roche
2025-02-04 17:09   ` Peter Xu
2025-02-01  9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot “William Roche
2025-02-04 17:09   ` Peter Xu
2025-02-05 16:27     ` William Roche
2025-02-01  9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page “William Roche
2025-02-04 17:01   ` Peter Xu
2025-02-05 16:27     ` William Roche
2025-02-05 17:07       ` Peter Xu
2025-02-07 18:02         ` William Roche
2025-02-10 16:48           ` Peter Xu
2025-02-11 21:22             ` William Roche
2025-02-11 21:45               ` Peter Xu
2025-02-01  9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() “William Roche
2025-02-04 17:17   ` Peter Xu
2025-02-04 17:42     ` David Hildenbrand
2025-02-01  9:57 ` [PATCH v7 5/6] hostmem: Factor out applying settings “William Roche
2025-02-01  9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM “William Roche
2025-02-04 17:50   ` David Hildenbrand
2025-02-04 17:58     ` Peter Xu
2025-02-04 18:55       ` David Hildenbrand
2025-02-04 20:16         ` Peter Xu
2025-02-05 16:27           ` William Roche
2025-02-05 17:58             ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).