* [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-04 17:09 ` Peter Xu
2025-02-01 9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot William Roche
` (4 subsequent siblings)
5 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: William Roche <william.roche@oracle.com>
The list of hwpoison pages used to remap the memory on reset
is based on the backend's real page size.
To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
hugetlb page; hugetlb pages cannot be partially mapped.
Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
accel/kvm/kvm-all.c | 2 +-
include/exec/cpu-common.h | 2 +-
system/physmem.c | 38 +++++++++++++++++++++++++++++---------
3 files changed, 31 insertions(+), 11 deletions(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790433..f89568bfa3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1288,7 +1288,7 @@ static void kvm_unpoison_all(void *param)
QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
QLIST_REMOVE(page, list);
- qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+ qemu_ram_remap(page->ram_addr);
g_free(page);
}
}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6985..3771b2130c 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
/* memory API */
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
/* This should not be used by devices. */
ram_addr_t qemu_ram_addr_from_host(void *ptr);
ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index c76503aea8..3dd2adde73 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,17 +2167,35 @@ void qemu_ram_free(RAMBlock *block)
}
#ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+/*
+ * qemu_ram_remap - remap a single RAM page
+ *
+ * @addr: address in ram_addr_t address space.
+ *
+ * This function will try remapping a single page of guest RAM identified by
+ * @addr, essentially discarding memory to recover from previously poisoned
+ * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
+ * does not have to point at the start of the page.
+ *
+ * This function is only to be used during system resets; it will kill the
+ * VM if remapping failed.
+ */
+void qemu_ram_remap(ram_addr_t addr)
{
RAMBlock *block;
- ram_addr_t offset;
+ uint64_t offset;
int flags;
void *area, *vaddr;
int prot;
+ size_t page_size;
RAMBLOCK_FOREACH(block) {
offset = addr - block->offset;
if (offset < block->max_length) {
+ /* Respect the pagesize of our RAMBlock */
+ page_size = qemu_ram_pagesize(block);
+ offset = QEMU_ALIGN_DOWN(offset, page_size);
+
vaddr = ramblock_ptr(block, offset);
if (block->flags & RAM_PREALLOC) {
;
@@ -2191,21 +2209,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
prot = PROT_READ;
prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
if (block->fd >= 0) {
- area = mmap(vaddr, length, prot, flags, block->fd,
+ area = mmap(vaddr, page_size, prot, flags, block->fd,
offset + block->fd_offset);
} else {
flags |= MAP_ANONYMOUS;
- area = mmap(vaddr, length, prot, flags, -1, 0);
+ area = mmap(vaddr, page_size, prot, flags, -1, 0);
}
if (area != vaddr) {
- error_report("Could not remap addr: "
- RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
- length, addr);
+ error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
+ " +%zx", block->idstr, offset,
+ block->fd_offset, page_size);
exit(1);
}
- memory_try_enable_merging(vaddr, length);
- qemu_ram_setup_dump(vaddr, length);
+ memory_try_enable_merging(vaddr, page_size);
+ qemu_ram_setup_dump(vaddr, page_size);
}
+
+ break;
}
}
}
--
2.43.5
* Re: [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()
2025-02-01 9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() William Roche
@ 2025-02-04 17:09 ` Peter Xu
0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:09 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Sat, Feb 01, 2025 at 09:57:21AM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> The list of hwpoison pages used to remap the memory on reset
> is based on the backend real page size.
> To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
> hugetlb page; hugetlb pages cannot be partially mapped.
>
> Signed-off-by: William Roche <william.roche@oracle.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
2025-02-01 9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-04 17:09 ` Peter Xu
2025-02-01 9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page William Roche
` (3 subsequent siblings)
5 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: William Roche <william.roche@oracle.com>
Repair poisoned memory location(s), calling ram_block_discard_range():
punching a hole in the backend file when necessary and regenerating
usable memory.
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).
Signed-off-by: William Roche <william.roche@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
1 file changed, 36 insertions(+), 22 deletions(-)
diff --git a/system/physmem.c b/system/physmem.c
index 3dd2adde73..e8ff930bc9 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,6 +2167,23 @@ void qemu_ram_free(RAMBlock *block)
}
#ifndef _WIN32
+/* Simply remap the given VM memory location from start to start+length */
+static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
+{
+ int flags, prot;
+ void *area;
+ void *host_startaddr = block->host + start;
+
+ assert(block->fd < 0);
+ flags = MAP_FIXED | MAP_ANONYMOUS;
+ flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+ flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+ prot = PROT_READ;
+ prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+ area = mmap(host_startaddr, length, prot, flags, -1, 0);
+ return area != host_startaddr ? -errno : 0;
+}
+
/*
* qemu_ram_remap - remap a single RAM page
*
@@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
{
RAMBlock *block;
uint64_t offset;
- int flags;
- void *area, *vaddr;
- int prot;
+ void *vaddr;
size_t page_size;
RAMBLOCK_FOREACH(block) {
@@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
;
} else if (xen_enabled()) {
abort();
- } else {
- flags = MAP_FIXED;
- flags |= block->flags & RAM_SHARED ?
- MAP_SHARED : MAP_PRIVATE;
- flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
- prot = PROT_READ;
- prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
- if (block->fd >= 0) {
- area = mmap(vaddr, page_size, prot, flags, block->fd,
- offset + block->fd_offset);
- } else {
- flags |= MAP_ANONYMOUS;
- area = mmap(vaddr, page_size, prot, flags, -1, 0);
- }
- if (area != vaddr) {
- error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
- " +%zx", block->idstr, offset,
- block->fd_offset, page_size);
- exit(1);
+ if (ram_block_discard_range(block, offset, page_size) != 0) {
+ /*
+ * Fall back to using mmap() only for anonymous mapping,
+ * as if a backing file is associated we may not be able
+ * to recover the memory in all cases.
+ * So don't take the risk of using only mmap and fail now.
+ */
+ if (block->fd >= 0) {
+ error_report("Could not remap RAM %s:%" PRIx64 "+%"
+ PRIx64 " +%zx", block->idstr, offset,
+ block->fd_offset, page_size);
+ exit(1);
+ }
+ if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
+ error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
+ block->idstr, offset, page_size);
+ exit(1);
+ }
}
memory_try_enable_merging(vaddr, page_size);
qemu_ram_setup_dump(vaddr, page_size);
--
2.43.5
* Re: [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
2025-02-01 9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot William Roche
@ 2025-02-04 17:09 ` Peter Xu
2025-02-05 16:27 ` William Roche
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:09 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Sat, Feb 01, 2025 at 09:57:22AM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> Repair poisoned memory location(s), calling ram_block_discard_range():
> punching a hole in the backend file when necessary and regenerating
> a usable memory.
> If the kernel doesn't support the madvise calls used by this function
> and we are dealing with anonymous memory, fall back to remapping the
> location(s).
>
> Signed-off-by: William Roche <william.roche@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
> 1 file changed, 36 insertions(+), 22 deletions(-)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 3dd2adde73..e8ff930bc9 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2167,6 +2167,23 @@ void qemu_ram_free(RAMBlock *block)
> }
>
> #ifndef _WIN32
> +/* Simply remap the given VM memory location from start to start+length */
> +static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
> +{
> + int flags, prot;
> + void *area;
> + void *host_startaddr = block->host + start;
> +
> + assert(block->fd < 0);
> + flags = MAP_FIXED | MAP_ANONYMOUS;
> + flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
> + flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> + prot = PROT_READ;
> + prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> + area = mmap(host_startaddr, length, prot, flags, -1, 0);
> + return area != host_startaddr ? -errno : 0;
> +}
> +
> /*
> * qemu_ram_remap - remap a single RAM page
> *
> @@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
> {
> RAMBlock *block;
> uint64_t offset;
> - int flags;
> - void *area, *vaddr;
> - int prot;
> + void *vaddr;
> size_t page_size;
>
> RAMBLOCK_FOREACH(block) {
> @@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
> ;
> } else if (xen_enabled()) {
> abort();
> - } else {
Do we need to keep this line? Otherwise it looks to me the new code won't
be executed at all in !xen..
> - flags = MAP_FIXED;
> - flags |= block->flags & RAM_SHARED ?
> - MAP_SHARED : MAP_PRIVATE;
> - flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> - prot = PROT_READ;
> - prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> - if (block->fd >= 0) {
> - area = mmap(vaddr, page_size, prot, flags, block->fd,
> - offset + block->fd_offset);
> - } else {
> - flags |= MAP_ANONYMOUS;
> - area = mmap(vaddr, page_size, prot, flags, -1, 0);
> - }
> - if (area != vaddr) {
> - error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
> - " +%zx", block->idstr, offset,
> - block->fd_offset, page_size);
> - exit(1);
> + if (ram_block_discard_range(block, offset, page_size) != 0) {
> + /*
> + * Fall back to using mmap() only for anonymous mapping,
> + * as if a backing file is associated we may not be able
> + * to recover the memory in all cases.
> + * So don't take the risk of using only mmap and fail now.
> + */
> + if (block->fd >= 0) {
> + error_report("Could not remap RAM %s:%" PRIx64 "+%"
> + PRIx64 " +%zx", block->idstr, offset,
> + block->fd_offset, page_size);
> + exit(1);
> + }
> + if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
> + error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
> + block->idstr, offset, page_size);
> + exit(1);
> + }
> }
> memory_try_enable_merging(vaddr, page_size);
> qemu_ram_setup_dump(vaddr, page_size);
> --
> 2.43.5
>
--
Peter Xu
* Re: [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot
2025-02-04 17:09 ` Peter Xu
@ 2025-02-05 16:27 ` William Roche
0 siblings, 0 replies; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
To: Peter Xu
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On 2/4/25 18:09, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:22AM +0000, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Repair poisoned memory location(s), calling ram_block_discard_range():
>> punching a hole in the backend file when necessary and regenerating
>> a usable memory.
>> If the kernel doesn't support the madvise calls used by this function
>> and we are dealing with anonymous memory, fall back to remapping the
>> location(s).
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> ---
>> system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
>> 1 file changed, 36 insertions(+), 22 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 3dd2adde73..e8ff930bc9 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> ...
>> /*
>> * qemu_ram_remap - remap a single RAM page
>> *
>> @@ -2184,9 +2201,7 @@ void qemu_ram_remap(ram_addr_t addr)
>> {
>> RAMBlock *block;
>> uint64_t offset;
>> - int flags;
>> - void *area, *vaddr;
>> - int prot;
>> + void *vaddr;
>> size_t page_size;
>>
>> RAMBLOCK_FOREACH(block) {
>> @@ -2201,25 +2216,24 @@ void qemu_ram_remap(ram_addr_t addr)
>> ;
>> } else if (xen_enabled()) {
>> abort();
>> - } else {
>
> Do we need to keep this line? Otherwise it looks to me the new code won't
> be executed at all in !xen..
Very true, this is a mistake I made in a very recent version - sorry!
I'll fix it in my next version.
Thanks.
* [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
2025-02-01 9:57 ` [PATCH v7 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap() William Roche
2025-02-01 9:57 ` [PATCH v7 2/6] system/physmem: poisoned memory discard on reboot William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-04 17:01 ` Peter Xu
2025-02-01 9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() William Roche
` (2 subsequent siblings)
5 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: William Roche <william.roche@oracle.com>
In case of a large page impacted by a memory error, provide
information about the impacted large page before the memory
error injection message.
This message would also appear on RAS-enabled ARM platforms, with
the introduction of an error injection message similar to the x86 one.
When a large page is impacted, we now report:
Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
The +<fd_offset> information is only provided with a file backend.
Signed-off-by: William Roche <william.roche@oracle.com>
---
accel/kvm/kvm-all.c | 18 ++++++++++++++++++
include/exec/cpu-common.h | 10 ++++++++++
system/physmem.c | 22 ++++++++++++++++++++++
target/arm/kvm.c | 3 +++
4 files changed, 53 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f89568bfa3..9a0d970ce1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1296,6 +1296,24 @@ static void kvm_unpoison_all(void *param)
void kvm_hwpoison_page_add(ram_addr_t ram_addr)
{
HWPoisonPage *page;
+ struct RAMBlockInfo rb_info;
+
+ if (qemu_ram_block_info_from_addr(ram_addr, &rb_info)) {
+ size_t ps = rb_info.page_size;
+
+ if (ps > TARGET_PAGE_SIZE) {
+ uint64_t offset = QEMU_ALIGN_DOWN(ram_addr - rb_info.offset, ps);
+
+ if (rb_info.fd >= 0) {
+ error_report("Memory Error on large page from %s:%" PRIx64
+ "+%" PRIx64 " +%zx", rb_info.idstr, offset,
+ rb_info.fd_offset, ps);
+ } else {
+ error_report("Memory Error on large page from %s:%" PRIx64
+ " +%zx", rb_info.idstr, offset, ps);
+ }
+ }
+ }
QLIST_FOREACH(page, &hwpoison_page_list, list) {
if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 3771b2130c..190bd4f34a 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -110,6 +110,16 @@ int qemu_ram_get_fd(RAMBlock *rb);
size_t qemu_ram_pagesize(RAMBlock *block);
size_t qemu_ram_pagesize_largest(void);
+struct RAMBlockInfo {
+ char idstr[256];
+ ram_addr_t offset;
+ int fd;
+ uint64_t fd_offset;
+ size_t page_size;
+};
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+ struct RAMBlockInfo *block);
+
/**
* cpu_address_space_init:
* @cpu: CPU to add this address space to
diff --git a/system/physmem.c b/system/physmem.c
index e8ff930bc9..686f569270 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1678,6 +1678,28 @@ size_t qemu_ram_pagesize_largest(void)
return largest;
}
+/* Copy RAMBlock information associated to the given ram_addr location */
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+ struct RAMBlockInfo *b_info)
+{
+ RAMBlock *rb;
+
+ assert(b_info);
+
+ RCU_READ_LOCK_GUARD();
+ rb = qemu_get_ram_block(ram_addr);
+ if (!rb) {
+ return false;
+ }
+
+ pstrcat(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
+ b_info->offset = rb->offset;
+ b_info->fd = rb->fd;
+ b_info->fd_offset = rb->fd_offset;
+ b_info->page_size = rb->page_size;
+ return true;
+}
+
static int memory_try_enable_merging(void *addr, size_t len)
{
if (!machine_mem_merge(current_machine)) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index da30bdbb23..d9dedc6d74 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2389,6 +2389,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
kvm_cpu_synchronize_state(c);
if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
kvm_inject_arm_sea(c);
+ error_report("Guest Memory Error at QEMU addr %p and "
+ "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
+ addr, paddr, "BUS_MCEERR_AR");
} else {
error_report("failed to record the error");
abort();
--
2.43.5
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-01 9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page William Roche
@ 2025-02-04 17:01 ` Peter Xu
2025-02-05 16:27 ` William Roche
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:01 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Sat, Feb 01, 2025 at 09:57:23AM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> In case of a large page impacted by a memory error, provide an
> information about the impacted large page before the memory
> error injection message.
>
> This message would also appear on ras enabled ARM platforms, with
> the introduction of an x86 similar error injection message.
>
> In the case of a large page impacted, we now report:
> Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
>
> The +<fd_offset> information is only provided with a file backend.
>
> Signed-off-by: William Roche <william.roche@oracle.com>
This is still a pretty kvm / arch relevant patch that needs some reviews.
I wonder whether we really need this - we could fetch the ramblock mapping
(e.g. hwaddr -> HVA) via HMP "info ramblock", and dmesg also shows the process
ID + VA. IIUC we have all of the info below already, as long as we do some math
based on the above. Would that work too?
> ---
> accel/kvm/kvm-all.c | 18 ++++++++++++++++++
> include/exec/cpu-common.h | 10 ++++++++++
> system/physmem.c | 22 ++++++++++++++++++++++
> target/arm/kvm.c | 3 +++
> 4 files changed, 53 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index f89568bfa3..9a0d970ce1 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1296,6 +1296,24 @@ static void kvm_unpoison_all(void *param)
> void kvm_hwpoison_page_add(ram_addr_t ram_addr)
> {
> HWPoisonPage *page;
> + struct RAMBlockInfo rb_info;
> +
> + if (qemu_ram_block_info_from_addr(ram_addr, &rb_info)) {
> + size_t ps = rb_info.page_size;
> +
> + if (ps > TARGET_PAGE_SIZE) {
> + uint64_t offset = QEMU_ALIGN_DOWN(ram_addr - rb_info.offset, ps);
> +
> + if (rb_info.fd >= 0) {
> + error_report("Memory Error on large page from %s:%" PRIx64
> + "+%" PRIx64 " +%zx", rb_info.idstr, offset,
> + rb_info.fd_offset, ps);
> + } else {
> + error_report("Memory Error on large page from %s:%" PRIx64
> + " +%zx", rb_info.idstr, offset, ps);
> + }
> + }
> + }
>
> QLIST_FOREACH(page, &hwpoison_page_list, list) {
> if (page->ram_addr == ram_addr) {
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 3771b2130c..190bd4f34a 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -110,6 +110,16 @@ int qemu_ram_get_fd(RAMBlock *rb);
> size_t qemu_ram_pagesize(RAMBlock *block);
> size_t qemu_ram_pagesize_largest(void);
>
> +struct RAMBlockInfo {
> + char idstr[256];
> + ram_addr_t offset;
> + int fd;
> + uint64_t fd_offset;
> + size_t page_size;
> +};
> +bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
> + struct RAMBlockInfo *block);
> +
> /**
> * cpu_address_space_init:
> * @cpu: CPU to add this address space to
> diff --git a/system/physmem.c b/system/physmem.c
> index e8ff930bc9..686f569270 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1678,6 +1678,28 @@ size_t qemu_ram_pagesize_largest(void)
> return largest;
> }
>
> +/* Copy RAMBlock information associated to the given ram_addr location */
> +bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
> + struct RAMBlockInfo *b_info)
> +{
> + RAMBlock *rb;
> +
> + assert(b_info);
> +
> + RCU_READ_LOCK_GUARD();
> + rb = qemu_get_ram_block(ram_addr);
> + if (!rb) {
> + return false;
> + }
> +
> + pstrcat(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
> + b_info->offset = rb->offset;
> + b_info->fd = rb->fd;
> + b_info->fd_offset = rb->fd_offset;
> + b_info->page_size = rb->page_size;
> + return true;
> +}
> +
> static int memory_try_enable_merging(void *addr, size_t len)
> {
> if (!machine_mem_merge(current_machine)) {
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index da30bdbb23..d9dedc6d74 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -2389,6 +2389,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> kvm_cpu_synchronize_state(c);
> if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
> kvm_inject_arm_sea(c);
> + error_report("Guest Memory Error at QEMU addr %p and "
> + "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
> + addr, paddr, "BUS_MCEERR_AR");
> } else {
> error_report("failed to record the error");
> abort();
> --
> 2.43.5
>
--
Peter Xu
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-04 17:01 ` Peter Xu
@ 2025-02-05 16:27 ` William Roche
2025-02-05 17:07 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
To: Peter Xu
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On 2/4/25 18:01, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:23AM +0000, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> In case of a large page impacted by a memory error, provide an
>> information about the impacted large page before the memory
>> error injection message.
>>
>> This message would also appear on ras enabled ARM platforms, with
>> the introduction of an x86 similar error injection message.
>>
>> In the case of a large page impacted, we now report:
>> Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
>>
>> The +<fd_offset> information is only provided with a file backend.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>
> This is still pretty kvm / arch relevant patch that needs some reviews.
>
> I wonder do we really need this - we could fetch ramblock mapping
> (e.g. hwaddr -> HVA) via HMP "info ramblock", and also dmesg shows process
> ID + VA. IIUC we have all below info already as long as we do some math
> based on above. Would that work too?
The HMP command "info ramblock" is implemented with the
ram_block_format() function which returns a message buffer built with a
string for each ramblock (protected by the RCU_READ_LOCK_GUARD). Our new
function copies a struct with the necessary information.
Relying on the buffer format to retrieve the information doesn't seem
reasonable, and more importantly, this buffer doesn't provide all the
needed data, like fd and fd_offset.
I would say that ram_block_format() and qemu_ram_block_info_from_addr()
serve 2 different goals.
(a reimplementation of ram_block_format() on top of an adapted version of
qemu_ram_block_info_from_addr() that takes the extra information needed
would be doable, for example, but may not be worth doing for now)
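For illustration, this is roughly how a caller would consume the struct-based
API added in patch 3; report_large_page_error() is just a made-up name here,
the real caller in patch 3 is kvm_hwpoison_page_add():

/*
 * Illustrative sketch only (QEMU tree context assumed): the caller reads
 * plain struct fields instead of parsing the ram_block_format() text buffer.
 */
#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/cpu-common.h"

static void report_large_page_error(ram_addr_t ram_addr)
{
    struct RAMBlockInfo info;

    if (!qemu_ram_block_info_from_addr(ram_addr, &info)) {
        return;
    }
    if (info.page_size > TARGET_PAGE_SIZE) {
        /* fd, fd_offset and page_size come straight from the RAMBlock */
        error_report("large page (0x%zx) from %s, fd=%d, fd_offset=0x%" PRIx64,
                     info.page_size, info.idstr, info.fd, info.fd_offset);
    }
}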
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-05 16:27 ` William Roche
@ 2025-02-05 17:07 ` Peter Xu
2025-02-07 18:02 ` William Roche
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-05 17:07 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
> On 2/4/25 18:01, Peter Xu wrote:
> > On Sat, Feb 01, 2025 at 09:57:23AM +0000, William Roche wrote:
> > > From: William Roche <william.roche@oracle.com>
> > >
> > > In case of a large page impacted by a memory error, provide an
> > > information about the impacted large page before the memory
> > > error injection message.
> > >
> > > This message would also appear on ras enabled ARM platforms, with
> > > the introduction of an x86 similar error injection message.
> > >
> > > In the case of a large page impacted, we now report:
> > > Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>
> > >
> > > The +<fd_offset> information is only provided with a file backend.
> > >
> > > Signed-off-by: William Roche <william.roche@oracle.com>
> >
> > This is still pretty kvm / arch relevant patch that needs some reviews.
> >
> > I wonder do we really need this - we could fetch ramblock mapping
> > (e.g. hwaddr -> HVA) via HMP "info ramblock", and also dmesg shows process
> > ID + VA. IIUC we have all below info already as long as we do some math
> > based on above. Would that work too?
>
> The HMP command "info ramblock" is implemented with the ram_block_format()
> function which returns a message buffer built with a string for each
> ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
> struct with the necessary information.
>
> Relaying on the buffer format to retrieve the information doesn't seem
> reasonable, and more importantly, this buffer doesn't provide all the needed
> data, like fd and fd_offset.
>
> I would say that ram_block_format() and qemu_ram_block_info_from_addr()
> serve 2 different goals.
>
> (a reimplementation of ram_block_format() with an adapted version of
> qemu_ram_block_info_from_addr() taking the extra information needed could be
> doable for example, but may not be worth doing for now)
IIUC admin should be aware of fd_offset because the admin should be fully
aware of the start offset of FDs to specify in qemu cmdlines, or in
Libvirt. But yes, we can always add fd_offset into ram_block_format() if
it's helpful.
Besides, the existing issues with this patch:
- From the outcome of this patch, it introduces one ramblock API (which is ok
to me, so far) to do some error_report()s. It looks pretty much like
debugging rather than something serious (e.g. reporting via QMP queries,
QMP events etc.). From a debug POV, I still don't see why this is
needed, as discussed above.
- From a merge POV, this patch isn't a pure memory change, so I'll need to
get acks from other maintainers; at least that should be how it works.
I feel like when hwpoison becomes a serious topic, we need a more
serious reporting facility than error reports, so we could have this
as a separate topic to be revisited. That might unblock your prior patches
from waiting on this.
Thanks,
--
Peter Xu
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-05 17:07 ` Peter Xu
@ 2025-02-07 18:02 ` William Roche
2025-02-10 16:48 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-07 18:02 UTC (permalink / raw)
To: Peter Xu, david
Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
On 2/5/25 18:07, Peter Xu wrote:
> On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
>> [...]
>> The HMP command "info ramblock" is implemented with the ram_block_format()
>> function which returns a message buffer built with a string for each
>> ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
>> struct with the necessary information.
>>
>> Relaying on the buffer format to retrieve the information doesn't seem
>> reasonable, and more importantly, this buffer doesn't provide all the needed
>> data, like fd and fd_offset.
>>
>> I would say that ram_block_format() and qemu_ram_block_info_from_addr()
>> serve 2 different goals.
>>
>> (a reimplementation of ram_block_format() with an adapted version of
>> qemu_ram_block_info_from_addr() taking the extra information needed could be
>> doable for example, but may not be worth doing for now)
>
> IIUC admin should be aware of fd_offset because the admin should be fully
> aware of the start offset of FDs to specify in qemu cmdlines, or in
> Libvirt. But yes, we can always add fd_offset into ram_block_format() if
> it's helpful.
>
> Besides, the existing issues on this patch:
>
> - From outcome of this patch, it introduces one ramblock API (which is ok
> to me, so far), to do some error_report()s. It looks pretty much for
> debugging rather than something serious (e.g. report via QMP queries,
> QMP events etc.). From debug POV, I still don't see why this is
> needed.. per discussed above.
The reason I want to inform the user of a large memory failure more
specifically than of a standard-sized page loss is the significant
difference in behavior: our current implementation can transparently
handle many situations without necessarily leading the VM to a crash.
But when it comes to large pages, there is no mechanism to inform the VM
of a large memory loss, and usually this situation leads the VM to
crash; it can also generate some odd situations, like QEMU itself
crashing or a loop of errors, for example.
So having a message reporting such a memory loss helps to understand a
more drastic VM or QEMU behavior -- it increases the diagnosability of
our code.
To verify that a SIGBUS appeared because of a large page loss, we
currently need to check the page_size of the targeted memory block's
backend. We should normally get this information from the SIGBUS siginfo
data (its si_addr_lsb field indicates the page size), but a KVM weakness,
the hardcoded si_addr_lsb=PAGE_SHIFT value in the SIGBUS siginfo returned
by the kernel, prevents that: see the kvm_send_hwpoison_signal()
function.
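Purely for illustration, this is the kind of helper that could read the page
size straight from the siginfo if the kernel filled si_addr_lsb correctly;
hwpoison_page_size_from_siginfo() is a hypothetical name, not part of this
series:

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical sketch: derive the poisoned page size from a memory error
 * SIGBUS. With x86 KVM's kvm_send_hwpoison_signal(), si_addr_lsb is
 * hardcoded to PAGE_SHIFT, so today this would always return the base
 * page size, which is why the series queries the RAMBlock instead.
 */
static size_t hwpoison_page_size_from_siginfo(const siginfo_t *si)
{
    if (si->si_signo == SIGBUS &&
        (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)) {
        return (size_t)1 << si->si_addr_lsb;
    }
    return 0; /* not a memory error SIGBUS */
}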
So I first wrote a small API addition called
qemu_ram_pagesize_from_addr() to retrieve only this page_size value from
the impacted address; and later on, this function turned into the richer
qemu_ram_block_info_from_addr() function to have the generated messages
match the existing memory messages as rightly requested by David.
So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(),
and the second reason is to have richer error messages.
> - From merge POV, this patch isn't a pure memory change, so I'll need to
> get ack from other maintainers, at least that should be how it works..
I agree :)
>
> I feel like when hwpoison becomes a serious topic, we need some more
> serious reporting facility than error reports. So that we could have this
> as separate topic to be revisited. It might speed up your prior patches
> from not being blocked on this.
I explained why I think that error messages are important, but I don't
want to get blocked on fixing the hugepage memory recovery because of that.
If you think that not displaying a specific message for large page loss
can help to get the recovery fixed, than I can change my proposal to do so.
Early next week, I'll send a simplified version of my first 3 patches
without these specific messages and without the preallocation handling in
all remap cases, so you can evaluate this possibility.
Thank again for your feedback
William.
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-07 18:02 ` William Roche
@ 2025-02-10 16:48 ` Peter Xu
2025-02-11 21:22 ` William Roche
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-10 16:48 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
> On 2/5/25 18:07, Peter Xu wrote:
> > On Wed, Feb 05, 2025 at 05:27:13PM +0100, William Roche wrote:
> > > [...]
> > > The HMP command "info ramblock" is implemented with the ram_block_format()
> > > function which returns a message buffer built with a string for each
> > > ramblock (protected by the RCU_READ_LOCK_GUARD). Our new function copies a
> > > struct with the necessary information.
> > >
> > > Relaying on the buffer format to retrieve the information doesn't seem
> > > reasonable, and more importantly, this buffer doesn't provide all the needed
> > > data, like fd and fd_offset.
> > >
> > > I would say that ram_block_format() and qemu_ram_block_info_from_addr()
> > > serve 2 different goals.
> > >
> > > (a reimplementation of ram_block_format() with an adapted version of
> > > qemu_ram_block_info_from_addr() taking the extra information needed could be
> > > doable for example, but may not be worth doing for now)
> >
> > IIUC admin should be aware of fd_offset because the admin should be fully
> > aware of the start offset of FDs to specify in qemu cmdlines, or in
> > Libvirt. But yes, we can always add fd_offset into ram_block_format() if
> > it's helpful.
> >
> > Besides, the existing issues on this patch:
> >
> > - From outcome of this patch, it introduces one ramblock API (which is ok
> > to me, so far), to do some error_report()s. It looks pretty much for
> > debugging rather than something serious (e.g. report via QMP queries,
> > QMP events etc.). From debug POV, I still don't see why this is
> > needed.. per discussed above.
>
> The reason why I want to inform the user of a large memory failure more
> specifically than a standard sized page loss is because of the significant
> behavior difference: Our current implementation can transparently handle
> many situations without necessarily leading the VM to a crash. But when it
> comes to large pages, there is no mechanism to inform the VM of a large
> memory loss, and usually this situation leads the VM to crash, and can also
> generate some weird situations like qemu itself crashing or a loop of
> errors, for example.
>
> So having a message informing of such a memory loss can help to understand a
> more radical VM or qemu behavior -- it increases the diagnosability of our
> code.
>
> To verify that a SIGBUS appeared because of a large page loss, we currently
> need to verify the targeted memory block backend page_size.
> We should usually get this information from the SIGBUS siginfo data (with a
> si_addr_lsb field giving an indication of the page size) but a KVM weakness
> with a hardcoded si_addr_lsb=PAGE_SHIFT value in the SIGBUS siginfo returned
> from the kernel prevents that: See kvm_send_hwpoison_signal() function.
>
> So I first wrote a small API addition called qemu_ram_pagesize_from_addr()
> to retrieve only this page_size value from the impacted address; and later
> on, this function turned into the richer qemu_ram_block_info_from_addr()
> function to have the generated messages match the existing memory messages
> as rightly requested by David.
>
> So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
> the second reason is to have richer error messages.
This seems true, and I also remember something when I looked at this
previously but maybe nobody tried to fix it. ARM seems to be correct on
that field, otoh.
Is it possible we fix KVM on x86?
kvm_handle_error_pfn() has the fault context, so IIUC it should be able to
figure that out too like what ARM does (with get_vma_page_shift()).
>
>
>
> > - From merge POV, this patch isn't a pure memory change, so I'll need to
> > get ack from other maintainers, at least that should be how it works..
>
> I agree :)
>
> >
> > I feel like when hwpoison becomes a serious topic, we need some more
> > serious reporting facility than error reports. So that we could have this
> > as separate topic to be revisited. It might speed up your prior patches
> > from not being blocked on this.
>
> I explained why I think that error messages are important, but I don't want
> to get blocked on fixing the hugepage memory recovery because of that.
What is the major benefit of reporting to QEMU's stderr in this case?
For example, how should we consume the error reports that this patch
introduces? Is it still for debugging purposes?
I agree it's always better to dump something in QEMU when this happens,
but IIUC what I mentioned above (monitoring QEMU ramblock setups, and
monitoring host dmesg for any vaddr reported as hwpoisoned) should also allow
anyone to deduce the page size of the affected vaddr, especially for debugging
purposes. However, I could possibly have missed the goal here.
>
> If you think that not displaying a specific message for large page loss can
> help to get the recovery fixed, than I can change my proposal to do so.
>
> Early next week, I'll send a simplified version of my first 3 patches
> without this specific messages and without the preallocation handling in all
> remap cases, so you can evaluate this possibility.
Yes IMHO it'll always be helpful to separate it if possible.
Thanks,
--
Peter Xu
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-10 16:48 ` Peter Xu
@ 2025-02-11 21:22 ` William Roche
2025-02-11 21:45 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-11 21:22 UTC (permalink / raw)
To: Peter Xu
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On 2/10/25 17:48, Peter Xu wrote:
> On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
>> [...]
>> So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
>> the second reason is to have richer error messages.
>
> This seems true, and I also remember something when I looked at this
> previously but maybe nobody tried to fix it. ARM seems to be correct on
> that field, otoh.
>
> Is it possible we fix KVM on x86?
Yes, very probably, and it would be a kernel fix.
This kernel modification would first need to be running on the hypervisor
before new QEMU code could use the SIGBUS siginfo information to identify
the size of the impacted page (instead of using an internal addition to
the KVM API).
But this mechanism could help to generate a large-page-specific memory
error message when the SIGBUS is received.
>>>
>>> I feel like when hwpoison becomes a serious topic, we need some more
>>> serious reporting facility than error reports. So that we could have this
>>> as separate topic to be revisited. It might speed up your prior patches
>>> from not being blocked on this.
>>
>> I explained why I think that error messages are important, but I don't want
>> to get blocked on fixing the hugepage memory recovery because of that.
>
> What is the major benefit of reporting in QEMU's stderr in this case?
Such messages can be collected into a VM-specific log file, as any other
error_report() message, like the existing x86 error injection messages
reported by QEMU.
These messages should help the administrator to better understand the
behavior of the VM.
> For example, how should we consume the error reports that this patch
> introduces? Is it still for debugging purpose?
It's not only debugging, but a trace of a significant event that can
have major consequences for the VM.
>
> I agree it's always better to dump something in QEMU when such happened,
> but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
> monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
> to deduce the page size of affected vaddr, especially if it's for debugging
> purpose. However I could possibly have missed the goal here..
You're right that, knowing the address, the administrator can deduce
which memory area was impacted and the associated page size. But the goal
of these large-page-specific messages was to give details on the event
type and immediately qualify the consequences.
Using large pages can also have drawbacks, and a large-page-specific
message on a memory error makes that more obvious! It is not only a debug
message, but an indication that the VM lost an unusually large amount of
its memory.
>>
>> If you think that not displaying a specific message for large page loss can
>> help to get the recovery fixed, than I can change my proposal to do so.
>>
>> Early next week, I'll send a simplified version of my first 3 patches
>> without this specific messages and without the preallocation handling in all
>> remap cases, so you can evaluate this possibility.
>
> Yes IMHO it'll always be helpful to separate it if possible.
I'm now sending a v8 version, without the specific messages and the
remap notification. It should fix the main recovery bug we currently
have. More messages and a notification dealing with pre-allocation can
be added in a second step.
Please let me know if this v8 version can be integrated without the
prealloc handling and the specific messages?
Thanks,
William.
* Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
2025-02-11 21:22 ` William Roche
@ 2025-02-11 21:45 ` Peter Xu
0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-11 21:45 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Tue, Feb 11, 2025 at 10:22:38PM +0100, William Roche wrote:
> On 2/10/25 17:48, Peter Xu wrote:
> > On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:
> > > [...]
> > > So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
> > > the second reason is to have richer error messages.
> >
> > This seems true, and I also remember something when I looked at this
> > previously but maybe nobody tried to fix it. ARM seems to be correct on
> > that field, otoh.
> >
> > Is it possible we fix KVM on x86?
>
> Yes, very probably, and it would be a kernel fix.
> This kernel modification would be needed to run on the hypervisor first to
> influence a new code in qemu able to use the SIGBUS siginfo information and
> identify the size of the page impacted (instead of using an internal
> addition to kvm API).
> But this mechanism could help to generate a large page memory error specific
> message on SIGBUS receiving.
Yes, QEMU should probably still be able to work on both old and new
kernels, even if this gets fixed.
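For illustration only, the runtime check could look something like this;
hwpoison_page_size() is a made-up helper and the exact fallback path is an
assumption, not something this series implements:

/*
 * Sketch (QEMU tree context assumed): trust si_addr_lsb when it reports
 * more than a base page (fixed kernel), otherwise fall back to the
 * RAMBlock page size, which is what this series relies on today.
 */
#include "qemu/osdep.h"
#include "exec/cpu-common.h"
#include <signal.h>

static size_t hwpoison_page_size(const siginfo_t *si, ram_addr_t ram_addr)
{
    size_t ps = (size_t)1 << si->si_addr_lsb;
    struct RAMBlockInfo info;

    if (ps > qemu_real_host_page_size()) {
        return ps;                     /* new kernel: siginfo is precise */
    }
    if (qemu_ram_block_info_from_addr(ram_addr, &info)) {
        return info.page_size;         /* old kernel: ask the backend */
    }
    return qemu_real_host_page_size();
}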
>
>
> > > >
> > > > I feel like when hwpoison becomes a serious topic, we need some more
> > > > serious reporting facility than error reports. So that we could have this
> > > > as separate topic to be revisited. It might speed up your prior patches
> > > > from not being blocked on this.
> > >
> > > I explained why I think that error messages are important, but I don't want
> > > to get blocked on fixing the hugepage memory recovery because of that.
> >
> > What is the major benefit of reporting in QEMU's stderr in this case?
>
> Such messages can be collected into VM specific log file, as any other
> error_report() message, like the existing x86 error injection messages
> reported by Qemu.
> This messages should help the administrator to better understand the
> behavior of the VM.
I'll still put "better understand the behavior of the VM" into the
debugging category. :)
But I agree such messages can be important information. That's also why I
was curious whether it should be something like a QMP event instead.
That's a much more formal way of sending important messages.
>
>
> > For example, how should we consume the error reports that this patch
> > introduces? Is it still for debugging purpose?
>
> Its not only debugging, but it's a trace of a significant event that can
> have major consequences on the VM.
>
> >
> > I agree it's always better to dump something in QEMU when such happened,
> > but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
> > monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
> > to deduce the page size of affected vaddr, especially if it's for debugging
> > purpose. However I could possibly have missed the goal here..
>
> You're right that knowing the address, the administrator can deduce what
> memory area was impacted and the associated page size. But the goal of these
> large page specific messages was to give details on the event type and
> immediately qualify the consequences.
> Using large pages can also have drawbacks, and a large page specific message
> on memory error makes that more obvious ! Not only a debug msg, but an
> indication that the VM lost an unusually large amount of its memory.
>
> > >
> > > If you think that not displaying a specific message for large page loss can
> > > help to get the recovery fixed, than I can change my proposal to do so.
> > >
> > > Early next week, I'll send a simplified version of my first 3 patches
> > > without this specific messages and without the preallocation handling in all
> > > remap cases, so you can evaluate this possibility.
> >
> > Yes IMHO it'll always be helpful to separate it if possible.
>
> I'm sending now a v8 version, without the specific messages and the remap
> notification. It should fix the main recovery bug we currently have. More
> messages and a notification dealing with pre-allocation can be added in a
> second step.
>
> Please let me know if this v8 version can be integrated without the prealloc
> and specific messages ?
IMHO fixing the hugetlb page handling is still progress on its own, even
without any added error message or proactive allocation during reset.
One issue is that v8 still contains patch 3, which is for ARM KVM. You may
need to post it separately for the ARM maintainers to review & collect.
I'll be able to queue patches 1-2.
Thanks,
--
Peter Xu
* [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
` (2 preceding siblings ...)
2025-02-01 9:57 ` [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-04 17:17 ` Peter Xu
2025-02-01 9:57 ` [PATCH v7 5/6] hostmem: Factor out applying settings William Roche
2025-02-01 9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM William Roche
5 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: David Hildenbrand <david@redhat.com>
Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
hw/core/numa.c | 11 +++++++++++
include/exec/ramlist.h | 3 +++
system/physmem.c | 1 +
3 files changed, 15 insertions(+)
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 218576f745..003bcd8a66 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
}
}
}
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+ RAMBlockNotifier *notifier;
+
+ QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+ if (notifier->ram_block_remapped) {
+ notifier->ram_block_remapped(notifier, host, offset, size);
+ }
+ }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
size_t max_size);
void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
size_t new_size);
+ void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+ size_t size);
QLIST_ENTRY(RAMBlockNotifier) next;
};
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
void ram_block_notify_add(void *host, size_t size, size_t max_size);
void ram_block_notify_remove(void *host, size_t size, size_t max_size);
void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
GString *ram_block_format(void);
diff --git a/system/physmem.c b/system/physmem.c
index 686f569270..561b2c38c0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2259,6 +2259,7 @@ void qemu_ram_remap(ram_addr_t addr)
}
memory_try_enable_merging(vaddr, page_size);
qemu_ram_setup_dump(vaddr, page_size);
+ ram_block_notify_remap(block->host, offset, page_size);
}
break;
--
2.43.5
* Re: [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
2025-02-01 9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() William Roche
@ 2025-02-04 17:17 ` Peter Xu
2025-02-04 17:42 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:17 UTC (permalink / raw)
To: William Roche
Cc: david, kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson,
philmd, peter.maydell, mtosatti, imammedo, eduardo,
marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Sat, Feb 01, 2025 at 09:57:24AM +0000, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
>
> Notify registered listeners about the remap at the end of
> qemu_ram_remap() so e.g., a memory backend can re-apply its
> settings correctly.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
IIUC logically speaking we don't need a global remap notifier - here a
per-ramblock notifier looks more reasonable, like RAMBlock.resized().
It'll change the notify path from O(N**2) to O(N). After all, backend1's
notifier doesn't care about other ramblocks' remap() events, only its own.
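To illustrate the idea (simplified types; the remapped field is hypothetical
and only mirrors the existing resized() hook):

/*
 * Sketch only: a per-RAMBlock remap hook, so only the backend owning the
 * block is notified instead of walking the global notifier list.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct RAMBlockSketch {
    char idstr[256];
    void *host;
    void (*remapped)(const char *idstr, uint64_t offset, size_t size,
                     void *host);
} RAMBlockSketch;

static void notify_remap(RAMBlockSketch *block, uint64_t offset, size_t size)
{
    if (block->remapped) {
        block->remapped(block->idstr, offset, size, block->host);
    }
}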
It's not a huge deal, as I expect we don't have a huge number of ramblocks,
but it looks like this series will miss the recent pull anyway, so let me
comment on this one for consideration when you respin.
We could also merge partial of the series to fix hugetlb poisoning first,
as this one looks like can be separately done too.
Thanks,
--
Peter Xu
* Re: [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap()
2025-02-04 17:17 ` Peter Xu
@ 2025-02-04 17:42 ` David Hildenbrand
0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 17:42 UTC (permalink / raw)
To: Peter Xu, William Roche
Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
On 04.02.25 18:17, Peter Xu wrote:
> On Sat, Feb 01, 2025 at 09:57:24AM +0000, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Notify registered listeners about the remap at the end of
>> qemu_ram_remap() so e.g., a memory backend can re-apply its
>> settings correctly.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>
> IIUC logically speaking we don't need a global remap notifier - here a
> per-ramblock notifier looks more reasonable, like RAMBlock.resized().
Right. Note that qemu_ram_resize() also triggers global notifiers.
> It'll change the notify path from O(N**2) to O(N). After all, backend1's
> notifier won't care other ramblock's remap() events but only itself's.
>
> It's not a huge deal as I expect we don't have a huge amount of ramblocks,
> but looks like this series will miss the recent pull anyway.. so let me
> comment as so on this one for consideration when respin.
... and ram remap during reboot is not particularly the fast path we
care about (or should be caring about).
>
> We could also merge partial of the series to fix hugetlb poisoning first,
> as this one looks like can be separately done too.
hugetlb frequently uses preallocation, and the remaining patches in this
series make sure preallocation happens after remapping.
--
Cheers,
David / dhildenb
* [PATCH v7 5/6] hostmem: Factor out applying settings
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
` (3 preceding siblings ...)
2025-02-01 9:57 ` [PATCH v7 4/6] numa: Introduce and use ram_block_notify_remap() William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-01 9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM William Roche
5 siblings, 0 replies; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: David Hildenbrand <david@redhat.com>
We want to reuse the functionality when remapping RAM.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
1 file changed, 82 insertions(+), 73 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index bceca1a8d9..46d80f98b4 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
#endif
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+ void *ptr, uint64_t size,
+ Error **errp)
+{
+ bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+ if (backend->merge) {
+ qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+ }
+ if (!backend->dump) {
+ qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+ }
+#ifdef CONFIG_NUMA
+ unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+ /* lastbit == MAX_NODES means maxnode = 0 */
+ unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+ /*
+ * Ensure policy won't be ignored in case memory is preallocated
+ * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+ * this doesn't catch hugepage case.
+ */
+ unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+ int mode = backend->policy;
+
+ /*
+ * Check for invalid host-nodes and policies and give more verbose
+ * error messages than mbind().
+ */
+ if (maxnode && backend->policy == MPOL_DEFAULT) {
+ error_setg(errp, "host-nodes must be empty for policy default,"
+ " or you should explicitly specify a policy other"
+ " than default");
+ return;
+ } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+ error_setg(errp, "host-nodes must be set for policy %s",
+ HostMemPolicy_str(backend->policy));
+ return;
+ }
+
+ /*
+ * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+ * as argument to mbind() due to an old Linux bug (feature?) which
+ * cuts off the last specified node. This means backend->host_nodes
+ * must have MAX_NODES+1 bits available.
+ */
+ assert(sizeof(backend->host_nodes) >=
+ BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+ assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+ if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+ /*
+ * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+ * silently picks the first node.
+ */
+ mode = MPOL_PREFERRED_MANY;
+ }
+#endif
+
+ if (maxnode &&
+ mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+ if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+ error_setg_errno(errp, errno,
+ "cannot bind memory to host NUMA nodes");
+ return;
+ }
+ }
+#endif
+ /*
+ * Preallocate memory after the NUMA policy has been instantiated.
+ * This is necessary to guarantee memory is allocated with
+ * specified NUMA policy in place.
+ */
+ if (backend->prealloc &&
+ !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+ ptr, size, backend->prealloc_threads,
+ backend->prealloc_context, async, errp)) {
+ return;
+ }
+}
+
char *
host_memory_backend_get_name(HostMemoryBackend *backend)
{
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
void *ptr;
uint64_t sz;
size_t pagesize;
- bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
if (!bc->alloc) {
return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
return;
}
- if (backend->merge) {
- qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
- }
- if (!backend->dump) {
- qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
- }
-#ifdef CONFIG_NUMA
- unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
- /* lastbit == MAX_NODES means maxnode = 0 */
- unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
- /*
- * Ensure policy won't be ignored in case memory is preallocated
- * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
- * this doesn't catch hugepage case.
- */
- unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
- int mode = backend->policy;
-
- /* check for invalid host-nodes and policies and give more verbose
- * error messages than mbind(). */
- if (maxnode && backend->policy == MPOL_DEFAULT) {
- error_setg(errp, "host-nodes must be empty for policy default,"
- " or you should explicitly specify a policy other"
- " than default");
- return;
- } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
- error_setg(errp, "host-nodes must be set for policy %s",
- HostMemPolicy_str(backend->policy));
- return;
- }
-
- /*
- * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
- * as argument to mbind() due to an old Linux bug (feature?) which
- * cuts off the last specified node. This means backend->host_nodes
- * must have MAX_NODES+1 bits available.
- */
- assert(sizeof(backend->host_nodes) >=
- BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
- assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
- if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
- /*
- * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
- * silently picks the first node.
- */
- mode = MPOL_PREFERRED_MANY;
- }
-#endif
-
- if (maxnode &&
- mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
- if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
- error_setg_errno(errp, errno,
- "cannot bind memory to host NUMA nodes");
- return;
- }
- }
-#endif
- /*
- * Preallocate memory after the NUMA policy has been instantiated.
- * This is necessary to guarantee memory is allocated with
- * specified NUMA policy in place.
- */
- if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
- ptr, sz,
- backend->prealloc_threads,
- backend->prealloc_context,
- async, errp)) {
- return;
- }
+ host_memory_backend_apply_settings(backend, ptr, sz, errp);
}
static bool
--
2.43.5
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-01 9:57 [PATCH v7 0/6] Poisoned memory recovery on reboot William Roche
` (4 preceding siblings ...)
2025-02-01 9:57 ` [PATCH v7 5/6] hostmem: Factor out applying settings William Roche
@ 2025-02-01 9:57 ` William Roche
2025-02-04 17:50 ` David Hildenbrand
5 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-01 9:57 UTC (permalink / raw)
To: david, kvm, qemu-devel, qemu-arm
Cc: william.roche, peterx, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
From: William Roche <william.roche@oracle.com>
Let's register a RAM block notifier and react to remap notifications.
Simply re-apply the settings. Exit if something goes wrong.
Merging and dump settings are handled by the remap notification,
in addition to the memory policy and preallocation.
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
backends/hostmem.c | 34 ++++++++++++++++++++++++++++++++++
include/system/hostmem.h | 1 +
system/physmem.c | 4 ----
3 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 46d80f98b4..4589467c77 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
backend->prealloc_threads = value;
}
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+ size_t offset, size_t size)
+{
+ HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+ ram_notifier);
+ Error *err = NULL;
+
+ if (!host_memory_backend_mr_inited(backend) ||
+ memory_region_get_ram_ptr(&backend->mr) != host) {
+ return;
+ }
+
+ host_memory_backend_apply_settings(backend, host + offset, size, &err);
+ if (err) {
+ /*
+ * If memory settings can't be successfully applied on remap,
+ * don't take the risk to continue without them.
+ */
+ error_report_err(err);
+ exit(1);
+ }
+}
+
static void host_memory_backend_init(Object *obj)
{
HostMemoryBackend *backend = MEMORY_BACKEND(obj);
MachineState *machine = MACHINE(qdev_get_machine());
+ backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+ ram_block_notifier_add(&backend->ram_notifier);
+
/* TODO: convert access to globals to compat properties */
backend->merge = machine_mem_merge(machine);
backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +405,13 @@ static void host_memory_backend_post_init(Object *obj)
object_apply_compat_props(obj);
}
+static void host_memory_backend_finalize(Object *obj)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ ram_block_notifier_remove(&backend->ram_notifier);
+}
+
bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
{
/*
@@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
.instance_size = sizeof(HostMemoryBackend),
.instance_init = host_memory_backend_init,
.instance_post_init = host_memory_backend_post_init,
+ .instance_finalize = host_memory_backend_finalize,
.interfaces = (InterfaceInfo[]) {
{ TYPE_USER_CREATABLE },
{ }
diff --git a/include/system/hostmem.h b/include/system/hostmem.h
index 5c21ca55c0..170849e8a4 100644
--- a/include/system/hostmem.h
+++ b/include/system/hostmem.h
@@ -83,6 +83,7 @@ struct HostMemoryBackend {
HostMemPolicy policy;
MemoryRegion mr;
+ RAMBlockNotifier ram_notifier;
};
bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
diff --git a/system/physmem.c b/system/physmem.c
index 561b2c38c0..a047fd680e 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2223,7 +2223,6 @@ void qemu_ram_remap(ram_addr_t addr)
{
RAMBlock *block;
uint64_t offset;
- void *vaddr;
size_t page_size;
RAMBLOCK_FOREACH(block) {
@@ -2233,7 +2232,6 @@ void qemu_ram_remap(ram_addr_t addr)
page_size = qemu_ram_pagesize(block);
offset = QEMU_ALIGN_DOWN(offset, page_size);
- vaddr = ramblock_ptr(block, offset);
if (block->flags & RAM_PREALLOC) {
;
} else if (xen_enabled()) {
@@ -2257,8 +2255,6 @@ void qemu_ram_remap(ram_addr_t addr)
exit(1);
}
}
- memory_try_enable_merging(vaddr, page_size);
- qemu_ram_setup_dump(vaddr, page_size);
ram_block_notify_remap(block->host, offset, page_size);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-01 9:57 ` [PATCH v7 6/6] hostmem: Handle remapping of RAM William Roche
@ 2025-02-04 17:50 ` David Hildenbrand
2025-02-04 17:58 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 17:50 UTC (permalink / raw)
To: William Roche, kvm, qemu-devel, qemu-arm
Cc: peterx, pbonzini, richard.henderson, philmd, peter.maydell,
mtosatti, imammedo, eduardo, marcel.apfelbaum, wangyanan55,
zhao1.liu, joao.m.martins
> /*
> @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
> .instance_size = sizeof(HostMemoryBackend),
> .instance_init = host_memory_backend_init,
> .instance_post_init = host_memory_backend_post_init,
> + .instance_finalize = host_memory_backend_finalize,
> .interfaces = (InterfaceInfo[]) {
> { TYPE_USER_CREATABLE },
> { }
> diff --git a/include/system/hostmem.h b/include/system/hostmem.h
> index 5c21ca55c0..170849e8a4 100644
> --- a/include/system/hostmem.h
> +++ b/include/system/hostmem.h
> @@ -83,6 +83,7 @@ struct HostMemoryBackend {
> HostMemPolicy policy;
>
> MemoryRegion mr;
> + RAMBlockNotifier ram_notifier;
> };
Thinking about Peter's comment, it would be a nice improvement to have a
single global memory-backend notifier that looks up the matching memory
backend, instead of having one per memory backend.
A per-ramblock notifier might also be possible, but that's a bit
harder/awkward to configure: e.g., the resize callback is currently passed
to memory_region_init_resizeable_ram().
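Something like the following untested sketch is what I have in mind; the
look_up_backend_by_host() helper does not exist, it just stands in for
whatever lookup we would implement:
/* Sketch only: one global notifier shared by all memory backends. */
static void memory_backend_any_ram_remapped(RAMBlockNotifier *n, void *host,
                                            size_t offset, size_t size)
{
    /*
     * Hypothetical helper: walk the user-created memory backends and
     * return the one whose RAM starts at @host, or NULL if none does.
     */
    HostMemoryBackend *backend = look_up_backend_by_host(host);
    Error *err = NULL;

    if (!backend) {
        return;
    }
    host_memory_backend_apply_settings(backend, host + offset, size, &err);
    if (err) {
        error_report_err(err);
        exit(1);
    }
}

static RAMBlockNotifier memory_backend_remap_notifier = {
    .ram_block_remapped = memory_backend_any_ram_remapped,
};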
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-04 17:50 ` David Hildenbrand
@ 2025-02-04 17:58 ` Peter Xu
2025-02-04 18:55 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 17:58 UTC (permalink / raw)
To: David Hildenbrand
Cc: William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Tue, Feb 04, 2025 at 06:50:17PM +0100, David Hildenbrand wrote:
> > /*
> > @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
> > .instance_size = sizeof(HostMemoryBackend),
> > .instance_init = host_memory_backend_init,
> > .instance_post_init = host_memory_backend_post_init,
> > + .instance_finalize = host_memory_backend_finalize,
> > .interfaces = (InterfaceInfo[]) {
> > { TYPE_USER_CREATABLE },
> > { }
> > diff --git a/include/system/hostmem.h b/include/system/hostmem.h
> > index 5c21ca55c0..170849e8a4 100644
> > --- a/include/system/hostmem.h
> > +++ b/include/system/hostmem.h
> > @@ -83,6 +83,7 @@ struct HostMemoryBackend {
> > HostMemPolicy policy;
> > MemoryRegion mr;
> > + RAMBlockNotifier ram_notifier;
> > };
>
> Thinking about Peter's comment, it would be a nice improvement to have a
> single global memory-backend notifier that looks up the matching memory
> backend, instead of having one per memory backend.
Yes, this could also avoid O(N**2).
>
> A per-ramblock notifier might also be possible, but that's a bit
> harder/awkward to configure: e.g., the resize callback is currently passed
> to memory_region_init_resizeable_ram().
Yes, that can mean some code churn. We could avoid passing the callback in
when creating the ramblock, and instead allow ramblocks to opt in to hooks
after the ramblock is created. Maybe we could move resize() out that way
too.
Either way looks good to me.
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-04 17:58 ` Peter Xu
@ 2025-02-04 18:55 ` David Hildenbrand
2025-02-04 20:16 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-02-04 18:55 UTC (permalink / raw)
To: Peter Xu
Cc: William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On 04.02.25 18:58, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 06:50:17PM +0100, David Hildenbrand wrote:
>>> /*
>>> @@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
>>> .instance_size = sizeof(HostMemoryBackend),
>>> .instance_init = host_memory_backend_init,
>>> .instance_post_init = host_memory_backend_post_init,
>>> + .instance_finalize = host_memory_backend_finalize,
>>> .interfaces = (InterfaceInfo[]) {
>>> { TYPE_USER_CREATABLE },
>>> { }
>>> diff --git a/include/system/hostmem.h b/include/system/hostmem.h
>>> index 5c21ca55c0..170849e8a4 100644
>>> --- a/include/system/hostmem.h
>>> +++ b/include/system/hostmem.h
>>> @@ -83,6 +83,7 @@ struct HostMemoryBackend {
>>> HostMemPolicy policy;
>>> MemoryRegion mr;
>>> + RAMBlockNotifier ram_notifier;
>>> };
>>
>> Thinking about Peter's comment, it would be a nice improvement to have a
>> single global memory-backend notifier that looks up the matching memory
>> backend, instead of having one per memory backend.
>
> Yes, this could also avoid O(N**2).
Ah, and now I remember where these 3 patches originate from: virtio-mem
handling.
For virtio-mem I also want to register a remap handler, for example, to
perform the custom preallocation handling.
So there will be at least two instances getting notified (memory
backend, virtio-mem), and the per-ramblock one would only have allowed
one of them to be triggered (at least with a simple callback as we have
today for ->resize).
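Purely illustrative, and assuming virtio-mem grows a ram_notifier field and
a range-preallocation helper (neither exists today), a second consumer
could look roughly like:
/* Sketch only: virtio-mem reacting to the same remap notification. */
static void virtio_mem_ram_remapped(RAMBlockNotifier *n, void *host,
                                    size_t offset, size_t size)
{
    /* hypothetical ram_notifier member embedded in VirtIOMEM */
    VirtIOMEM *vmem = container_of(n, VirtIOMEM, ram_notifier);

    if (memory_region_get_ram_ptr(&vmem->memdev->mr) != host) {
        return;
    }
    /* hypothetical: redo the custom preallocation for the remapped range */
    virtio_mem_prealloc_range(vmem, offset, size);
}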
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-04 18:55 ` David Hildenbrand
@ 2025-02-04 20:16 ` Peter Xu
2025-02-05 16:27 ` William Roche
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2025-02-04 20:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: William Roche, kvm, qemu-devel, qemu-arm, pbonzini,
richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
> Ah, and now I remember where these 3 patches originate from: virtio-mem
> handling.
>
> For virtio-mem I also want to register a remap handler, for example, to
> perform the custom preallocation handling.
>
> So there will be at least two instances getting notified (memory backend,
> virtio-mem), and the per-ramblock one would only have allowed one of them
> to be triggered (at least with a simple callback as we have today for
> ->resize).
I see; we can put something about such decisions into the commit log, then
we'll remember.
That said, this still sounds like a per-ramblock thing, so instead of one
hook function we could also have per-ramblock notifier lists.
But I agree the perf issue isn't an immediate concern, so I'll leave that
to you and William. If so, I think we should discuss it in the commit log
too, so that we decide not to care about perf until necessary (or we just
make it per-ramblock..).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-04 20:16 ` Peter Xu
@ 2025-02-05 16:27 ` William Roche
2025-02-05 17:58 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: William Roche @ 2025-02-05 16:27 UTC (permalink / raw)
To: Peter Xu, David Hildenbrand
Cc: kvm, qemu-devel, qemu-arm, pbonzini, richard.henderson, philmd,
peter.maydell, mtosatti, imammedo, eduardo, marcel.apfelbaum,
wangyanan55, zhao1.liu, joao.m.martins
On 2/4/25 21:16, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
>> Ah, and now I remember where these 3 patches originate from: virtio-mem
>> handling.
>>
>> For virtio-mem I also want to register a remap handler, for example, to
>> perform the custom preallocation handling.
>>
>> So there will be at least two instances getting notified (memory backend,
>> virtio-mem), and the per-ramblock one would only have allowed one of them
>> to be triggered (at least with a simple callback as we have today for
>> ->resize).
>
> I see; we can put something about such decisions into the commit log, then
> we'll remember.
>
> That said, this still sounds like a per-ramblock thing, so instead of one
> hook function we could also have per-ramblock notifier lists.
>
> But I agree the perf issue isn't an immediate concern, so I'll leave that
> to you and William. If so, I think we should discuss it in the commit log
> too, so that we decide not to care about perf until necessary (or we just
> make it per-ramblock..).
>
> Thanks,
>
I agree that we could split this fix into 2 parts: one fixing the
hugetlbfs case (ignoring the preallocation setting for the moment), and
the notification mechanism as a second set of patches.
The first part would be the first 3 patches (including a corrected
version of patch 2), and the second part could be an adaptation of the
next 3 patches, with their notification implementation dealing with
merging, dump *and* preallocation setup.
But I'd be happy to help with the implementation of this 2nd aspect too:
In order to apply settings like preallocation to a RAMBlock we need to
find its associated HostMemoryBackend (where we have the 'prealloc' flag).
To do so, we record a RAMBlockNotifier in the HostMemoryBackend struct,
so that the notification triggered by the remap action:
ram_block_notify_remap(block->host, offset, page_size);
will go through the list of notifiers ram_list.ramblock_notifiers and run
the non-NULL ram_block_remapped entries on all of them.
For each of them, we know the associated HostMemoryBackend (as it
contains the RAMBlockNotifier), and we verify which one corresponds to
the host address given, so that we can apply the appropriate settings.
IIUC, my proposal (with David's code) currently has a
per-HostMemoryBackend notification.
Now if I want to implement a per-RAMBlock notification, would you
suggest considering that the 'mr' attribute of a RAMBlock always points
to a HostMemoryBackend.mr, so that we could get the HostMemoryBackend
associated with the block from a
container_of(block->mr, HostMemoryBackend, mr) ?
If this is valid, then we could apply the appropriate settings from
there, but can't we have RAMBlocks not pointing to a HostMemoryBackend.mr ?
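For illustration only, here is an untested sketch of a guarded variant of
that lookup, using the region's owner and an object cast rather than a bare
container_of(), so that RAMBlocks not owned by a memory backend simply
return NULL:
/* Sketch only: resolve the HostMemoryBackend behind a RAMBlock, if any. */
static HostMemoryBackend *ram_block_get_backend(RAMBlock *block)
{
    Object *owner = block->mr ? memory_region_owner(block->mr) : NULL;

    /*
     * object_dynamic_cast() filters out RAMBlocks whose region is not
     * owned by a memory backend (ROMs, internal regions, ...).
     */
    if (!owner || !object_dynamic_cast(owner, TYPE_MEMORY_BACKEND)) {
        return NULL;
    }
    return MEMORY_BACKEND(owner);
}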
I'm probably confused about what you are referring to.
So how would you suggest that I make the notification per-ramblock ?
Thanks in advance for your feedback.
I'll send a corrected version of the first 3 patches, unless you want to
go with the current version of the patches 4/6, 5/6 and 6/6, so that we
can deal with preallocation.
Please let me know.
Thanks,
William.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v7 6/6] hostmem: Handle remapping of RAM
2025-02-05 16:27 ` William Roche
@ 2025-02-05 17:58 ` Peter Xu
0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2025-02-05 17:58 UTC (permalink / raw)
To: William Roche
Cc: David Hildenbrand, kvm, qemu-devel, qemu-arm, pbonzini,
richard.henderson, philmd, peter.maydell, mtosatti, imammedo,
eduardo, marcel.apfelbaum, wangyanan55, zhao1.liu, joao.m.martins
On Wed, Feb 05, 2025 at 05:27:50PM +0100, William Roche wrote:
> On 2/4/25 21:16, Peter Xu wrote:
> > On Tue, Feb 04, 2025 at 07:55:52PM +0100, David Hildenbrand wrote:
> > > Ah, and now I remember where these 3 patches originate from: virtio-mem
> > > handling.
> > >
> > > For virtio-mem I also want to register a remap handler, for example, to
> > > perform the custom preallocation handling.
> > >
> > > So there will be at least two instances getting notified (memory backend,
> > > virtio-mem), and the per-ramblock one would only have allowed one of them
> > > to be triggered (at least with a simple callback as we have today for
> > > ->resize).
> >
> > I see; we can put something about such decisions into the commit log, then
> > we'll remember.
> >
> > That said, this still sounds like a per-ramblock thing, so instead of one
> > hook function we could also have per-ramblock notifier lists.
> >
> > But I agree the perf issue isn't an immediate concern, so I'll leave that
> > to you and William. If so, I think we should discuss it in the commit log
> > too, so that we decide not to care about perf until necessary (or we just
> > make it per-ramblock..).
> >
> > Thanks,
> >
>
>
> I agree that we could split this fix into 2 parts: one fixing the
> hugetlbfs case (ignoring the preallocation setting for the moment), and the
> notification mechanism as a second set of patches.
>
> The first part would be the first 3 patches (including a corrected version
> of patch 2), and the second part could be an adaptation of the next 3
> patches, with their notification implementation dealing with merging, dump
> *and* preallocation setup.
>
>
> But I'd be happy to help with the implementation of this 2nd aspect too:
>
> In order to apply settings like preallocation to a RAMBlock we need to find
> its associated HostMemoryBackend (where we have the 'prealloc' flag).
> To do so, we record a RAMBlockNotifier in the HostMemoryBackend struct, so
> that the notification triggered by the remap action:
> ram_block_notify_remap(block->host, offset, page_size);
> will go through the list of notifiers ram_list.ramblock_notifiers and run the
> non-NULL ram_block_remapped entries on all of them.
>
> For each of them, we know the associated HostMemoryBackend (as it contains
> the RAMBlockNotifier), and we verify which one corresponds to the host
> address given, so that we can apply the appropriate settings.
>
> IIUC, my proposal (with David's code) currently has a per-HostMemoryBackend
> notification.
>
> Now if I want to implement a per-RAMBlock notification, would you suggest
> considering that the 'mr' attribute of a RAMBlock always points to a
> HostMemoryBackend.mr, so that we could get the HostMemoryBackend associated
> with the block from a
> container_of(block->mr, HostMemoryBackend, mr) ?
>
> If this is valid, then we could apply the appropriate settings from there,
> but can't we have RAMBlocks not pointing to a HostMemoryBackend.mr ?
Yes, QEMU definitely has ramblocks that are not backed by memory backends.
However, each memory backend must have its own ramblock.
IIUC what we need to do is let host_memory_backend_memory_complete()
register a per-ramblock notifier on top of its ramblock, which can be
referenced via backend->mr.ram_block.
>
>
> I'm probably confused about what you are referring to.
> So how would you suggest that I make the notification per-ramblock ?
> Thanks in advance for your feedback.
>
>
> I'll send a corrected version of the first 3 patches, unless you want to go
> with the current version of the patches 4/6, 5/6 and 6/6, so that we can
> deal with preallocation.
I don't feel strongly, but I can explain how the per-ramblock approach
could be done.
One interesting thing I found is that we actually have such a notifier
list in ramblocks already.. see:
struct RAMBlock {
...
QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
...
}
I guess that's some leftover from the global ramblock notifier.. e.g. I
tried removing that line and qemu still compiles fine.
Then instead of removing it, we could make that the per-ramblock list.
One way to do this is:
- Patch 1: refactor the current code, letting RAMBlock.resized() be a
notifier instead of a fn() pointer passed over from
memory_region_init_resizeable_ram(). It means we can remove
RAMBlock.resized() and make fw_cfg_resized() become a notifier, taking a
RAM_BLOCK_RESIZED event instead.
- Patch 2: introduce another RAM_BLOCK_REMAPPED event, then host backends
(probably in host_memory_backend_memory_complete(), after alloc() is done
so that the ramblock will always be available..) can register a notifier
only looking for REMAPPED.
Then in the future virtio-mem can register similarly on its specific
ramblock for REMAPPED only.
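To make that concrete, here is an untested sketch of the registration side;
the ram_block_add_notifier() helper is made up, and it reuses the existing
RAMBlock.ramblock_notifiers list head plus the ram_block_remapped callback
and host_memory_backend_ram_remapped() from this series:
/* Sketch only: attach a notifier to a single RAMBlock's own list. */
static void ram_block_add_notifier(RAMBlock *rb, RAMBlockNotifier *n)
{
    QLIST_INSERT_HEAD(&rb->ramblock_notifiers, n, next);
}

/*
 * In host_memory_backend_memory_complete(), after bc->alloc() succeeded,
 * the backend's ramblock exists and we can register for remap events only.
 */
static void host_memory_backend_register_remap(HostMemoryBackend *backend)
{
    RAMBlock *rb = backend->mr.ram_block;

    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
    ram_block_add_notifier(rb, &backend->ram_notifier);
}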
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread