qemu-devel.nongnu.org archive mirror
* [PATCH 0/1] Qemu crashes on VM migration after a handled memory error
@ 2023-09-06 13:59 William Roche
  2023-09-06 13:59 ` [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: William Roche @ 2023-09-06 13:59 UTC (permalink / raw)
  To: qemu-devel; +Cc: william.roche, joao.m.martins

From: William Roche <william.roche@oracle.com>

A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to correct this problem, I suggest treating the poisoned pages as if
they were zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".
This fix is scripts/checkpatch.pl clean.


William Roche (1):
  migration: skip poisoned memory pages on "ram saving" phase

 accel/kvm/kvm-all.c    | 14 ++++++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   | 10 ++++++++++
 migration/ram.c        |  3 ++-
 4 files changed, 31 insertions(+), 1 deletion(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-06 13:59 [PATCH 0/1] Qemu crashes on VM migration after a handled memory error William Roche
@ 2023-09-06 13:59 ` William Roche
  2023-09-06 14:19   ` Joao Martins
  0 siblings, 1 reply; 34+ messages in thread
From: William Roche @ 2023-09-06 13:59 UTC (permalink / raw)
  To: qemu-devel; +Cc: william.roche, joao.m.martins

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.

The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page_to_file
#5  save_zero_page
#6  ram_save_target_page_legacy
#7  ram_save_host_page
#8  ram_find_and_save_block
#9  ram_save_iterate
#10 qemu_savevm_state_iterate
#11 migration_iteration_run
#12 migration_thread
#13 qemu_thread_start

Fix it by considering poisoned pages as if they were zero-pages for
the migration copy. This fix also works with underlying large pages,
taking into account the RAMBlock segment "page-size".

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 14 ++++++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   | 10 ++++++++++
 migration/ram.c        |  3 ++-
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2ba7521695..24a7709495 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
+bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
+{
+    HWPoisonPage *pg;
+    ram_addr_t ram_addr = (ram_addr_t) offset;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if ((ram_addr >= pg->ram_addr) &&
+            (ram_addr - pg->ram_addr < block->page_size)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 235dc661bc..c0a31611df 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index ebdca41052..a2196e9e6b 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -580,4 +580,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_page - indicate if the given page is poisoned
+ * @block: memory block of the given page
+ * @ram_addr: offset of the page
+ *
+ * Returns: true: page is poisoned
+ *          false: page not yet poisoned
+ */
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index 9040d66e61..48d875b12d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1145,7 +1145,8 @@ static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
     uint8_t *p = block->host + offset;
     int len = 0;
 
-    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
+    if ((kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) ||
+        buffer_is_zero(p, TARGET_PAGE_SIZE)) {
         len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
         qemu_put_byte(file, 0);
         len += 1;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-06 13:59 ` [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase William Roche
@ 2023-09-06 14:19   ` Joao Martins
  2023-09-06 15:16     ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Joao Martins @ 2023-09-06 14:19 UTC (permalink / raw)
  To: William Roche
  Cc: Paolo Bonzini, Juan Quintela, Peter Xu, Leonardo Bras, qemu-devel

On 06/09/2023 14:59, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A memory page poisoned from the hypervisor level is no longer readable.
> Thus, it is now treated as a zero-page for the ram saving migration phase.
> 
> The migration of a VM will crash Qemu when it tries to read the
> memory address space and stumbles on the poisoned page with a similar
> stack trace:
> 
> Program terminated with signal SIGBUS, Bus error.
> #0  _mm256_loadu_si256
> #1  buffer_zero_avx2
> #2  select_accel_fn
> #3  buffer_is_zero
> #4  save_zero_page_to_file
> #5  save_zero_page
> #6  ram_save_target_page_legacy
> #7  ram_save_host_page
> #8  ram_find_and_save_block
> #9  ram_save_iterate
> #10 qemu_savevm_state_iterate
> #11 migration_iteration_run
> #12 migration_thread
> #13 qemu_thread_start
> 
> Fix it by considering poisoned pages as if they were zero-pages for
> the migration copy. This fix also works with underlying large pages,
> taking into account the RAMBlock segment "page-size".
> 
> Signed-off-by: William Roche <william.roche@oracle.com>

You forgot to CC the maintainers; Adding them now

./scripts/get_maintainer.pl is your friend for the next version :)

> ---
>  accel/kvm/kvm-all.c    | 14 ++++++++++++++
>  accel/stubs/kvm-stub.c |  5 +++++
>  include/sysemu/kvm.h   | 10 ++++++++++
>  migration/ram.c        |  3 ++-
>  4 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2ba7521695..24a7709495 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
>      }
>  }
>  
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
> +{
> +    HWPoisonPage *pg;
> +    ram_addr_t ram_addr = (ram_addr_t) offset;
> +
> +    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
> +        if ((ram_addr >= pg->ram_addr) &&
> +            (ram_addr - pg->ram_addr < block->page_size)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
>  void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>  {
>      HWPoisonPage *page;
> diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
> index 235dc661bc..c0a31611df 100644
> --- a/accel/stubs/kvm-stub.c
> +++ b/accel/stubs/kvm-stub.c
> @@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
>  {
>      return 0;
>  }
> +
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
> +{
> +    return false;
> +}
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index ebdca41052..a2196e9e6b 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -580,4 +580,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
>  bool kvm_dirty_ring_enabled(void);
>  
>  uint32_t kvm_dirty_ring_size(void);
> +
> +/**
> + * kvm_hwpoisoned_page - indicate if the given page is poisoned
> + * @block: memory block of the given page
> + * @ram_addr: offset of the page
> + *
> + * Returns: true: page is poisoned
> + *          false: page not yet poisoned
> + */
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
>  #endif
> diff --git a/migration/ram.c b/migration/ram.c
> index 9040d66e61..48d875b12d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1145,7 +1145,8 @@ static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
>      uint8_t *p = block->host + offset;
>      int len = 0;
>  
> -    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> +    if ((kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) ||
> +        buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>          len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
>          qemu_put_byte(file, 0);
>          len += 1;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-06 14:19   ` Joao Martins
@ 2023-09-06 15:16     ` Peter Xu
  2023-09-06 21:29       ` William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-09-06 15:16 UTC (permalink / raw)
  To: Joao Martins, William Roche
  Cc: William Roche, Paolo Bonzini, Juan Quintela, Leonardo Bras,
	qemu-devel

On Wed, Sep 06, 2023 at 03:19:32PM +0100, Joao Martins wrote:
> On 06/09/2023 14:59, William Roche wrote:
> > From: William Roche <william.roche@oracle.com>
> > 
> > A memory page poisoned from the hypervisor level is no longer readable.
> > Thus, it is now treated as a zero-page for the ram saving migration phase.
> > 
> > The migration of a VM will crash Qemu when it tries to read the
> > memory address space and stumbles on the poisoned page with a similar
> > stack trace:
> > 
> > Program terminated with signal SIGBUS, Bus error.
> > #0  _mm256_loadu_si256
> > #1  buffer_zero_avx2
> > #2  select_accel_fn
> > #3  buffer_is_zero
> > #4  save_zero_page_to_file
> > #5  save_zero_page
> > #6  ram_save_target_page_legacy
> > #7  ram_save_host_page
> > #8  ram_find_and_save_block
> > #9  ram_save_iterate
> > #10 qemu_savevm_state_iterate
> > #11 migration_iteration_run
> > #12 migration_thread
> > #13 qemu_thread_start
> > 
> > Fix it by considering poisoned pages as if they were zero-pages for
> > the migration copy. This fix also works with underlying large pages,
> > taking into account the RAMBlock segment "page-size".
> > 
> > Signed-off-by: William Roche <william.roche@oracle.com>
> 
> You forgot to CC the maintainers; Adding them now
> 
> ./scripts/get_maintainer.pl is your friend for the next version :)
> 
> > ---
> >  accel/kvm/kvm-all.c    | 14 ++++++++++++++
> >  accel/stubs/kvm-stub.c |  5 +++++
> >  include/sysemu/kvm.h   | 10 ++++++++++
> >  migration/ram.c        |  3 ++-
> >  4 files changed, 31 insertions(+), 1 deletion(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 2ba7521695..24a7709495 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
> >      }
> >  }
> >  
> > +bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
> > +{
> > +    HWPoisonPage *pg;
> > +    ram_addr_t ram_addr = (ram_addr_t) offset;
> > +
> > +    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
> > +        if ((ram_addr >= pg->ram_addr) &&
> > +            (ram_addr - pg->ram_addr < block->page_size)) {

Just a note..

Probably fine for now to reuse block page size, but IIUC the right thing to
do is to fetch it from the signal info (in QEMU's sigbus_handler()) of
kernel_siginfo.si_addr_lsb.

At least for x86 I think that stores the "shift" of covered poisoned page
(one needs to track the Linux handling of VM_FAULT_HWPOISON_LARGE for a
huge page, though.. not aware of any man page for that).  It'll then work
naturally when Linux huge pages will start to support sub-huge-page-size
poisoning someday.  We can definitely leave that for later.
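
For reference, reading that granule in a handler could look roughly like
this (illustrative sketch only, not the actual QEMU sigbus_handler(); it
just shows where si_addr_lsb would come in):

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Sketch: BUS_MCEERR_AO/_AR reports carry si_addr_lsb, the log2 of the
 * poisoned granule, so a handler could record the exact poisoned range
 * instead of assuming the RAMBlock page size later on. */
static void sigbus_sketch(int sig, siginfo_t *si, void *ctx)
{
    if (si->si_code == BUS_MCEERR_AO || si->si_code == BUS_MCEERR_AR) {
        void  *start = si->si_addr;                  /* poisoned address */
        size_t len   = (size_t)1 << si->si_addr_lsb; /* poisoned granule */
        /* record [start, start + len) as hwpoisoned for later use */
        (void)start; (void)len; (void)sig; (void)ctx;
    }
}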

> > +            return true;
> > +        }
> > +    }
> > +    return false;
> > +}
> > +
> >  void kvm_hwpoison_page_add(ram_addr_t ram_addr)
> >  {
> >      HWPoisonPage *page;
> > diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
> > index 235dc661bc..c0a31611df 100644
> > --- a/accel/stubs/kvm-stub.c
> > +++ b/accel/stubs/kvm-stub.c
> > @@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
> >  {
> >      return 0;
> >  }
> > +
> > +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
> > +{
> > +    return false;
> > +}
> > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > index ebdca41052..a2196e9e6b 100644
> > --- a/include/sysemu/kvm.h
> > +++ b/include/sysemu/kvm.h
> > @@ -580,4 +580,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
> >  bool kvm_dirty_ring_enabled(void);
> >  
> >  uint32_t kvm_dirty_ring_size(void);
> > +
> > +/**
> > + * kvm_hwpoisoned_page - indicate if the given page is poisoned
> > + * @block: memory block of the given page
> > + * @ram_addr: offset of the page
> > + *
> > + * Returns: true: page is poisoned
> > + *          false: page not yet poisoned
> > + */
> > +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
> >  #endif
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 9040d66e61..48d875b12d 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1145,7 +1145,8 @@ static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
> >      uint8_t *p = block->host + offset;
> >      int len = 0;
> >  
> > -    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> > +    if ((kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) ||

Can we move this out of zero page handling?  Zero detection is not
guaranteed to always be the 1st thing to do when processing a guest page.
Currently it'll already skip either rdma or when compression enabled, so
it'll keep crashing there.

Perhaps at the entry of ram_save_target_page_legacy()?
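
I.e. the check could be factored as something like this and called first
thing in that function (sketch only; the helper name below is made up, and
it relies on the kvm_hwpoisoned_page() this patch adds):

/* Sketch: test the page once, before zero detection, compression or
 * RDMA ever dereference block->host + offset. */
static bool page_can_be_read(RAMBlock *block, ram_addr_t offset)
{
    /* a poisoned page must never be read; callers would send it as
     * a zero page instead */
    return !(kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset));
}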

> > +        buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> >          len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
> >          qemu_put_byte(file, 0);
> >          len += 1;
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-06 15:16     ` Peter Xu
@ 2023-09-06 21:29       ` William Roche
  2023-09-09 14:57         ` Joao Martins
  0 siblings, 1 reply; 34+ messages in thread
From: William Roche @ 2023-09-06 21:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Juan Quintela, Leonardo Bras, qemu-devel,
	Joao Martins

On 9/6/23 17:16, Peter Xu wrote:
> 
> Just a note..
> 
> Probably fine for now to reuse block page size, but IIUC the right thing to
> do is to fetch it from the signal info (in QEMU's sigbus_handler()) of
> kernel_siginfo.si_addr_lsb.
> 
> At least for x86 I think that stores the "shift" of covered poisoned page
> (one needs to track the Linux handling of VM_FAULT_HWPOISON_LARGE for a
> huge page, though.. not aware of any man page for that).  It'll then work
> naturally when Linux huge pages will start to support sub-huge-page-size
> poisoning someday.  We can definitely leave that for later.
> 

I totally agree with that !


>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -1145,7 +1145,8 @@ static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
>>>       uint8_t *p = block->host + offset;
>>>       int len = 0;
>>>   
>>> -    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>>> +    if ((kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) ||
> 
> Can we move this out of zero page handling?  Zero detection is not
> guaranteed to always be the 1st thing to do when processing a guest page.
> Currently it'll already skip either rdma or when compression enabled, so
> it'll keep crashing there.
> 
> Perhaps at the entry of ram_save_target_page_legacy()?

Right, as expected, using migration compression with poisoned pages 
crashes even with this fix...

The difficulty I see with placing the poisoned page verification at the
entry of ram_save_target_page_legacy() is what to do to skip the found
poisoned page(s), if any.

Should I continue to treat them as zero pages written with
save_zero_page_to_file? Or should I consider the case of an ongoing
compression use and add new code compressing an empty page with
save_compress_page()?

And what about an RDMA memory region impacted by a memory error?
This is an important aspect.
Does anyone know how this situation is dealt with? And how it should be
handled in Qemu?

--
Thanks,
William.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-06 21:29       ` William Roche
@ 2023-09-09 14:57         ` Joao Martins
  2023-09-11 19:48           ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Joao Martins @ 2023-09-09 14:57 UTC (permalink / raw)
  To: William Roche, Peter Xu
  Cc: Paolo Bonzini, Juan Quintela, Leonardo Bras, qemu-devel

On 06/09/2023 22:29, William Roche wrote:
> On 9/6/23 17:16, Peter Xu wrote:
>>
>> Just a note..
>>
>> Probably fine for now to reuse block page size, but IIUC the right thing to
>> do is to fetch it from the signal info (in QEMU's sigbus_handler()) of
>> kernel_siginfo.si_addr_lsb.
>>
>> At least for x86 I think that stores the "shift" of covered poisoned page
>> (one needs to track the Linux handling of VM_FAULT_HWPOISON_LARGE for a
>> huge page, though.. not aware of any man page for that).  It'll then work
>> naturally when Linux huge pages will start to support sub-huge-page-size
>> poisoning someday.  We can definitely leave that for later.
>>
> 
> I totally agree with that !
>

Provided this bug affects all qemu versions thus far, perhaps this should be a
follow-up series, to make the change easier to bring into the stable tree.

> 
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -1145,7 +1145,8 @@ static int save_zero_page_to_file(PageSearchStatus
>>>> *pss, QEMUFile *file,
>>>>       uint8_t *p = block->host + offset;
>>>>       int len = 0;
>>>>   -    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>>>> +    if ((kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) ||
>>
>> Can we move this out of zero page handling?  Zero detection is not
>> guaranteed to always be the 1st thing to do when processing a guest page.
>> Currently it'll already skip either rdma or when compression enabled, so
>> it'll keep crashing there.
>>
>> Perhaps at the entry of ram_save_target_page_legacy()?
> 
> Right, as expected, using migration compression with poisoned pages crashes even
> with this fix...
> 
> The difficulty I see to place the poisoned page verification on the
> entry of ram_save_target_page_legacy() is what to do to skip the found poison
> page(s) if any ?
> 
> Should I continue to treat them as zero pages written with
> save_zero_page_to_file ? 

MCE had already been forwarded to the guest, so the guest is supposed to not be
using the page (nor rely on its contents). Hence the destination ought to just
see a zero page. So what you said seems like the best course of action.

> Or should I consider the case of an ongoing compression
> use and create a new code compressing an empty page with save_compress_page() ?
> 
The compress code looks to be a tentative compression (not guaranteed IIUC), so
I am not sure it needs any more logic than just adding the check at the top of
ram_save_target_page_legacy() as Peter suggested?

> And what about an RDMA memory region impacted by a memory error ?
> This is an important aspect.
> Does anyone know how this situation is dealt with ? And how it should be handled
> in Qemu ?
> 

If you refer to guest RDMA MRs, that is just guest RAM, and I am not sure we are
even aware of those from qemu. But if you refer to the RDMA transport that sits
below the Qemu file (or rather acts as an implementation of QemuFile), then the
handling in ram_save_target_page_legacy() already seems to cover it.

> -- 
> Thanks,
> William.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-09 14:57         ` Joao Martins
@ 2023-09-11 19:48           ` Peter Xu
  2023-09-12 18:44             ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-09-11 19:48 UTC (permalink / raw)
  To: Joao Martins
  Cc: William Roche, Paolo Bonzini, Juan Quintela, Leonardo Bras,
	qemu-devel

On Sat, Sep 09, 2023 at 03:57:44PM +0100, Joao Martins wrote:
> > Should I continue to treat them as zero pages written with
> > save_zero_page_to_file ? 
> 
> MCE had already been forward to the guest, so guest is supposed to not be using
> the page (nor rely on its contents). Hence destination ought to just see a zero
> page. So what you said seems like the best course of action.
> 
> > Or should I consider the case of an ongoing compression
> > use and create a new code compressing an empty page with save_compress_page() ?
> > 
> The compress code looks to be a tentative compression (not guaranteed IIUC), so
> I am not sure it needs any more logic that just adding at the top of
> ram_save_target_page_legacy() as Peter suggested?
> 
> > And what about an RDMA memory region impacted by a memory error ?
> > This is an important aspect.
> > Does anyone know how this situation is dealt with ? And how it should be handled
> > in Qemu ?
> > 
> 
> If you refer to guest RDMA MRs that is just guest RAM, not sure we are even
> aware of those from qemu. But if you refer to the RDMA transport that sits below
> the Qemu file (or rather acts as an implementation of QemuFile), so handling in
> ram_save_target_page_legacy() already seems to cover it.

I'm also not familiar enough with RDMA, but it looks tricky indeed. AFAIU
it's leveraging RDMA_CONTROL_COMPRESS for zero pages for now (with
RDMACompress.value==0), so it doesn't seem to be using generic migration
protocols.

If we want to fix all places well, one way to consider is to introduce
migration_buffer_is_zero(), which can be a wrapper for buffer_is_zero() by
default, but also returns true for poisoned pages before reading the
buffer.  Then we use it in all three places:

  - For compression, in do_compress_ram_page()
  - For RDMA, in qemu_rdma_write_one()
  - For generic migration, in save_zero_page_to_file() (your current patch)

I suppose then all cases will be fixed.  We need to make sure we'll always
use migration_buffer_is_zero() as the 1st thing to call when QEMU wants to
migrate a target page.  Maybe it'll be worth a comment above that function.
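
As a sketch (in migration/ram.c, reusing the kvm_hwpoisoned_page() helper
this patch introduces; not a final signature):

/* Sketch of the proposed wrapper: meant to be the first thing called on
 * a target page, so nothing ever dereferences a poisoned page. */
bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len)
{
    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
        return true;    /* treat the unreadable page as a zero page */
    }
    return buffer_is_zero(block->host + offset, len);
}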

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-11 19:48           ` Peter Xu
@ 2023-09-12 18:44             ` Peter Xu
  2023-09-14 20:20               ` [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-09-12 18:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: William Roche, Paolo Bonzini, Juan Quintela, Leonardo Bras,
	qemu-devel, lizhijian, lidongchen

On Mon, Sep 11, 2023 at 03:48:38PM -0400, Peter Xu wrote:
> On Sat, Sep 09, 2023 at 03:57:44PM +0100, Joao Martins wrote:
> > > Should I continue to treat them as zero pages written with
> > > save_zero_page_to_file ? 
> > 
> > MCE had already been forward to the guest, so guest is supposed to not be using
> > the page (nor rely on its contents). Hence destination ought to just see a zero
> > page. So what you said seems like the best course of action.
> > 
> > > Or should I consider the case of an ongoing compression
> > > use and create a new code compressing an empty page with save_compress_page() ?
> > > 
> > The compress code looks to be a tentative compression (not guaranteed IIUC), so
> > I am not sure it needs any more logic that just adding at the top of
> > ram_save_target_page_legacy() as Peter suggested?
> > 
> > > And what about an RDMA memory region impacted by a memory error ?
> > > This is an important aspect.
> > > Does anyone know how this situation is dealt with ? And how it should be handled
> > > in Qemu ?
> > > 
> > 
> > If you refer to guest RDMA MRs that is just guest RAM, not sure we are even
> > aware of those from qemu. But if you refer to the RDMA transport that sits below
> > the Qemu file (or rather acts as an implementation of QemuFile), so handling in
> > ram_save_target_page_legacy() already seems to cover it.
> 
> I'm also not familiar enough with RDMA, but it looks tricky indeed. AFAIU
> it's leveraging RDMA_CONTROL_COMPRESS for zero pages for now (with
> RDMACompress.value==0), so it doesn't seem to be using generic migration
> protocols.
> 
> If we want to fix all places well, one way to consider is to introduce
> migration_buffer_is_zero(), which can be a wrapper for buffer_is_zero() by
> default, but also returns true for poisoned pages before reading the
> buffer.  Then we use it in all three places:
> 
>   - For compression, in do_compress_ram_page()
>   - For RDMA, in qemu_rdma_write_one()

Ah, this may not be enough.. sorry.

It seems this is only one path that RDMA will use to save a target page,
for (!rdma->pin_all || !block->is_ram_block) && !block->remote_keys[chunk].

RDMA also seems to be able to merge buffers if they are virtually contiguous
(qemu_rdma_buffer_mergable()), so IIUC it may not trigger an immediate
access to the guest page until later, if it finds contiguous pages, and may
skip even more logic.  I suspect that's also problematic for poisoned pages,
so we should not allow any merged buffer to contain a poisoned page.

Not sure how complicated it will be to fix rdma specifically; copying again
two rdma developers.  One option is that we state the issue in rdma and fix
non-rdma first.  Looks like rdma needs its own fix anyway.

>   - For generic migration, in save_zero_page_to_file() (your current patch)
> 
> I suppose then all cases will be fixed.  We need to make sure we'll always
> use migration_buffer_is_zero() as the 1st thing to call when QEMU wants to
> migrate a target page.  Maybe it'll worth a comment above that function.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error
  2023-09-12 18:44             ` Peter Xu
@ 2023-09-14 20:20               ` William Roche
  2023-09-14 20:20                 ` [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase William Roche
  2023-09-14 21:50                 ` [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error Peter Xu
  0 siblings, 2 replies; 34+ messages in thread
From: William Roche @ 2023-09-14 20:20 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: pbonzini, quintela, leobras, joao.m.martins, william.roche

From: William Roche <william.roche@oracle.com>

A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to correct this problem, I suggest treating the poisoned pages as if
they were zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".
This fix is scripts/checkpatch.pl clean.

v2:
  - adding compressed transfer handling of poisoned pages
 
Testing: I could verify that migration now works with a poisoned page
through standard and compressed migration with 4k and large (2M) pages.

The RDMA transfer is not considered by this patch.

William Roche (1):
  migration: skip poisoned memory pages on "ram saving" phase

 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 23 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 54 insertions(+), 3 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-14 20:20               ` [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error William Roche
@ 2023-09-14 20:20                 ` William Roche
  2023-09-15  3:13                   ` Zhijian Li (Fujitsu)
  2023-09-14 21:50                 ` [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error Peter Xu
  1 sibling, 1 reply; 34+ messages in thread
From: William Roche @ 2023-09-14 20:20 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: pbonzini, quintela, leobras, joao.m.martins, william.roche

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.

The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page_to_file
#5  save_zero_page
#6  ram_save_target_page_legacy
#7  ram_save_host_page
#8  ram_find_and_save_block
#9  ram_save_iterate
#10 qemu_savevm_state_iterate
#11 migration_iteration_run
#12 migration_thread
#13 qemu_thread_start

Fix it by considering poisoned pages as if they were zero-pages for
the migration copy. This fix also works with underlying large pages,
taking into account the RAMBlock segment "page-size".

Standard migration and compressed transfers are handled by this code.
RDMA transfer isn't touched.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 23 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index ff1578bb32..7fb13c8a56 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
+bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
+{
+    HWPoisonPage *pg;
+    ram_addr_t ram_addr = (ram_addr_t) offset;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if ((ram_addr >= pg->ram_addr) &&
+            (ram_addr - pg->ram_addr < block->page_size)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 235dc661bc..c0a31611df 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index ee9025f8e9..858688227a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -570,4 +570,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_page - indicate if the given page is poisoned
+ * @block: memory block of the given page
+ * @ram_addr: offset of the page
+ *
+ * Returns: true: page is poisoned
+ *          false: page not yet poisoned
+ */
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
 #endif
diff --git a/migration/ram-compress.c b/migration/ram-compress.c
index 06254d8c69..1916ce709d 100644
--- a/migration/ram-compress.c
+++ b/migration/ram-compress.c
@@ -34,6 +34,7 @@
 #include "qemu/error-report.h"
 #include "migration.h"
 #include "options.h"
+#include "ram.h"
 #include "io/channel-null.h"
 #include "exec/target_page.h"
 #include "exec/ramblock.h"
@@ -198,7 +199,7 @@ static CompressResult do_compress_ram_page(QEMUFile *f, z_stream *stream,
 
     assert(qemu_file_buffer_empty(f));
 
-    if (buffer_is_zero(p, page_size)) {
+    if (migration_buffer_is_zero(block, offset, page_size)) {
         return RES_ZEROPAGE;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 9040d66e61..fd337f7e65 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1129,6 +1129,26 @@ void ram_release_page(const char *rbname, uint64_t offset)
     ram_discard_range(rbname, offset, TARGET_PAGE_SIZE);
 }
 
+/**
+ * migration_buffer_is_zero: indicate if the page at the given
+ * location is entirely filled with zero, or is a poisoned page.
+ *
+ * @block: block that contains the page
+ * @offset: offset inside the block for the page
+ * @len: size to consider
+ */
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset,
+                                     size_t len)
+{
+    uint8_t *p = block->host + offset;
+
+    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
+        return true;
+    }
+
+    return buffer_is_zero(p, len);
+}
+
 /**
  * save_zero_page_to_file: send the zero page to the file
  *
@@ -1142,10 +1162,9 @@ void ram_release_page(const char *rbname, uint64_t offset)
 static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
                                   RAMBlock *block, ram_addr_t offset)
 {
-    uint8_t *p = block->host + offset;
     int len = 0;
 
-    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
+    if (migration_buffer_is_zero(block, offset, TARGET_PAGE_SIZE)) {
         len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
         qemu_put_byte(file, 0);
         len += 1;
diff --git a/migration/ram.h b/migration/ram.h
index 145c915ca7..805ea2a211 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 void ram_transferred_add(uint64_t bytes);
 void ram_release_page(const char *rbname, uint64_t offset);
 
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error
  2023-09-14 20:20               ` [PATCH v2 0/1] Qemu crashes on VM migration after a handled memory error William Roche
  2023-09-14 20:20                 ` [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase William Roche
@ 2023-09-14 21:50                 ` Peter Xu
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Xu @ 2023-09-14 21:50 UTC (permalink / raw)
  To: William Roche
  Cc: qemu-devel, pbonzini, quintela, leobras, joao.m.martins,
	lizhijian, lidongchen

On Thu, Sep 14, 2023 at 08:20:53PM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A Qemu VM can survive a memory error, as qemu can relay the error to the
> VM kernel which could also deal with it -- poisoning/off-lining the impacted
> page.
> This situation creates a hole in the VM memory address space that the VM kernel
> knows about (an unreadable page or set of pages).
> 
> But the migration of this VM (live migration through the network or
> pseudo-migration with the creation of a state file) will crash Qemu when
> it sequentially reads the memory address space and stumbles on the
> existing hole.
> 
> In order to correct this problem, I suggest to treat the poisoned pages as if
> they were zero-pages for the migration copy.
> This fix also works with underlying large pages, taking into account the
> RAMBlock segment "page-size".
> This fix is scripts/checkpatch.pl clean.
> 
> v2:
>   - adding compressed transfer handling of poisoned pages
>  
> Testing: I could verify that migration now works with a poisoned page
> through standard and compressed migration with 4k and large (2M) pages.
> 
> The RDMA transfer is not considered by this patch.
> 
> William Roche (1):
>   migration: skip poisoned memory pages on "ram saving" phase

If there's a new version, please consider adding a TODO above
control_save_page() noting that poisoned pages are probably still broken
there, so we can remember.
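
Something as small as this would do (sketch):

/*
 * TODO: hwpoisoned pages are not handled on this (RDMA control) path
 * yet; a poisoned page may still be read here.  See the discussion on
 * the "migration: skip poisoned memory pages" thread.
 */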

Reviewed-by: Peter Xu <peterx@redhat.com>

Copy:

lizhijian@fujitsu.com, lidongchen@tencent.com

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-14 20:20                 ` [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase William Roche
@ 2023-09-15  3:13                   ` Zhijian Li (Fujitsu)
  2023-09-15 11:31                     ` William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-09-15  3:13 UTC (permalink / raw)
  To: William Roche, qemu-devel@nongnu.org, peterx@redhat.com
  Cc: pbonzini@redhat.com, quintela@redhat.com, leobras@redhat.com,
	joao.m.martins@oracle.com



On 15/09/2023 04:20, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A memory page poisoned from the hypervisor level is no longer readable.
> Thus, it is now treated as a zero-page for the ram saving migration phase.
> 
> The migration of a VM will crash Qemu when it tries to read the
> memory address space and stumbles on the poisoned page with a similar
> stack trace:
> 
> Program terminated with signal SIGBUS, Bus error.
> #0  _mm256_loadu_si256
> #1  buffer_zero_avx2
> #2  select_accel_fn
> #3  buffer_is_zero
> #4  save_zero_page_to_file
> #5  save_zero_page
> #6  ram_save_target_page_legacy
> #7  ram_save_host_page
> #8  ram_find_and_save_block
> #9  ram_save_iterate
> #10 qemu_savevm_state_iterate
> #11 migration_iteration_run
> #12 migration_thread
> #13 qemu_thread_start
> 
> Fix it by considering poisoned pages as if they were zero-pages for
> the migration copy. This fix also works with underlying large pages,
> taking into account the RAMBlock segment "page-size".
> 
> Standard migration and compressed transfers are handled by this code.
> RDMA transfer isn't touched.
> 


I'm okay with "RDMA isn't touched".
BTW, could you share your reproducing program/hack to poison the page, so that
I am able to take a look at the RDMA part later when I'm free?

Not sure it's suitable to acknowledge an untouched part. Anyway,
Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA


> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c      | 14 ++++++++++++++
>   accel/stubs/kvm-stub.c   |  5 +++++
>   include/sysemu/kvm.h     | 10 ++++++++++
>   migration/ram-compress.c |  3 ++-
>   migration/ram.c          | 23 +++++++++++++++++++++--
>   migration/ram.h          |  2 ++
>   6 files changed, 54 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index ff1578bb32..7fb13c8a56 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
>       }
>   }
>   
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
> +{
> +    HWPoisonPage *pg;
> +    ram_addr_t ram_addr = (ram_addr_t) offset;
> +
> +    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
> +        if ((ram_addr >= pg->ram_addr) &&
> +            (ram_addr - pg->ram_addr < block->page_size)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
>   void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>   {
>       HWPoisonPage *page;
> diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
> index 235dc661bc..c0a31611df 100644
> --- a/accel/stubs/kvm-stub.c
> +++ b/accel/stubs/kvm-stub.c
> @@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
>   {
>       return 0;
>   }
> +
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
> +{
> +    return false;
> +}
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index ee9025f8e9..858688227a 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -570,4 +570,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
>   bool kvm_dirty_ring_enabled(void);
>   
>   uint32_t kvm_dirty_ring_size(void);
> +
> +/**
> + * kvm_hwpoisoned_page - indicate if the given page is poisoned
> + * @block: memory block of the given page
> + * @ram_addr: offset of the page
> + *
> + * Returns: true: page is poisoned
> + *          false: page not yet poisoned
> + */
> +bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
>   #endif
> diff --git a/migration/ram-compress.c b/migration/ram-compress.c
> index 06254d8c69..1916ce709d 100644
> --- a/migration/ram-compress.c
> +++ b/migration/ram-compress.c
> @@ -34,6 +34,7 @@
>   #include "qemu/error-report.h"
>   #include "migration.h"
>   #include "options.h"
> +#include "ram.h"
>   #include "io/channel-null.h"
>   #include "exec/target_page.h"
>   #include "exec/ramblock.h"
> @@ -198,7 +199,7 @@ static CompressResult do_compress_ram_page(QEMUFile *f, z_stream *stream,
>   
>       assert(qemu_file_buffer_empty(f));
>   
> -    if (buffer_is_zero(p, page_size)) {
> +    if (migration_buffer_is_zero(block, offset, page_size)) {
>           return RES_ZEROPAGE;
>       }
>   
> diff --git a/migration/ram.c b/migration/ram.c
> index 9040d66e61..fd337f7e65 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1129,6 +1129,26 @@ void ram_release_page(const char *rbname, uint64_t offset)
>       ram_discard_range(rbname, offset, TARGET_PAGE_SIZE);
>   }
>   
> +/**
> + * migration_buffer_is_zero: indicate if the page at the given
> + * location is entirely filled with zero, or is a poisoned page.
> + *
> + * @block: block that contains the page
> + * @offset: offset inside the block for the page
> + * @len: size to consider
> + */
> +bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset,
> +                                     size_t len)
> +{
> +    uint8_t *p = block->host + offset;
> +
> +    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
> +        return true;
> +    }
> +
> +    return buffer_is_zero(p, len);
> +}
> +
>   /**
>    * save_zero_page_to_file: send the zero page to the file
>    *
> @@ -1142,10 +1162,9 @@ void ram_release_page(const char *rbname, uint64_t offset)
>   static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
>                                     RAMBlock *block, ram_addr_t offset)
>   {
> -    uint8_t *p = block->host + offset;
>       int len = 0;
>   
> -    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> +    if (migration_buffer_is_zero(block, offset, TARGET_PAGE_SIZE)) {
>           len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
>           qemu_put_byte(file, 0);
>           len += 1;
> diff --git a/migration/ram.h b/migration/ram.h
> index 145c915ca7..805ea2a211 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -65,6 +65,8 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>   void ram_transferred_add(uint64_t bytes);
>   void ram_release_page(const char *rbname, uint64_t offset);
>   
> +bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len);
> +
>   int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
>   bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
>   void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-15  3:13                   ` Zhijian Li (Fujitsu)
@ 2023-09-15 11:31                     ` William Roche
  2023-09-18  3:47                       ` Zhijian Li (Fujitsu)
  2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
  0 siblings, 2 replies; 34+ messages in thread
From: William Roche @ 2023-09-15 11:31 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), qemu-devel@nongnu.org, peterx@redhat.com
  Cc: pbonzini@redhat.com, quintela@redhat.com, leobras@redhat.com,
	joao.m.martins@oracle.com, lidongchen

[-- Attachment #1: Type: text/plain, Size: 4667 bytes --]

On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
> 
> 
> I'm okay with "RDMA isn't touched".
> BTW, could you share your reproducing program/hacking to poison the page, so that
> i am able to take a look the RDMA part later when i'm free.
> 
> Not sure it's suitable to acknowledge a not touched part. Anyway
> Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
> 

Thanks.
As you asked for a procedure to inject memory errors into a running VM,
I've attached to this email the source code (mce_process_react.c) of a
program that will help to target the error injection in the VM.

(Be careful that error injection is currently not working on AMD
platforms -- this is a work in progress in a separate qemu thread)


The general idea:
We are going to target a process memory page running inside a VM to see
what happens when we inject an error on the underlying physical page at
the platform (hypervisor) level.
To have a better view of what's going on, we'll use a process made for
this: Its goal is to allocate a memory page, and create a SIGBUS
handler to report when it receives this signal. It will also wait before
touching this page to see what happens next.

     Compiling this tool:
     $ gcc -o mce_process_react_x86 mce_process_react.c


Let's try that:
This procedure shows the best case scenario, where an error injected at
the platform level is reported up to the guest process using it.
Note that qemu should be started with root privilege.

     1. Choose a process running in the VM (and identify a memory page
you want to target, and get its physical address – crash(8) vtop can
help with that) or run the attached mce_process_react example (compiled
for your platform mce_process_react_[x86|arm]) with an option to be
informed early of an _AO error (-e) and to wait for ENTER before reading
the allocated page (-w 0):

[root@VM ]# ./mce_process_react_x86 -e -w 0
Setting Early kill... Ok

Data pages at 0x7fa0f9b25000  physically 0x200f2fa000

Press ENTER to continue with page reading


     2. Go into the VM monitor to get the translation from "Guest
Physical Address to Host Physical Address" or "Host Virtual Address":

  (qemu) gpa2hpa 0x200f2fa000
Host physical address for 0x200f2fa000 (ram-node1) is 0x46f12fa000


     3. Before we inject the error, we want to keep track of the VM
console output (in a separate window).
If you are using libvirt: # virsh console myvm


     4. We now prepare for the error injection at the platform level to
the address we found.  To do so, we'll need to use the hwpoison-inject
module (x86)
Be careful, as hwpoison takes Page Frame Numbers and this PFN is not the
physical address – you need to remove the last 12 bits (the last 3 zeros
of the above address) !

[root@hv ]# modprobe hwpoison-inject
[root@hv ]# echo 0x46f12fa > /sys/kernel/debug/hwpoison/corrupt-pfn

        If you see "Operation not permitted" error when writing as root
on corrupt-pfn, you may be facing a "kernel_lockdown(7)" which is
enabled on SecureBoot systems (can be verified with
"mokutil --sb-state"). In this case, turn SecureBoot off  (at the UEFI
level for example)

     5. Look at the qemu output (either on the terminal where qemu was
started, or if you are using libvirt: tail /var/log/libvirt/qemu/myvm)

2022-08-31T13:52:25.645398Z qemu-system-x86_64: warning: Guest MCE 
Memory Error at QEMU addr 0x7eeeace00000 and GUEST addr 0x200f200 of 
type BUS_MCEERR_AO injected

     6. On the guest console:
We'll see the VM reaction to the injected error:

[  155.805149] Disabling lock debugging due to kernel taint
[  155.806174] mce: [Hardware Error]: Machine check events logged
[  155.807120] Memory failure: 0x200f200: Killing mce_process_rea:3548 
due to hardware memory corruption
[  155.808877] Memory failure: 0x200f200: recovery action for dirty LRU 
page: Recovered

     7. The Guest process that we started at the first step gives:

Signal 7 received
BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000

At this stage, the VM has a poisoned page, and a migration of this VM
needs to be fixed in order to avoid accessing the poisoned page.

     8. The process continues to run (as it handled the SIGBUS).
Now if you press ENTER on this process terminal, it will try to read the
page, which will generate a new MCE (a synchronous one) at the VM level that
will be sent to this process:

Signal 7 received
BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
Exit from the signal handler on BUS_MCEERR_AR

     9. The VM console shows:
[ 2520.895263] MCE: Killing mce_process_rea:3548 due to hardware memory 
corruption fault at 7f45e5265000

     10. The VM continues to run...
With a poisoned page in its address space

HTH,
William.

[-- Attachment #2: mce_process_react.c --]
[-- Type: text/x-csrc, Size: 5698 bytes --]

#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>

#define PAGEMAP_ENTRY 8
#define GET_BIT(X,Y) (X & ((uint64_t)1<<Y)) >> Y
#define GET_PFN(X) X & 0x7FFFFFFFFFFFFF
#define ALLOC_PAGES 1

const int __endian_bit = 1;
#define is_bigendian() ( (*(char*)&__endian_bit) == 0 )

/*
 * Set the early kill mode reaction state to MCE error.
 */
static void early_reaction() {
   printf("Setting Early kill... ");
   if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) == 0)
      printf("Ok\n");
   else
      printf("Failure !\n");
}

/*
 * Generate an error on the given page.
 */
static void memory_error_advise(void* virtual_page) {
   int ret;

   ret = madvise(virtual_page, 4096, MADV_HWPOISON);
   if (ret)
      printf("Poisoning failed - madvise: %s", strerror(errno));
}

/*
 * Return the physical address associated to a given local virtual address,
 * or -1 in case of an error.
 */
static uint64_t physical_address(uint64_t virt_addr) {
   char path_buf [0x100];
   FILE * f;
   uint64_t read_val, file_offset, pfn = 0;
   long pgsz;
   unsigned char c_buf[PAGEMAP_ENTRY];
   pid_t my_pid = getpid();
   int status, i;

   sprintf(path_buf, "/proc/%u/pagemap", my_pid);
   
   f = fopen(path_buf, "rb");
   if(!f){
      printf("Error! Cannot open %s\n", path_buf);
      return (uint64_t)-1;
   }
   
   //Shifting by virt-addr-offset number of bytes
   //and multiplying by the size of an address
   //(the size of an entry in pagemap file)
   pgsz = sysconf(_SC_PAGESIZE);
   file_offset = virt_addr / (uint64_t)pgsz * PAGEMAP_ENTRY;
   status = fseek(f, (long)file_offset, SEEK_SET);
   if(status){
      perror("Failed to do fseek!");
      fclose(f);
      return (uint64_t)-1;
   }

   for(i=0; i < PAGEMAP_ENTRY; i++){
      int c = getc(f);
      if(c==EOF){
         fclose(f);
         return (uint64_t)-1;
      }
      if(is_bigendian())
           c_buf[i] = (unsigned char)c;
      else
           c_buf[PAGEMAP_ENTRY - i - 1] = (unsigned char)c;
   }
   fclose(f);

   read_val = 0;
   for(i=0; i < PAGEMAP_ENTRY; i++){
      read_val = (read_val << 8) + c_buf[i];
   }

   if(GET_BIT(read_val, 63)) { // Bit  63    page present
      pfn = GET_PFN(read_val);
   } else {
      printf("Page not present !\n");
   }
   if(GET_BIT(read_val, 62)) // Bit  62    page swapped
      printf("Page swapped\n");

   if (pfn == 0)
      return (uint64_t)-1;

   return pfn * (uint64_t)pgsz;
}

/*
 * SIGBUS handler to display the given information.
 */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
   printf("Signal %d received: ", signum);
   printf("%s on vaddr: %llx\n",
      (siginfo->si_code == 4? "BUS_MCEERR_AR":"BUS_MCEERR_AO"),
      siginfo->si_addr);

  if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
	fprintf(stderr, "Exit from the signal handler on BUS_MCEERR_AR\n");
	_exit(1);
  }
}

int main(int argc, char ** argv) {
   int opt, early_react = 0, madvise_error=0, wait_time=5, i;
   struct sigaction my_sigaction;
   uint64_t virt_addr = 0, phys_addr;
   void *local_pnt;

   // Need to have the CAP_SYS_ADMIN capability to get PFNs values in pagemap.
   if (getuid() != 0) {
      fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
      exit(EXIT_FAILURE);
   }

   while ((opt = getopt(argc, argv, "emw:")) != -1) {
      switch (opt) {
      case 'e':
         early_react = 1;
         break;
      case 'm':
         madvise_error=1;
         break;
      case 'w':
         wait_time=atoi(optarg);
         break;
      default: /* '?' */
         fprintf(stderr, "Usage: %s [-e] [-m] [-w seconds]\n", argv[0]);
         exit(EXIT_FAILURE);
      }
   }

   // attach our SIGBUS handler.
   memset(&my_sigaction, 0, sizeof(my_sigaction));
   sigemptyset(&my_sigaction.sa_mask);
   my_sigaction.sa_sigaction = sigbus_action;
   my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
   if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
      perror("Signal handler attach failed");
      exit(EXIT_FAILURE);
   }

   if (early_react)
      early_reaction();

   // Allocate nx4K private pages.
   local_pnt = mmap(NULL, ALLOC_PAGES*4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
   if (local_pnt == MAP_FAILED) {
      fprintf(stderr, "Memory Allocation failed !\n");
      exit(EXIT_FAILURE);
   }
   virt_addr = (uint64_t)local_pnt;

   // Dirty / map the pages.
   for (i=0; i < ALLOC_PAGES; i++) {
      sprintf(((char *)local_pnt + i*4096), "My page number %d\n", i);
   }

   phys_addr = physical_address(virt_addr);
   if (phys_addr == -1) {
      fprintf(stderr, "Virtual address translation 0x%llx failed\n", 
         (unsigned long long)virt_addr);
      exit(EXIT_FAILURE);
   }
   printf("\nData pages at 0x%llx  physically 0x%llx\n",
      (unsigned long long)virt_addr, (unsigned long long)phys_addr);
   fflush(stdout);

   // Explicit error
   if (madvise_error)
      memory_error_advise((void*) virt_addr);

   // Now Wait !
   if (wait_time > 0) {
      sleep((unsigned int)wait_time);
   } else {
      printf("\nPress ENTER to continue with page reading\n");
      i = fgetc(stdin);
   }
   
   // read the strings at the beginning of each page.
   for (i=0; i < ALLOC_PAGES; i++) {
      printf("%s", ((char *)local_pnt + i*4096));
   }

   phys_addr = physical_address(virt_addr);
   if (phys_addr == -1) {
      fprintf(stderr, "Virtual address translation 0x%llx failed\n", 
         (unsigned long long)virt_addr);
   } else {
      printf("\nData pages at 0x%llx  physically 0x%llx\n",
         (unsigned long long)virt_addr, (unsigned long long)phys_addr);
   }

   return 0;
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-15 11:31                     ` William Roche
@ 2023-09-18  3:47                       ` Zhijian Li (Fujitsu)
  2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
  1 sibling, 0 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-09-18  3:47 UTC (permalink / raw)
  To: William Roche, qemu-devel@nongnu.org, peterx@redhat.com
  Cc: pbonzini@redhat.com, quintela@redhat.com, leobras@redhat.com,
	joao.m.martins@oracle.com, lidongchen@tencent.com



On 15/09/2023 19:31, William Roche wrote:
> On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
>>
>>
>> I'm okay with "RDMA isn't touched".
>> BTW, could you share your reproducing program/hacking to poison the page, so that
>> i am able to take a look the RDMA part later when i'm free.
>>
>> Not sure it's suitable to acknowledge a not touched part. Anyway
>> Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
>>
> 
> Thanks.
> As you asked for a procedure to inject memory errors into a running VM,
> I've attached to this email the source code (mce_process_react.c) of a
> program that will help to target the error injection in the VM.
> 


Thanks a lot for the details, noted :)

Thanks
Zhijian



> (Be careful that error injection is currently not working on AMD
> platforms -- this is a work in progress in a separate qemu thread)
> 
> 
> The general idea:
> We are going to target a memory page of a process running inside a VM to see
> what happens when we inject an error on the underlying physical page at
> the platform (hypervisor) level.
> To have a better view of what's going on, we'll use a process made for
> this: its goal is to allocate a memory page and install a SIGBUS
> handler that reports when it receives this signal. It will also wait before
> touching this page again, to see what happens next.
> 
>      Compiling this tool:
>      $ gcc -o mce_process_react_x86 mce_process_react.c
> 
> 
> Let's try that:
> This procedure shows the best case scenario, where an error injected at
> the platform level is reported up to the guest process using it.
> Note that qemu should be started with root privilege.
> 
>      1. Choose a process running in the VM (and identify a memory page
> you want to target, and get its physical address – crash(8) vtop can
> help with that) or run the attached mce_process_react example (compiled
> for your platform, mce_process_react_[x86|arm]) with the option to be
> informed early of an AO error (-e) and to wait for ENTER before reading
> the allocated page (-w 0):
> 
> [root@VM ]# ./mce_process_react_x86 -e -w 0
> Setting Early kill... Ok
> 
> Data pages at 0x7fa0f9b25000  physically 0x200f2fa000
> 
> Press ENTER to continue with page reading
> 
> 
>      2. Go into the VM monitor to get the translation from "Guest
> Physical Address to Host Physical Address" or "Host Virtual Address":
> 
>   (qemu) gpa2hpa 0x200f2fa000
> Host physical address for 0x200f2fa000 (ram-node1) is 0x46f12fa000
> 
> 
>      3. Before we inject the error, we want to keep track of the VM
> console output (in a separate window).
> If you are using libvirt: # virsh console myvm
> 
> 
>      4. We now prepare the error injection at the platform level for
> the address we found.  To do so, we'll need to use the hwpoison-inject
> module (x86).
> Be careful, as hwpoison takes Page Frame Numbers: the PFN is not the
> physical address – you need to drop the last 12 bits (the last 3 hex
> digits, zeros in the above address)!
> 
> [root@hv ]# modprobe hwpoison-inject
> [root@hv ]# echo 0x46f12fa > /sys/kernel/debug/hwpoison/corrupt-pfn
> 
>         If you see "Operation not permitted" error when writing as root
> on corrupt-pfn, you may be facing a "kernel_lockdown(7)" which is
> enabled on SecureBoot systems (can be verified with
> "mokutil --sb-state"). In this case, turn SecureBoot off  (at the UEFI
> level for example)
> 
>      5. Look at the qemu output (either on the terminal where qemu was
> started or, if you are using libvirt:  tail /var/log/libvirt/qemu/myvm)
> 
> 2022-08-31T13:52:25.645398Z qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr 0x7eeeace00000 and GUEST addr 0x200f200 of type BUS_MCEERR_AO injected
> 
>      6. On the guest console:
> We'll see the VM reaction to the injected error:
> 
> [  155.805149] Disabling lock debugging due to kernel taint
> [  155.806174] mce: [Hardware Error]: Machine check events logged
> [  155.807120] Memory failure: 0x200f200: Killing mce_process_rea:3548 due to hardware memory corruption
> [  155.808877] Memory failure: 0x200f200: recovery action for dirty LRU page: Recovered
> 
>      7. The Guest process that we started at the first step gives:
> 
> Signal 7 received
> BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
> 
> At this stage, the VM has a poisoned page, and a migration of this VM
> needs the fix discussed here in order to avoid accessing the poisoned page.
> 
>      8. The process continues to run (as it handled the SIGBUS).
> Now if you press ENTER on this process terminal, it will try to read the
> page, which will generate a new MCE (a synchronous one) at the VM level
> that will be sent to this process:
> 
> Signal 7 received
> BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
> Exit from the signal handler on BUS_MCEERR_AR
> 
>      9. The VM console shows:
> [ 2520.895263] MCE: Killing mce_process_rea:3548 due to hardware memory corruption fault at 7f45e5265000
> 
>      10. The VM continues to run...
> with a poisoned page in its address space.
> 
> HTH,
> William.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-15 11:31                     ` William Roche
  2023-09-18  3:47                       ` Zhijian Li (Fujitsu)
@ 2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
  2023-09-20 12:11                         ` William Roche
  2023-09-20 23:53                         ` [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
  1 sibling, 2 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-09-20 10:04 UTC (permalink / raw)
  To: William Roche, qemu-devel@nongnu.org, peterx@redhat.com
  Cc: pbonzini@redhat.com, quintela@redhat.com, leobras@redhat.com,
	joao.m.martins@oracle.com, lidongchen@tencent.com



On 15/09/2023 19:31, William Roche wrote:
> On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
>>
>>
>> I'm okay with "RDMA isn't touched".
>> BTW, could you share your reproducing program/hacking to poison the page, so that
>> i am able to take a look the RDMA part later when i'm free.
>>
>> Not sure it's suitable to acknowledge a not touched part. Anyway
>> Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
>>
> 
> Thanks.
> As you asked for a procedure to inject memory errors into a running VM,
> I've attached to this email the source code (mce_process_react.c) of a
> program that will help to target the error injection in the VM.


I just tried your hwpoison program and did an RDMA migration. The migration failed, but fortunately
the source side is still alive :).

(qemu) Failed to register chunk!: Bad address
Chunk details: block: 0 chunk index 671 start 139955096518656 end 139955097567232 host 139955096518656 local 139954392924160 registrations: 636
qemu-system-x86_64: cannot get lkey
qemu-system-x86_64: rdma migration: write error! -22
qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 2(ram): -22
qemu-system-x86_64: Early error. Sending error.


Since the current RDMA migration transfers guest memory in 1M chunks by default, we may need to:

option 1: reduce the chunk size to 1 page
option 2: handle hwpoisoned chunks specially

However, because another protocol may be used instead, it's also possible to leave the issue unfixed for now.

Tested-by: Li Zhijian <lizhijian@fujitsu.com>

Thanks
Zhijian




> 
> (Be careful that error injection is currently not working on AMD
> platforms -- this is a work in progress in a separate qemu thread)
> 
> 
> The general idea:
> We are going to target a process memory page running inside a VM to see
> what happens when we inject an error on the underlying physical page at
> the platform (hypervisor) level.
> To have a better view of what's going on, we'll use a process made for
> this: Its goal is to allocate a memory page, and create a SIGBUS
> handler to inform when it receives this signal. It will also wait before
> touching this page to see what happens next.
> 
>      Compiling this tool:
>      $ gcc -o mce_process_react_x86 mce_process_react.c
> 
> 
> Let's try that:
> This procedure shows the best case scenario, where an error injected at
> the platform level is reported up to the guest process using it.
> Note that qemu should be started with root privilege.
> 
>      1. Choose a process running in the VM (and identify a memory page
> you want to target, and get its physical address – crash(8) vtop can
> help with that) or run the attached mce_process_react example (compiled
> for your platform mce_process_react_[x86|arm]) with an option to be
> early informed of _AO error (-e) and wait ENTER to continue with reading
> the allocated page (-w 0):
> 
> [root@VM ]# ./mce_process_react_x86 -e -w 0
> Setting Early kill... Ok
> 
> Data pages at 0x7fa0f9b25000  physically 0x200f2fa000
> 
> Press ENTER to continue with page reading
> 
> 
>      2. Go into the VM monitor to get the translation from "Guest
> Physical Address to Host Physical Address" or "Host Virtual Address":
> 
>   (qemu) gpa2hpa 0x200f2fa000
> Host physical address for 0x200f2fa000 (ram-node1) is 0x46f12fa000
> 
> 
>      3. Before we inject the error, we want to keep track of the VM
> console output (in a separate window).
> If you are using libvirt: # virsh console myvm
> 
> 
>      4. We now prepare for the error injection at the platform level to
> the address we found.  To do so, we'll need to use the hwpoison-inject
> module (x86)
> Be careful, as hwpoison takes Page Frame Numbers and this PFN is not the
> physical address – you need to remove the last 12 bits (the last 3 zeros
> of the above address) !
> 
> [root@hv ]# modprobe hwpoison-inject
> [root@hv ]# echo 0x46f12fa > /sys/kernel/debug/hwpoison/corrupt-pfn
> 
>         If you see "Operation not permitted" error when writing as root
> on corrupt-pfn, you may be facing a "kernel_lockdown(7)" which is
> enabled on SecureBoot systems (can be verified with
> "mokutil --sb-state"). In this case, turn SecureBoot off  (at the UEFI
> level for example)
> 
>      5. Look at the qemu output (either on the terminal where qemu was
> started or  if you are using libvirt:  tail /var/log/libvirt/qemu/myvm
> 
> 2022-08-31T13:52:25.645398Z qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr 0x7eeeace00000 and GUEST addr 0x200f200 of type BUS_MCEERR_AO injected
> 
>      6. On the guest console:
> We'll see the VM reaction to the injected error:
> 
> [  155.805149] Disabling lock debugging due to kernel taint
> [  155.806174] mce: [Hardware Error]: Machine check events logged
> [  155.807120] Memory failure: 0x200f200: Killing mce_process_rea:3548 due to hardware memory corruption
> [  155.808877] Memory failure: 0x200f200: recovery action for dirty LRU page: Recovered
> 
>      7. The Guest process that we started at the first step gives:
> 
> Signal 7 received
> BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
> 
> At this stage, the VM has a poisoned page, and a migration of this VM
> needs to be fixed in order to avoid accessing the poisoned page.
> 
>      8. The process continues to run (as it handled the SIGBUS).
> Now if you press ENTER on this process terminal, it will try to read the
> page which will generate a new MCE (a synchronous one) at VM level which
> will be sent to this process:
> 
> Signal 7 received
> BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
> Exit from the signal handler on BUS_MCEERR_AR
> 
>      9. The VM console shows:
> [ 2520.895263] MCE: Killing mce_process_rea:3548 due to hardware memory corruption fault at 7f45e5265000
> 
>      10. The VM continues to run...
> With a poisoned page in its address space
> 
> HTH,
> William.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
@ 2023-09-20 12:11                         ` William Roche
  2023-09-20 23:53                         ` [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
  1 sibling, 0 replies; 34+ messages in thread
From: William Roche @ 2023-09-20 12:11 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), qemu-devel@nongnu.org, peterx@redhat.com
  Cc: pbonzini@redhat.com, quintela@redhat.com, leobras@redhat.com,
	joao.m.martins@oracle.com, lidongchen@tencent.com

Thank you Zhijian for your feedback.

So I'll try to push this change today.

Cheers,
William.


On 9/20/23 12:04, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 15/09/2023 19:31, William Roche wrote:
>> On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
>>>
>>>
>>> I'm okay with "RDMA isn't touched".
>>> BTW, could you share your reproducing program/hacking to poison the page, so that
>>> i am able to take a look the RDMA part later when i'm free.
>>>
>>> Not sure it's suitable to acknowledge a not touched part. Anyway
>>> Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
>>>
>>
>> Thanks.
>> As you asked for a procedure to inject memory errors into a running VM,
>> I've attached to this email the source code (mce_process_react.c) of a
>> program that will help to target the error injection in the VM.
> 
> 
> I just tried your hwpoison program and did an RDMA migration. The migration failed, but fortunately
> the source side is still alive :).
> 
> (qemu) Failed to register chunk!: Bad address
> Chunk details: block: 0 chunk index 671 start 139955096518656 end 139955097567232 host 139955096518656 local 139954392924160 registrations: 636
> qemu-system-x86_64: cannot get lkey
> qemu-system-x86_64: rdma migration: write error! -22
> qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
> qemu-system-x86_64: failed to save SaveStateEntry with id(name): 2(ram): -22
> qemu-system-x86_64: Early error. Sending error.
> 
> 
> Since the current RDMA migration transfers guest memory in 1M chunks by default, we may need to:
> 
> option 1: reduce the chunk size to 1 page
> option 2: handle hwpoisoned chunks specially
> 
> However, because another protocol may be used instead, it's also possible to leave the issue unfixed for now.
> 
> Tested-by: Li Zhijian <lizhijian@fujitsu.com>
> 
> Thanks
> Zhijian


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error
  2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
  2023-09-20 12:11                         ` William Roche
@ 2023-09-20 23:53                         ` “William Roche
  2023-09-20 23:53                           ` [PATCH v3 1/1] migration: skip poisoned memory pages on "ram saving" phase “William Roche
  2023-10-13 15:08                           ` [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  1 sibling, 2 replies; 34+ messages in thread
From: “William Roche @ 2023-09-20 23:53 UTC (permalink / raw)
  To: qemu-devel, peterx, lizhijian
  Cc: pbonzini, quintela, leobras, joao.m.martins, lidongchen,
	william.roche

From: William Roche <william.roche@oracle.com>

A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to correct this problem, I suggest treating the poisoned pages as if
they were zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".
This fix is scripts/checkpatch.pl clean.

v2:
  - adding compressed transfer handling of poisoned pages
 
Testing: I could verify that migration now works with a poisoned page
through standard and compressed migration with 4k and large (2M) pages.

v3:
  - Included the Reviewed-by and Tested-by information
  - added a TODO comment above control_save_page()
    mentioning Zhijian's feedback about migration failure.


William Roche (1):
  migration: skip poisoned memory pages on "ram saving" phase

 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 ++++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 55 insertions(+), 3 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/1] migration: skip poisoned memory pages on "ram saving" phase
  2023-09-20 23:53                         ` [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
@ 2023-09-20 23:53                           ` “William Roche
  2023-10-13 15:08                           ` [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  1 sibling, 0 replies; 34+ messages in thread
From: “William Roche @ 2023-09-20 23:53 UTC (permalink / raw)
  To: qemu-devel, peterx, lizhijian
  Cc: pbonzini, quintela, leobras, joao.m.martins, lidongchen,
	william.roche

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.

The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page_to_file
#5  save_zero_page
#6  ram_save_target_page_legacy
#7  ram_save_host_page
#8  ram_find_and_save_block
#9  ram_save_iterate
#10 qemu_savevm_state_iterate
#11 migration_iteration_run
#12 migration_thread
#13 qemu_thread_start

Fix it by considering poisoned pages as if they were zero-pages for
the migration copy. This fix also works with underlying large pages,
taking into account the RAMBlock segment "page-size".

Standard migration and compressed transfers are handled by this code.
RDMA transfer isn't touched.

Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 ++++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index ff1578bb32..7fb13c8a56 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
+bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
+{
+    HWPoisonPage *pg;
+    ram_addr_t ram_addr = (ram_addr_t) offset;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if ((ram_addr >= pg->ram_addr) &&
+            (ram_addr - pg->ram_addr < block->page_size)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 235dc661bc..c0a31611df 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index ee9025f8e9..858688227a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -570,4 +570,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_page - indicate if the given page is poisoned
+ * @block: memory block of the given page
+ * @ram_addr: offset of the page
+ *
+ * Returns: true: page is poisoned
+ *          false: page not yet poisoned
+ */
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
 #endif
diff --git a/migration/ram-compress.c b/migration/ram-compress.c
index 06254d8c69..1916ce709d 100644
--- a/migration/ram-compress.c
+++ b/migration/ram-compress.c
@@ -34,6 +34,7 @@
 #include "qemu/error-report.h"
 #include "migration.h"
 #include "options.h"
+#include "ram.h"
 #include "io/channel-null.h"
 #include "exec/target_page.h"
 #include "exec/ramblock.h"
@@ -198,7 +199,7 @@ static CompressResult do_compress_ram_page(QEMUFile *f, z_stream *stream,
 
     assert(qemu_file_buffer_empty(f));
 
-    if (buffer_is_zero(p, page_size)) {
+    if (migration_buffer_is_zero(block, offset, page_size)) {
         return RES_ZEROPAGE;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 9040d66e61..21357666dc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1129,6 +1129,26 @@ void ram_release_page(const char *rbname, uint64_t offset)
     ram_discard_range(rbname, offset, TARGET_PAGE_SIZE);
 }
 
+/**
+ * migration_buffer_is_zero: indicate if the page at the given
+ * location is entirely filled with zero, or is a poisoned page.
+ *
+ * @block: block that contains the page
+ * @offset: offset inside the block for the page
+ * @len: size to consider
+ */
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset,
+                                     size_t len)
+{
+    uint8_t *p = block->host + offset;
+
+    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
+        return true;
+    }
+
+    return buffer_is_zero(p, len);
+}
+
 /**
  * save_zero_page_to_file: send the zero page to the file
  *
@@ -1142,10 +1162,9 @@ void ram_release_page(const char *rbname, uint64_t offset)
 static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
                                   RAMBlock *block, ram_addr_t offset)
 {
-    uint8_t *p = block->host + offset;
     int len = 0;
 
-    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
+    if (migration_buffer_is_zero(block, offset, TARGET_PAGE_SIZE)) {
         len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
         qemu_put_byte(file, 0);
         len += 1;
@@ -1182,6 +1201,7 @@ static int save_zero_page(PageSearchStatus *pss, QEMUFile *f, RAMBlock *block,
  *        > 0 - number of pages written
  *
  * Return true if the pages has been saved, otherwise false is returned.
+ * TODO: hwpoison pages fail RDMA migration, should be handled.
  */
 static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
                               ram_addr_t offset, int *pages)
diff --git a/migration/ram.h b/migration/ram.h
index 145c915ca7..805ea2a211 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 void ram_transferred_add(uint64_t bytes);
 void ram_release_page(const char *rbname, uint64_t offset);
 
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error
  2023-09-20 23:53                         ` [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
  2023-09-20 23:53                           ` [PATCH v3 1/1] migration: skip poisoned memory pages on "ram saving" phase “William Roche
@ 2023-10-13 15:08                           ` “William Roche
  2023-10-13 15:08                             ` [PATCH v4 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
  2023-10-13 15:08                             ` [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
  1 sibling, 2 replies; 34+ messages in thread
From: “William Roche @ 2023-10-13 15:08 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to thoroughly correct this problem, the poison information should
follow the migration, which presents several difficulties:
- poisoning a page on the destination machine to replicate the source
  poison requires the CAP_SYS_ADMIN privilege, and the qemu process may not
  always run as a root process
- the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
- the poison information would require a memory transfer protocol
  enhancement to provide this information
(The current patches don't provide any of that)

But if we rely on the fact that a running VM kernel correctly deals
with the memory poison it is informed about -- marking the poisoned page
as inaccessible -- we can count on the VM kernel to make sure that
poisoned pages are not used, even after a migration.
In this case, I suggest treating the poisoned pages as if they were
zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".

This leaves a case that we still have to deal with: a memory error that is
reported to qemu but not injected into the running kernel.
As the migration turns a poisoned page into an all-zero page, if
the VM kernel doesn't prevent access to this page, a memory read
that would generate a BUS_MCEERR_AR error on the source platform could
silently read zeros on the destination. This is a memory corruption.

So we have to ensure that all poisoned pages we set to zero are known by
the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
errors are ignored, which means that qemu knows about the poison but the VM
doesn't. For the moment this is only the case for ARM, but it could later
also apply to AMD VMs.
See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/

In order to avoid this possible silent data corruption situation, we should
prevent the migration when we know that a poisoned page is unknown to the VM.

This is, in my opinion, the smallest fix we need to avoid qemu crashes
on migration after a handled memory error, without introducing a possible
corruption situation.

This fix is scripts/checkpatch.pl clean.
Unit test: Migration blocking successfully tested on ARM -- an injected AO error
blocks it. On x86, the same type of error is relayed to the guest and doesn't block.

v2:
  - adding compressed transfer handling of poisoned pages

v3:
  - Included the Reviewed-by and Tested-by information on first patch
  - added a TODO comment above control_save_page()
    mentioning Zhijian's feedback about RDMA migration failure.

v4:
  - adding a patch to deal with unknown poison tracking
    (not using migrate_add_blocker as this is not device related and
    we want to avoid the interaction with the --only-migratable mechanism)


William Roche (2):
  migration: skip poisoned memory pages on "ram saving" phase
  migration: prevent migration when a poisoned page is unknown from the
    VM

 accel/kvm/kvm-all.c      | 41 +++++++++++++++++++++++++++++++++++++++-
 accel/stubs/kvm-stub.c   | 10 ++++++++++
 include/sysemu/kvm.h     | 16 ++++++++++++++++
 include/sysemu/kvm_int.h |  3 ++-
 migration/migration.c    |  6 ++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 target/arm/kvm64.c       |  6 +++++-
 target/i386/kvm/kvm.c    |  2 +-
 10 files changed, 106 insertions(+), 7 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/2] migration: skip poisoned memory pages on "ram saving" phase
  2023-10-13 15:08                           ` [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
@ 2023-10-13 15:08                             ` “William Roche
  2023-10-13 15:08                             ` [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
  1 sibling, 0 replies; 34+ messages in thread
From: “William Roche @ 2023-10-13 15:08 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.

The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page_to_file
#5  save_zero_page
#6  ram_save_target_page_legacy
#7  ram_save_host_page
#8  ram_find_and_save_block
#9  ram_save_iterate
#10 qemu_savevm_state_iterate
#11 migration_iteration_run
#12 migration_thread
#13 qemu_thread_start

Fix it by considering poisoned pages as if they were zero-pages for
the migration copy. This fix also works with underlying large pages,
taking into account the RAMBlock segment "page-size".

Standard migration and compressed transfers are handled by this code.
RDMA transfer isn't touched.

Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 ++++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 72e1d1141c..850577ea0e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1152,6 +1152,20 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
+bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
+{
+    HWPoisonPage *pg;
+    ram_addr_t ram_addr = (ram_addr_t) offset;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if ((ram_addr >= pg->ram_addr) &&
+            (ram_addr - pg->ram_addr < block->page_size)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 235dc661bc..c0a31611df 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,3 +133,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index ee9025f8e9..858688227a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -570,4 +570,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_page - indicate if the given page is poisoned
+ * @block: memory block of the given page
+ * @ram_addr: offset of the page
+ *
+ * Returns: true: page is poisoned
+ *          false: page not yet poisoned
+ */
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
 #endif
diff --git a/migration/ram-compress.c b/migration/ram-compress.c
index 06254d8c69..1916ce709d 100644
--- a/migration/ram-compress.c
+++ b/migration/ram-compress.c
@@ -34,6 +34,7 @@
 #include "qemu/error-report.h"
 #include "migration.h"
 #include "options.h"
+#include "ram.h"
 #include "io/channel-null.h"
 #include "exec/target_page.h"
 #include "exec/ramblock.h"
@@ -198,7 +199,7 @@ static CompressResult do_compress_ram_page(QEMUFile *f, z_stream *stream,
 
     assert(qemu_file_buffer_empty(f));
 
-    if (buffer_is_zero(p, page_size)) {
+    if (migration_buffer_is_zero(block, offset, page_size)) {
         return RES_ZEROPAGE;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 2f5ce4d60b..5a53802ddc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1137,6 +1137,26 @@ void ram_release_page(const char *rbname, uint64_t offset)
     ram_discard_range(rbname, offset, TARGET_PAGE_SIZE);
 }
 
+/**
+ * migration_buffer_is_zero: indicate if the page at the given
+ * location is entirely filled with zero, or is a poisoned page.
+ *
+ * @block: block that contains the page
+ * @offset: offset inside the block for the page
+ * @len: size to consider
+ */
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset,
+                                     size_t len)
+{
+    uint8_t *p = block->host + offset;
+
+    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
+        return true;
+    }
+
+    return buffer_is_zero(p, len);
+}
+
 /**
  * save_zero_page_to_file: send the zero page to the file
  *
@@ -1150,10 +1170,9 @@ void ram_release_page(const char *rbname, uint64_t offset)
 static int save_zero_page_to_file(PageSearchStatus *pss, QEMUFile *file,
                                   RAMBlock *block, ram_addr_t offset)
 {
-    uint8_t *p = block->host + offset;
     int len = 0;
 
-    if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
+    if (migration_buffer_is_zero(block, offset, TARGET_PAGE_SIZE)) {
         len += save_page_header(pss, file, block, offset | RAM_SAVE_FLAG_ZERO);
         qemu_put_byte(file, 0);
         len += 1;
@@ -1190,6 +1209,7 @@ static int save_zero_page(PageSearchStatus *pss, QEMUFile *f, RAMBlock *block,
  *        > 0 - number of pages written
  *
  * Return true if the pages has been saved, otherwise false is returned.
+ * TODO: hwpoison pages fail RDMA migration, should be handled.
  */
 static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
                               ram_addr_t offset, int *pages)
diff --git a/migration/ram.h b/migration/ram.h
index 145c915ca7..805ea2a211 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 void ram_transferred_add(uint64_t bytes);
 void ram_release_page(const char *rbname, uint64_t offset);
 
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-10-13 15:08                           ` [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  2023-10-13 15:08                             ` [PATCH v4 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
@ 2023-10-13 15:08                             ` “William Roche
  2023-10-16 16:48                               ` Peter Xu
  1 sibling, 1 reply; 34+ messages in thread
From: “William Roche @ 2023-10-13 15:08 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

Migrating a poisoned page as a zero-page can only be done when the
running guest kernel knows about this poison, so that it marks this
page as inaccessible and any access in the VM would fail.

But if the poison information is not relayed to the VM, the kernel
does not prevent access to the page. In this case, transforming a
poisoned page into a zero-page could create silent data corruption.

So we have to keep track of poisons not injected into the guest,
like the ARM VM emulation ignoring BUS_MCEERR_AO errors.
When such a page exists, the migration has to be blocked.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 27 ++++++++++++++++++++++++++-
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     |  6 ++++++
 include/sysemu/kvm_int.h |  3 ++-
 migration/migration.c    |  6 ++++++
 target/arm/kvm64.c       |  6 +++++-
 target/i386/kvm/kvm.c    |  2 +-
 7 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 850577ea0e..2829b6372a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1133,8 +1133,17 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+/*
+ * We track the poisoned pages to be able to:
+ * - replace them on VM reset
+ * - skip them when migrating
+ * - block a migration for a VM where a poisoned page is ignored
+ *   as this VM kernel (not knowing about the error) could
+ *   incorrectly access the page.
+ */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    bool       vm_known;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1166,20 +1175,36 @@ bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
     return false;
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, bool known)
 {
     HWPoisonPage *page;
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
+            if (known && !page->vm_known) {
+                page->vm_known = true;
+            }
             return;
         }
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->vm_known = known;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
+bool kvm_hwpoisoned_unknown(void)
+{
+    HWPoisonPage *pg;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if (!pg->vm_known) {
+            return true;
+        }
+    }
+    return false;
+}
+
 static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
 {
 #if HOST_BIG_ENDIAN != TARGET_BIG_ENDIAN
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index c0a31611df..c43de44263 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -138,3 +138,8 @@ bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
 {
     return false;
 }
+
+bool kvm_hwpoisoned_unknown(void)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 858688227a..37c8316ce4 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -580,4 +580,10 @@ uint32_t kvm_dirty_ring_size(void);
  *          false: page not yet poisoned
  */
 bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
+
+/**
+ * kvm_hwpoisoned_unknown - indicate if a qemu reported memory error
+ * is still unknown to (hasn't been injected into) the VM kernel.
+ */
+bool kvm_hwpoisoned_unknown(void);
 #endif
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a5b9122cb8..2dfde40690 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -136,10 +136,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @known: indicate if the error is injected to the VM kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, bool known);
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 1c6c81ad49..27e9571aaf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -66,6 +66,7 @@
 #include "sysemu/qtest.h"
 #include "options.h"
 #include "sysemu/dirtylimit.h"
+#include "sysemu/kvm.h"
 
 static NotifierList migration_state_notifiers =
     NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -1646,6 +1647,11 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
         return false;
     }
 
+    if (kvm_hwpoisoned_unknown()) {
+        error_setg(errp, "Can't migrate this vm with ignored poisoned page");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 5e95c496bb..e8db6380c1 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
@@ -1169,7 +1168,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              * called synchronously from the vCPU thread, or a bit
              * later from the main thread, so doing the injection of
              * the error would be more complicated.
+             * In this case, BUS_MCEERR_AO errors are unknown from the
+             * guest, and we will prevent migration as long as this
+             * poisoned page hasn't generated a BUS_MCEERR_AR error
+             * that the guest takes into account.
              */
+            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
             if (code == BUS_MCEERR_AR) {
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index f6c7f7e268..f9365b4457 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -649,7 +649,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, true);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-10-13 15:08                             ` [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
@ 2023-10-16 16:48                               ` Peter Xu
  2023-10-17  0:38                                 ` William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-10-16 16:48 UTC (permalink / raw)
  To: “William Roche
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

On Fri, Oct 13, 2023 at 03:08:39PM +0000, “William Roche wrote:
> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> index 5e95c496bb..e8db6380c1 100644
> --- a/target/arm/kvm64.c
> +++ b/target/arm/kvm64.c
> @@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>          ram_addr = qemu_ram_addr_from_host(addr);
>          if (ram_addr != RAM_ADDR_INVALID &&
>              kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
> -            kvm_hwpoison_page_add(ram_addr);
>              /*
>               * If this is a BUS_MCEERR_AR, we know we have been called
>               * synchronously from the vCPU thread, so we can easily
> @@ -1169,7 +1168,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>               * called synchronously from the vCPU thread, or a bit
>               * later from the main thread, so doing the injection of
>               * the error would be more complicated.
> +             * In this case, BUS_MCEERR_AO errors are unknown from the
> +             * guest, and we will prevent migration as long as this
> +             * poisoned page hasn't generated a BUS_MCEERR_AR error
> +             * that the guest takes into account.
>               */
> +            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));

I'm curious why ARM doesn't forward this event to guest even if it's AO.
X86 does it, and makes more sense to me.  Not familiar with arm, do you
know the reason?

I think this patch needs review from ARM and/or KVM side.  Do you want to
have the 1st patch merged, or rather wait for the whole set?

Another thing to mention: feel free to look at a recent addition of ioctl
from userfault, where it can inject poisoned ptes:

https://lore.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com

I'm wondering if that'll be helpful to qemu too, where we can migrate
hwpoison_page_list and enforce the poisoning on dest.  Then even for AO
when accessed by the guest it'll generate another MCE on dest.

>              if (code == BUS_MCEERR_AR) {
>                  kvm_cpu_synchronize_state(c);
>                  if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-10-16 16:48                               ` Peter Xu
@ 2023-10-17  0:38                                 ` William Roche
  2023-10-17 15:13                                   ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: William Roche @ 2023-10-17  0:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

On 10/16/23 18:48, Peter Xu wrote:
> On Fri, Oct 13, 2023 at 03:08:39PM +0000, “William Roche wrote:
>> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
>> index 5e95c496bb..e8db6380c1 100644
>> --- a/target/arm/kvm64.c
>> +++ b/target/arm/kvm64.c
>> @@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>           ram_addr = qemu_ram_addr_from_host(addr);
>>           if (ram_addr != RAM_ADDR_INVALID &&
>>               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
>> -            kvm_hwpoison_page_add(ram_addr);
>>               /*
>>                * If this is a BUS_MCEERR_AR, we know we have been called
>>                * synchronously from the vCPU thread, so we can easily
>> @@ -1169,7 +1168,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>                * called synchronously from the vCPU thread, or a bit
>>                * later from the main thread, so doing the injection of
>>                * the error would be more complicated.
>> +             * In this case, BUS_MCEERR_AO errors are unknown from the
>> +             * guest, and we will prevent migration as long as this
>> +             * poisoned page hasn't generated a BUS_MCEERR_AR error
>> +             * that the guest takes into account.
>>                */
>> +            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
> 
> I'm curious why ARM doesn't forward this event to guest even if it's AO.
> X86 does it, and makes more sense to me.

I agree that forwarding this error is the best option to implement.
But an important note about this aspect is that only the Intel architecture
handles the AO error forwarding correctly; currently an AMD VM crashes
when an AO error relay is attempted.

That's why we've submitted the following kvm patch:
https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
Among other AMD enhancements to better deal with MCE relay.


>  Not familiar with arm, do you
> know the reason?

I can't answer this question as I don't know anything about the specific
'complications' mentioned in the comment above. Maybe something around
the injection through ACPI GHES and its interrupt mechanism ??
But note also that ignoring AO errors is just a question of relying on
the Hypervisor kernel to generate an AR error when the asynchronously
poisoned page is touched later -- which can be acceptable when the
system guarantees the AR fault on the page.

> 
> I think this patch needs review from ARM and/or KVM side.  Do you want to
> have the 1st patch merged, or rather wait for the whole set?

I think that integrating the first patch alone is not an option,
as we would introduce the silent data corruption possibility I
described.  It would be better to integrate the two of them as a whole
set. But the use of the kernel feature you pointed me to could change all
of that!

> 
> Another thing to mention: feel free to look at a recent addition of ioctl
> from userfault, where it can inject poisoned ptes:
> 
> https://lore.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
> 
> I'm wondering if that'll be helpful to qemu too, where we can migrate
> hwpoison_page_list and enforce the poisoning on dest.  Then even for AO
> when accessed by guest it'll generated another MCE on dest.

I could be missing something, but yes, this is exactly how I understand
this kernel feature's use case from its description in:
https://lore.kernel.org/all/20230707215540.2324998-5-axelrasmussen@google.com/

  vvvvvv
So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
   either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
   communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
   so any future accesses will SIGBUS. Because the pte is now "present",
   future accesses won't generate more userfaultfd events, they'll just
   SIGBUS directly.
  ^^^^^^

Thank you for letting me know about this kernel functionality.

I need to take some time to investigate it, to see how I could use it.

The solution I'm suggesting here doesn't cover as many cases as the
UFFDIO_POISON use could help to implement.
But it gives us a possibility to live migrate VMs that already
experienced memory errors, trusting the VM kernel to correctly deal with
these past errors.

AFAIK, currently, a standard qemu VM that has experienced a memory error
can't be live migrated at all.

Please correct me if I'm wrong.
Thanks again.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-10-17  0:38                                 ` William Roche
@ 2023-10-17 15:13                                   ` Peter Xu
  2023-11-06 21:38                                     ` William Roche
  2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2023-10-17 15:13 UTC (permalink / raw)
  To: William Roche
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

On Tue, Oct 17, 2023 at 02:38:48AM +0200, William Roche wrote:
> On 10/16/23 18:48, Peter Xu wrote:
> > On Fri, Oct 13, 2023 at 03:08:39PM +0000, “William Roche wrote:
> > > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> > > index 5e95c496bb..e8db6380c1 100644
> > > --- a/target/arm/kvm64.c
> > > +++ b/target/arm/kvm64.c
> > > @@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> > >           ram_addr = qemu_ram_addr_from_host(addr);
> > >           if (ram_addr != RAM_ADDR_INVALID &&
> > >               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
> > > -            kvm_hwpoison_page_add(ram_addr);
> > >               /*
> > >                * If this is a BUS_MCEERR_AR, we know we have been called
> > >                * synchronously from the vCPU thread, so we can easily
> > > @@ -1169,7 +1168,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> > >                * called synchronously from the vCPU thread, or a bit
> > >                * later from the main thread, so doing the injection of
> > >                * the error would be more complicated.
> > > +             * In this case, BUS_MCEERR_AO errors are unknown from the
> > > +             * guest, and we will prevent migration as long as this
> > > +             * poisoned page hasn't generated a BUS_MCEERR_AR error
> > > +             * that the guest takes into account.
> > >                */
> > > +            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
> > 
> > I'm curious why ARM doesn't forward this event to guest even if it's AO.
> > X86 does it, and makes more sense to me.
> 
> I agree that forwarding this error is the best option to implement.
> But an important note about this aspect  is that only Intel architecture
> handles the AO error forwarding correctly; currently an AMD VM crashes
> when an AO error relay is attempted.
> 
> That's why we've submitted the following kvm patch:
> https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
> Among other AMD enhancements to better deal with MCE relay.

I see.

> 
> 
> >  Not familiar with arm, do you
> > know the reason?
> 
> I can't answer this question as I don't know anything about the specific
> 'complications' mentioned in the comment above. Maybe something around
> the injection through ACPI GHES and its interrupt mechanism ??
> But note also that ignoring AO errors is just a question of relying on
> the Hypervisor kernel to generate an AR error when the asynchronously
> poisoned page is touched later. Which can be acceptable -- when the
> system guarantees the AR fault on the page.
> 
> > 
> > I think this patch needs review from ARM and/or KVM side.  Do you want to
> > have the 1st patch merged, or rather wait for the whole set?
> 
> I think that integrating the first patch alone is not an option
> as we would introduce the silent data corruption possibility I
> described.

I asked because I think patch 1 itself is still an improvement, which
avoids the src VM crashing when hitting poisoned pages.  Especially IIUC
on some arch (Intel?) it's a complete fix.

But for sure we can keep them as a whole series if you want, but then it
would be good to add some more reviewers; at least some ARM/AMD developers,
perhaps.

> It would be better to integrate the two of them as a whole
> set. But the use of the kernel feature you indicated me can change all
> of that !
> 
> > 
> > Another thing to mention: feel free to look at a recent addition of ioctl
> > from userfault, where it can inject poisoned ptes:
> > 
> > https://lore.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
> > 
> > I'm wondering if that'll be helpful to qemu too, where we can migrate
> > hwpoison_page_list and enforce the poisoning on dest.  Then even for AO
> when accessed by the guest it'll generate another MCE on dest.
> 
> I could be missing something, but Yes, this is exactly how I understand
> this kernel feature use case with its description in:
> https://lore.kernel.org/all/20230707215540.2324998-5-axelrasmussen@google.com/
> 
>  vvvvvv
> So the basic way to use this new feature is:
> 
> - On the new host, the guest's memory is registered with userfaultfd, in
>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> - On any first access, we get a userfaultfd event. At this point we can
>   communicate with the old host to find out if the page was poisoned.
> - If so, we can respond with a UFFDIO_POISON - this places a swap marker
>   so any future accesses will SIGBUS. Because the pte is now "present",
>   future accesses won't generate more userfaultfd events, they'll just
>   SIGBUS directly.
>  ^^^^^^
> 
> Thank you for letting me know about this kernel functionality.
> 
> I need to take some time to investigate it, to see how I could use it.

One more hint, please double check though: in QEMU's use case (e.g. precopy
only, while not using postcopy) I think you may even be able to install the
poisoned pte without MISSING (or any other uffd) mode registered.

You can try creating one uffd descriptor (which will bind the desc with the
current mm context; in this case we need it to happen only on dest qemu),
then try injecting poison ptes anywhere in the guest address ranges.
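
For illustration, a rough and untested sketch of such an injection (the
ioctl and structure names are taken from the UFFDIO_POISON series above,
so please double check them against what finally got merged; the MISSING
mode registration is kept here even though, as said, it may not be
needed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int uffd = -1;

/*
 * Mark the page-aligned range [addr, addr + len) as poisoned, so that
 * any later access raises SIGBUS instead of silently reading zeros.
 */
static int poison_range(void *addr, size_t len)
{
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    struct uffdio_poison poison = {
        .range = { .start = (unsigned long)addr, .len = len },
    };

    if (uffd < 0) {
        /* One descriptor per process; dest qemu would keep it open. */
        struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_POISON, /* assumed feature flag */
        };

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0) {
            perror("userfaultfd");
            return -1;
        }
        if (ioctl(uffd, UFFDIO_API, &api) < 0) {
            perror("UFFDIO_API");
            close(uffd);
            uffd = -1;
            return -1;
        }
    }

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ||
        ioctl(uffd, UFFDIO_POISON, &poison) < 0) {
        perror("uffd register/poison");
        return -1;
    }
    return 0;
}
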

> 
> The solution I'm suggesting here doesn't cover as many cases as the
> UFFDIO_POISON use could help to implement.
> But it gives us a possibility to live migrate VMs that already
> experienced memory errors, trusting the VM kernel to correctly deal with
> these past errors.
> 
> AFAIK, currently, a standard qemu VM that has experienced a memory error
> can't be live migrated at all.

I suppose here you meant AO errors only.

IIUC the major issue regarding migration is AO errors will become ARs on
src qemu when vcpu accessed, which means AOs are all fine if not forwarded
to guest.  However after migration that is not guaranteed.  Poisoned ptes
properly installed on dest basically grants QEMU the ability to "migrate a
poisoned page", meanwhile without really wasting a physical page on dest,
making sure those AO error addrs keep generating ARs even after migration.

It seems the 1st patch is still needed even in this case?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-10-17 15:13                                   ` Peter Xu
@ 2023-11-06 21:38                                     ` William Roche
  2023-11-08 21:45                                       ` Peter Xu
  2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  1 sibling, 1 reply; 34+ messages in thread
From: William Roche @ 2023-11-06 21:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

[-- Attachment #1: Type: text/plain, Size: 10605 bytes --]

On 10/17/23 17:13, Peter Xu wrote:

> On Tue, Oct 17, 2023 at 02:38:48AM +0200, William Roche wrote:
>> On 10/16/23 18:48, Peter Xu wrote:
>>> On Fri, Oct 13, 2023 at 03:08:39PM +0000, “William Roche wrote:
>>>> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
>>>> index 5e95c496bb..e8db6380c1 100644
>>>> --- a/target/arm/kvm64.c
>>>> +++ b/target/arm/kvm64.c
>>>> @@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>>>            ram_addr = qemu_ram_addr_from_host(addr);
>>>>            if (ram_addr != RAM_ADDR_INVALID &&
>>>>                kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
>>>> -            kvm_hwpoison_page_add(ram_addr);
>>>>                /*
>>>>                 * If this is a BUS_MCEERR_AR, we know we have been called
>>>>                 * synchronously from the vCPU thread, so we can easily
>>>> @@ -1169,7 +1168,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>>>                 * called synchronously from the vCPU thread, or a bit
>>>>                 * later from the main thread, so doing the injection of
>>>>                 * the error would be more complicated.
>>>> +             * In this case, BUS_MCEERR_AO errors are unknown from the
>>>> +             * guest, and we will prevent migration as long as this
>>>> +             * poisoned page hasn't generated a BUS_MCEERR_AR error
>>>> +             * that the guest takes into account.
>>>>                 */
>>>> +            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
>>> I'm curious why ARM doesn't forward this event to guest even if it's AO.
>>> X86 does it, and makes more sense to me.
>> I agree that forwarding this error is the best option to implement.
>> But an important note about this aspect  is that only Intel architecture
>> handles the AO error forwarding correctly; currently an AMD VM crashes
>> when an AO error relay is attempted.
>>
>> That's why we've submitted the following kvm patch:
>> https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
>> Among other AMD enhancements to better deal with MCE relay.
> I see.
>
>>
>>>   Not familiar with arm, do you
>>> know the reason?
>> I can't answer this question as I don't know anything about the specific
>> 'complications' mentioned in the comment above. Maybe something around
>> the injection through ACPI GHES and its interrupt mechanism ??
>> But note also that ignoring AO errors is just a question of relying on
>> the Hypervisor kernel to generate an AR error when the asynchronously
>> poisoned page is touched later. Which can be acceptable -- when the
>> system guaranties the AR fault on the page.
>>
>>> I think this patch needs review from ARM and/or KVM side.  Do you want to
>>> have the 1st patch merged, or rather wait for the whole set?
>> I think that integrating the first patch alone is not an option
>> as we would introduce the silent data corruption possibility I
>> described.
> I asked because I think patch 1 itself is still an improvement, which
> avoids src VM from crashing when hitting poisoned pages.  Especially IIUC
> on some arch (Intel?) it's a complete fix.

Yes, this is almost true: in my opinion, this fix would be a transitional
solution - a small code change to allow a VM live migration after a
memory error. This change is only needed on the source machine, with
no change necessary on the destination machine.
But let me just repeat that this fix relies on trusting the VM kernel to
correctly deal with the memory errors it knows about, in order to avoid a
memory corruption!

Note also that large pages are taken into account too for our live
migration, but the poisoning of a qemu large page requires more work,
especially for a VM using standard 4k pages on top of these qemu large
pages -- and this is a completely different issue. I'm mentioning this
aspect here because even on Intel platforms, underlying large page
poisoning needs to be reported better to the running VM, as a large
section of its memory is gone (not just a single head 4k page), and
adding live migration to this problem will not make things any better...

> But for sure we can keep them as a whole series if you want, but then it'll
> be good you add some more reviewers; at least some ARM/AMD developers,
> perhaps.

I'll add qemu-arm@nongnu.org to the CC list for the updated version I'm
going to send, with a word about the ARM specificity of the second patch.

>> It would be better to integrate the two of them as a whole
>> set. But the use of the kernel feature you indicated me can change all
>> of that !
>>
>>> Another thing to mention: feel free to look at a recent addition of ioctl
>>> from userfault, where it can inject poisoned ptes:
>>>
>>> https://lore.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
>>>
>>> I'm wondering if that'll be helpful to qemu too, where we can migrate
>>> hwpoison_page_list and enforce the poisoning on dest.  Then even for AO
>>> when accessed by guest it'll generated another MCE on dest.
>> I could be missing something, but Yes, this is exactly how I understand
>> this kernel feature use case with its description in:
>> https://lore.kernel.org/all/20230707215540.2324998-5-axelrasmussen@google.com/
>>
>>   vvvvvv
>> So the basic way to use this new feature is:
>>
>> - On the new host, the guest's memory is registered with userfaultfd, in
>>    either MISSING or MINOR mode (doesn't really matter for this purpose).
>> - On any first access, we get a userfaultfd event. At this point we can
>>    communicate with the old host to find out if the page was poisoned.
>> - If so, we can respond with a UFFDIO_POISON - this places a swap marker
>>    so any future accesses will SIGBUS. Because the pte is now "present",
>>    future accesses won't generate more userfaultfd events, they'll just
>>    SIGBUS directly.
>>   ^^^^^^
>>
>> Thank you for letting me know about this kernel functionality.
>>
>> I need to take some time to investigate it, to see how I could use it.
> One more hint, please double check though: in QEMU's use case (e.g. precopy
> only, while not using postcopy) I think you may even be able to install the
> poisoned pte without MISSING (or any other uffd) mode registered.
>
> You can try creating one uffd descriptor (which will bind the desc with the
> current mm context; in this case we need it to happen only on dest qemu),
> then try injecting poison ptes anywhere in the guest address ranges.
I did that in a self-contained test program: memory allocation,
UFFDIO_REGISTER and use of UFFDIO_POISON.
The register mode has to be given, but MISSING or WP both work. This gives
the possibility to inject poison in a much easier and better way than using
madvise(... MADV_HWPOISON, ...) for example.
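
For illustration, a minimal sketch of such a test program could look like
the following (this is not the exact program used here; it assumes a kernel
recent enough to provide UFFDIO_POISON, and error handling is reduced to
the bare minimum):

/*
 * Minimal sketch: poison one anonymous page with UFFDIO_POISON and show
 * that touching it raises SIGBUS.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);
    char *mem = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); exit(1); }

    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) { perror("userfaultfd"); exit(1); }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) { perror("UFFDIO_API"); exit(1); }

    /* A registration mode must be given; MISSING or WP both work here. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)mem, .len = psize },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        perror("UFFDIO_REGISTER"); exit(1);
    }

    /* Install the poison marker: any later access will SIGBUS directly. */
    struct uffdio_poison poison = {
        .range = { .start = (unsigned long)mem, .len = psize },
    };
    if (ioctl(uffd, UFFDIO_POISON, &poison) < 0) {
        perror("UFFDIO_POISON"); exit(1);
    }

    printf("touching the poisoned page, expecting SIGBUS...\n");
    mem[0] = 1;   /* never returns normally past this point */
    return 0;
}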

But it implies a lot of other changes:
     - The source has to flag the error pages to indicate a poison
       (new flag in the exchange protocol)
     - The destination has to be able to deal with the new protocol
     - The destination has to be able to mark the pages as poisoned
       (authorized to use userfaultfd)
     - So both source and destination have to be upgraded (of course
       qemu, but also an appropriate kernel version providing
       UFFDIO_POISON on the destination)
     - We may need to be able to negotiate a fallback solution
     - An indication of the method to use could belong to the
       migration capabilities and parameters
     - etc...

>> The solution I'm suggesting here doesn't cover as many cases as the
>> UFFDIO_POISON use could help to implement.
>> But it gives us a possibility to live migrate VMs that already
>> experienced memory errors, trusting the VM kernel to correctly deal with
>> these past errors.
>>
>> AFAIK, currently, a standard qemu VM that has experienced a memory error
>> can't be live migrated at all.
> I suppose here you meant AO errors only.
No, if any of the memory used by a VM has been impacted by a memory error
(either with BUS_MCEERR_AO or BUS_MCEERR_AR), this memory isn't accessible
anymore, and the live migration (whatever mechanism is used) can't read the
content of the impacted location. So AFAIK no currently used mechanism
works.
When we have such an error, either the migration fails (like RDMA currently
does) or it completely crashes qemu when the migration is attempted.

> IIUC the major issue regarding migration is AO errors will become ARs on
> src qemu when vcpu accessed,
This is correct.

>   which means AOs are all fine if not forwarded
> to guest.
You are right in the case where the VM stays on the source machine.
With my currently proposed fix we don't forward the poison to the
destination machine, so the problem is that we are not able to access the
content of these Uncorrected Error memory locations -- which means that if
this content is needed we have to inform the requester that the data is
inaccessible -- that's what the poison is for, and we count on the
running VM kernel to enforce the poisoning.

And if the AO error hasn't been reported to the running VM kernel,
we must either forward the poison to the destination machine or
prevent the live migration. (That's what the second patch does for
the platforms ignoring AO errors - currently only ARM.)

>    However after migration that is not guaranteed.  Poisoned ptes
> properly installed on dest basically grants QEMU the ability to "migrate a
> poisoned page", meanwhile without really wasting a physical page on dest,
> making sure those AO error addrs keep generating ARs even after migration.
Absolutely, this is the huge advantage of such a solution.

> It seems the 1st patch is still needed even in this case?
If we can transfer a poison to the destination machine, there is no
need for the 1st patch (transforming poisoned pages into zero pages).
That's the reason why I do think that enhancing both the source qemu
and the destination qemu to deal with poisoned pages is the real
(long term) fix.
In the meantime, we could use this current small set of 2 patches to
avoid the qemu crashes on live migration after a memory fault.

I hope this clarifies the situation, and the reason why I'd prefer
the two patches to be integrated together.

I've updated the code to the latest source tree (resolving conflicts
with 8697eb857769 and 72a8192e225c) and I'm sending a v5 with this
update, adapting the commit message to reflect the new stack trace
on the crash.
I also re-ran my migration tests, with and without compression,
on ARM and x86 platforms.

I hope this can help.


[-- Attachment #2: Type: text/html, Size: 14984 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error
  2023-10-17 15:13                                   ` Peter Xu
  2023-11-06 21:38                                     ` William Roche
@ 2023-11-06 22:03                                     ` “William Roche
  2023-11-06 22:03                                       ` [PATCH v5 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
                                                         ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: “William Roche @ 2023-11-06 22:03 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>


Note about ARM specificities:
This code has a small part more specifically impacting ARM machines;
that's the reason why I added qemu-arm@nongnu.org -- see description.


A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to thoroughly correct this problem, the poison information should
follow the migration, which presents several difficulties:
- poisoning a page on the destination machine to replicate the source
  poison requires CAP_SYS_ADMIN privileges, and the qemu process may not
  always run as a root process
- the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
- the poison information would require a memory transfer protocol
  enhancement to provide this information
(The current patches don't provide any of that)

But if we rely on the fact that a running VM kernel correctly deals
with the memory poison it is informed about (marking the poisoned page
as inaccessible), we can count on the VM kernel to make sure that
poisoned pages are not used, even after a migration.
In this case, I suggest treating the poisoned pages as if they were
zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".

Now, this leaves a case that we have to deal with: a memory error
reported to qemu but not injected into the running kernel...
As the migration will turn a poisoned page into an all-zero page, if
the VM kernel doesn't prevent access to this page, a memory read
that would generate a BUS_MCEERR_AR error on the source platform could
be reading zeros on the destination. This is a memory corruption.

So we have to ensure that all poisoned pages we set to zero are known by
the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
errors are ignored, which means that qemu knows about the poison but the VM
doesn't. For the moment this is only the case for ARM, but it could later
also apply to AMD VMs.
See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/

In order to avoid this possible silent data corruption situation, we should
prevent the migration when we know that a poisoned page is ignored by the VM.

This is, in my opinion, the smallest fix we need to avoid qemu crashes on
migration after a handled memory error, without introducing a possible
corruption situation.

This fix is scripts/checkpatch.pl clean.
Unit test: Migration blocking successfully tested on ARM -- an injected AO
error blocks it. On x86 the same type of error, being relayed to the VM,
doesn't block.

v2:
  - adding compressed transfer handling of poisoned pages

v3:
  - Included the Reviewed-by and Tested-by information on first patch
  - added a TODO comment above control_save_page()
    mentioning Zhijian's feedback about RDMA migration failure.

v4:
  - adding a patch to deal with unknown poison tracking (impacting ARM)
    (not using migrate_add_blocker as this is not devices related and
    we want to avoid the interaction with --only-migratable mechanism)

v5:
  - Updating the code to the latest version
  - adding qemu-arm@nongnu.org for a complementary review


William Roche (2):
  migration: skip poisoned memory pages on "ram saving" phase
  migration: prevent migration when a poisoned page is unknown from the
    VM

 accel/kvm/kvm-all.c      | 41 +++++++++++++++++++++++++++++++++++++++-
 accel/stubs/kvm-stub.c   | 10 ++++++++++
 include/sysemu/kvm.h     | 16 ++++++++++++++++
 include/sysemu/kvm_int.h |  3 ++-
 migration/migration.c    |  6 ++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 target/arm/kvm64.c       |  6 +++++-
 target/i386/kvm/kvm.c    |  2 +-
 10 files changed, 106 insertions(+), 7 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v5 1/2] migration: skip poisoned memory pages on "ram saving" phase
  2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
@ 2023-11-06 22:03                                       ` “William Roche
  2023-11-06 22:03                                       ` [PATCH v5 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
  2023-11-08 21:49                                       ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error Peter Xu
  2 siblings, 0 replies; 34+ messages in thread
From: “William Roche @ 2023-11-06 22:03 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.

The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page
#5  ram_save_target_page_legacy
#6  ram_save_host_page
#7  ram_find_and_save_block
#8  ram_save_iterate
#9  qemu_savevm_state_iterate
#10 migration_iteration_run
#11 migration_thread
#12 qemu_thread_start

Fix it by considering poisoned pages as if they were zero-pages for
the migration copy. This fix also works with underlying large pages,
taking into account the RAMBlock segment "page-size".

Standard migration and compressed transfers are handled by this code.
RDMA transfer isn't touched.

Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 ++++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e39a810a4e..64c0b37823 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1149,6 +1149,20 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
+bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
+{
+    HWPoisonPage *pg;
+    ram_addr_t ram_addr = (ram_addr_t) offset;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if ((ram_addr >= pg->ram_addr) &&
+            (ram_addr - pg->ram_addr < block->page_size)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 1b37d9a302..17774fa5ef 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -124,3 +124,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 80b69d88f6..66937f9dfe 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -539,4 +539,14 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_page - indicate if the given page is poisoned
+ * @block: memory block of the given page
+ * @ram_addr: offset of the page
+ *
+ * Returns: true: page is poisoned
+ *          false: page not yet poisoned
+ */
+bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
 #endif
diff --git a/migration/ram-compress.c b/migration/ram-compress.c
index fa4388f6a6..a7772a08a2 100644
--- a/migration/ram-compress.c
+++ b/migration/ram-compress.c
@@ -35,6 +35,7 @@
 #include "qemu/stats64.h"
 #include "migration.h"
 #include "options.h"
+#include "ram.h"
 #include "io/channel-null.h"
 #include "exec/target_page.h"
 #include "exec/ramblock.h"
@@ -214,7 +215,7 @@ static CompressResult do_compress_ram_page(QEMUFile *f, z_stream *stream,
 
     assert(qemu_file_buffer_empty(f));
 
-    if (buffer_is_zero(p, page_size)) {
+    if (migration_buffer_is_zero(block, offset, page_size)) {
         return RES_ZEROPAGE;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 8c7886ab79..5fd4d27854 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1107,6 +1107,26 @@ void ram_release_page(const char *rbname, uint64_t offset)
     ram_discard_range(rbname, offset, TARGET_PAGE_SIZE);
 }
 
+/**
+ * migration_buffer_is_zero: indicate if the page at the given
+ * location is entirely filled with zero, or is a poisoned page.
+ *
+ * @block: block that contains the page
+ * @offset: offset inside the block for the page
+ * @len: size to consider
+ */
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset,
+                                     size_t len)
+{
+    uint8_t *p = block->host + offset;
+
+    if (kvm_enabled() && kvm_hwpoisoned_page(block, (void *)offset)) {
+        return true;
+    }
+
+    return buffer_is_zero(p, len);
+}
+
 /**
  * save_zero_page: send the zero page to the stream
  *
@@ -1119,11 +1139,10 @@ void ram_release_page(const char *rbname, uint64_t offset)
 static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
                           ram_addr_t offset)
 {
-    uint8_t *p = pss->block->host + offset;
     QEMUFile *file = pss->pss_channel;
     int len = 0;
 
-    if (!buffer_is_zero(p, TARGET_PAGE_SIZE)) {
+    if (!migration_buffer_is_zero(pss->block, offset, TARGET_PAGE_SIZE)) {
         return 0;
     }
 
@@ -1154,6 +1173,7 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
  *        > 0 - number of pages written
  *
  * Return true if the pages has been saved, otherwise false is returned.
+ * TODO: hwpoison pages fail RDMA migration, should be handled.
  */
 static bool control_save_page(PageSearchStatus *pss,
                               ram_addr_t offset, int *pages)
diff --git a/migration/ram.h b/migration/ram.h
index 9b937a446b..d34ba79d36 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ void ram_handle_zero(void *host, uint64_t size);
 void ram_transferred_add(uint64_t bytes);
 void ram_release_page(const char *rbname, uint64_t offset);
 
+bool migration_buffer_is_zero(RAMBlock *block, ram_addr_t offset, size_t len);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  2023-11-06 22:03                                       ` [PATCH v5 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
@ 2023-11-06 22:03                                       ` “William Roche
  2023-11-08 21:49                                       ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error Peter Xu
  2 siblings, 0 replies; 34+ messages in thread
From: “William Roche @ 2023-11-06 22:03 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, peterx
  Cc: lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

Migrating a poisoned page as a zero-page can only be done when the
running guest kernel knows about this poison, so that it marks this
page as inaccessible and any access in the VM would fail.

But if a poison information is not relayed to the VM, the kernel
does not prevent its access. In this case, transforming a poisoned
page into a zero-page could create a case of silent data corruption.

So we have to keep track of poisons not injected into the guest,
like the ARM VM emulation ignoring BUS_MCEERR_AO errors.
When such a page exists, the migration has to be blocked.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 27 ++++++++++++++++++++++++++-
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     |  6 ++++++
 include/sysemu/kvm_int.h |  3 ++-
 migration/migration.c    |  6 ++++++
 target/arm/kvm64.c       |  6 +++++-
 target/i386/kvm/kvm.c    |  2 +-
 7 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 64c0b37823..59af34f5a6 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1130,8 +1130,17 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+/*
+ * We track the poisoned pages to be able to:
+ * - replace them on VM reset
+ * - skip them when migrating
+ * - block a migration for a VM where a poisoned page is ignored
+ *   as this VM kernel (not knowing about the error) could
+ *   incorrectly access the page.
+ */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    bool       vm_known;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1163,20 +1172,36 @@ bool kvm_hwpoisoned_page(RAMBlock *block, void *offset)
     return false;
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, bool known)
 {
     HWPoisonPage *page;
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
+            if (known && !page->vm_known) {
+                page->vm_known = true;
+            }
             return;
         }
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->vm_known = known;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
+bool kvm_hwpoisoned_unknown(void)
+{
+    HWPoisonPage *pg;
+
+    QLIST_FOREACH(pg, &hwpoison_page_list, list) {
+        if (!pg->vm_known) {
+            return true;
+        }
+    }
+    return false;
+}
+
 static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
 {
 #if HOST_BIG_ENDIAN != TARGET_BIG_ENDIAN
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 17774fa5ef..3c914b5b65 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -129,3 +129,8 @@ bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr)
 {
     return false;
 }
+
+bool kvm_hwpoisoned_unknown(void)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 66937f9dfe..37d66ac614 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -549,4 +549,10 @@ uint32_t kvm_dirty_ring_size(void);
  *          false: page not yet poisoned
  */
 bool kvm_hwpoisoned_page(RAMBlock *block, void *ram_addr);
+
+/**
+ * kvm_hwpoisoned_unknown - indicate if a qemu reported memory error
+ * is still unknown to (hasn't been injected into) the VM kernel.
+ */
+bool kvm_hwpoisoned_unknown(void);
 #endif
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index fd846394be..fd0a32c34a 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -132,10 +132,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @known: indicate if the error is injected to the VM kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, bool known);
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 28a34c9068..63cb2c80db 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -67,6 +67,7 @@
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "qemu/sockets.h"
+#include "sysemu/kvm.h"
 
 static NotifierList migration_state_notifiers =
     NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -1892,6 +1893,11 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
         return false;
     }
 
+    if (kvm_hwpoisoned_unknown()) {
+        error_setg(errp, "Can't migrate this vm with ignored poisoned page");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 3c175c93a7..5dea8051f1 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -1101,7 +1101,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
@@ -1112,7 +1111,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              * called synchronously from the vCPU thread, or a bit
              * later from the main thread, so doing the injection of
              * the error would be more complicated.
+             * In this case, BUS_MCEERR_AO errors are unknown from the
+             * guest, and we will prevent migration as long as this
+             * poisoned page hasn't generated a BUS_MCEERR_AR error
+             * that the guest takes into account.
              */
+            kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
             if (code == BUS_MCEERR_AR) {
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 770e81d56e..08410185a6 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -642,7 +642,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, true);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-11-06 21:38                                     ` William Roche
@ 2023-11-08 21:45                                       ` Peter Xu
  2023-11-10 19:22                                         ` William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-08 21:45 UTC (permalink / raw)
  To: William Roche
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

On Mon, Nov 06, 2023 at 10:38:14PM +0100, William Roche wrote:
> Note also that large pages are taken into account too for our live
> migration, but the poisoning of a qemu large page requires more work
> especially for VM using standard 4k pages on top of these qemu large
> pages -- and this is a completely different issue. I'm mentioning this
> aspect here because even on Intel platforms, underlying large pages
> poisoning needs to be reported better to the running VM as a large
> section of its memory is gone (not just a single head 4k page), and
> adding live migration to this problem will not make things any better...

Good point.. Yes, huge poisoned pages seem all broken.

> I did that in a self content test program: memory allocation,
> UFFDIO_REGISTER and use of UFFDIO_POISON.  The register mode has to be
> given but MISSING or WP both works. This gives the possibility to inject
> poison in a much easier and better way than using
> madvise(... MADV_HWPOISON, ...) for example.

Indeed, I should have left a comment had I noticed that when reviewing the
POISON changes; I overlooked that find_dst_vma(), even named like that,
will check that the vma uffd context exists.  That shouldn't really be
necessary for UFFDIO_POISON.

I can consider proposing a patch to allow that, which should be
trivial.. but it won't help with old kernels, so QEMU may still be better
off always registering, to make it always work as long as
UFFD_FEATURE_POISON is reported.. sad.

> 
> But it implies a lot of other changes:
>     - The source has to flag the error pages to indicate a poison
>       (new flag in the exchange protocole)
>     - The destination has to be able to deal with the new protocole

IIUC these two can be simply implemented by migrating hwpoison_page_list
over to dest.  You need to have a compat bit for doing this, ignoring the
list on old machine types, because old QEMUs will not recognize this vmsd.

QEMU should even support migrating a list object in VMSD, feel free to have
a look at VMSTATE_QLIST_V().
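
As a rough sketch of that direction (HWPoisonState, hwpoison_list_needed()
and the section name below are made up for illustration; only
VMSTATE_QLIST_V() and the HWPoisonPage layout come from the existing code,
with ram_addr assumed to travel as a uint64_t on the wire):

/* Sketch only, not a tested patch. */
typedef struct HWPoisonState {
    QLIST_HEAD(, HWPoisonPage) pages;
} HWPoisonState;

static const VMStateDescription vmstate_hwpoison_page = {
    .name = "hwpoison/page",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT64(ram_addr, HWPoisonPage),
        VMSTATE_END_OF_LIST()
    },
};

static const VMStateDescription vmstate_hwpoison = {
    .name = "hwpoison",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = hwpoison_list_needed,  /* compat bit: false on old machine types */
    .fields = (VMStateField[]) {
        VMSTATE_QLIST_V(pages, HWPoisonState, 1,
                        vmstate_hwpoison_page, HWPoisonPage, list),
        VMSTATE_END_OF_LIST()
    },
};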

>     - The destination has to be able to mark the pages as poisoned
>       (authorized to use userfaultfd)

Note: userfaultfd is actually available without any privilege when using
UFFDIO_POISON only, as long as the uffd is opened (either via the syscall or
/dev/userfaultfd) with UFFD_USER_MODE_ONLY.

A trick is that we can register with UFFD_WP mode (not MISSING, because with
USER_MODE_ONLY a kernel access to a missing page would then cause a SIGBUS),
then inject whatever POISON we want.  As long as UFFDIO_WRITEPROTECT is not
invoked, UFFD_WP does nothing (unlike MISSING).
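
A minimal sketch of that setup (assuming a kernel providing
UFFD_USER_MODE_ONLY and UFFDIO_POISON; the helper name and its arguments
are illustrative only, not existing QEMU code):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Open an unprivileged uffd and cover a RAM range with a WP-mode
 * registration, so UFFDIO_POISON can then be used on it. */
static int open_poison_uffd(void *host_start, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd,
                       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
    if (uffd < 0) {
        return -1;
    }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        close(uffd);
        return -1;
    }

    /* WP mode stays inert as long as UFFDIO_WRITEPROTECT is never called. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)host_start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }

    return uffd; /* ready for UFFDIO_POISON anywhere in the range */
}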

>     - So both source and destination have to be upgraded (of course
>       qemu but also an appropriate kernel version providing
>       UFFDIO_POISON on the destination)

True.  Unfortunately this is not avoidable.

>     - we may need to be able to negotiate a fall back solution
>     - an indication of the method to use could belong to the
>       migration capabilities and parameters

For the above two points: it's a common issue with migration compatibility.
As long as you can provide the above VMSD to migrate hwpoison_page_list,
marking all old QEMU machine types as skipping it, it should just work.

You can have a closer look at anything in hw_compat_* as an example.
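
For illustration only, such an entry would follow the usual GlobalProperty
shape found in those arrays (the array name and the
"x-migrate-hwpoison-list" property name below are invented for this sketch):

/* Sketch: appended to the hw_compat_* array of the last machine-type
 * version released before the change, so old machine types keep the
 * old behavior. */
GlobalProperty hw_compat_example[] = {
    { "migration", "x-migrate-hwpoison-list", "off" },
};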

>     - etc...

I think you did summarize mostly all the points I can think of; is there
really anything more? :)

It'll be great if you can, or plan to, fix that for good.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error
  2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
  2023-11-06 22:03                                       ` [PATCH v5 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
  2023-11-06 22:03                                       ` [PATCH v5 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
@ 2023-11-08 21:49                                       ` Peter Xu
  2024-01-30 19:06                                         ` [PATCH v1 0/1] " “William Roche
  2 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2023-11-08 21:49 UTC (permalink / raw)
  To: “William Roche
  Cc: qemu-devel, qemu-arm, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

On Mon, Nov 06, 2023 at 10:03:17PM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> 
> Note about ARM specificities:
> This code has a small part impacting more specificaly ARM machines,
> that's the reason why I added qemu-arm@nongnu.org -- see description.
> 
> 
> A Qemu VM can survive a memory error, as qemu can relay the error to the
> VM kernel which could also deal with it -- poisoning/off-lining the impacted
> page.
> This situation creates a hole in the VM memory address space that the VM kernel
> knows about (an unreadable page or set of pages).
> 
> But the migration of this VM (live migration through the network or
> pseudo-migration with the creation of a state file) will crash Qemu when
> it sequentially reads the memory address space and stumbles on the
> existing hole.
> 
> In order to thoroughly correct this problem, the poison information should
> follow the migration which represents several difficulties:
> - poisoning a page on the destination machine to replicate the source
>   poison requires CAP_SYS_ADMIN priviledges, and qemu process may not
>   always run as a root process
> - the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
> - the poison information would require a memory transfer protocol
>   enhancement to provide this information
> (The current patches don't provide any of that)
> 
> But if we rely on the fact that the a running VM kernel is correctly
> dealing with memory poison it is informed about: marking the poison page
> as inaccessible, we could count on the VM kernel to make sure that
> poisoned pages are not used, even after a migration.
> In this case, I suggest to treat the poisoned pages as if they were
> zero-pages for the migration copy.
> This fix also works with underlying large pages, taking into account the
> RAMBlock segment "page-size".
> 
> Now, it leaves a case that we have to deal with: if a memory error is
> reported to qemu but not injected into the running kernel...
> As the migration will go from a poisoned page to an all-zero page, if
> the VM kernel doesn't prevent the access to this page, a memory read
> that would generate a BUS_MCEERR_AR error on the source platform, could
> be reading zeros on the destination. This is a memory corruption.
> 
> So we have to ensure that all poisoned pages we set to zero are known by
> the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
> errors are ignored, which means that qemu knows about the poison but the VM
> doesn't. For the moment it's only the case for ARM, but could later be
> also needed for AMD VMs.
> See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
> 
> In order to avoid this possible silent data corruption situation, we should
> prevent the migration when we know that a poisoned page is ignored from the VM.
> 
> Which is, according to me, the smallest fix we need  to avoid qemu crashes
> on migration after an handled memory error, without introducing a possible
> corruption situation.
> 
> This fix is scripts/checkpatch.pl clean.
> Unit test: Migration blocking succesfully tested on ARM -- injected AO error
> blocks it. On x86 the same type of error being relayed doesn't block.
> 
> v2:
>   - adding compressed transfer handling of poisoned pages
> 
> v3:
>   - Included the Reviewed-by and Tested-by information on first patch
>   - added a TODO comment above control_save_page()
>     mentioning Zhijian's feedback about RDMA migration failure.
> 
> v4:
>   - adding a patch to deal with unknown poison tracking (impacting ARM)
>     (not using migrate_add_blocker as this is not devices related and
>     we want to avoid the interaction with --only-migratable mechanism)
> 
> v5:
>   - Updating the code to the latest version
>   - adding qemu-arm@nongnu.org for a complementary review
> 
> 
> William Roche (2):
>   migration: skip poisoned memory pages on "ram saving" phase
>   migration: prevent migration when a poisoned page is unknown from the
>     VM

I hope someone from arch-specific can have a quick look at patch 2..

One thing to mention is unfortunately waiting on patch 2 means we'll miss
this release. Actually it is already missed.. softfreeze yesterday [1].  So
it may likely need to wait for 9.0.

[1] https://wiki.qemu.org/Planning/8.2

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM
  2023-11-08 21:45                                       ` Peter Xu
@ 2023-11-10 19:22                                         ` William Roche
  0 siblings, 0 replies; 34+ messages in thread
From: William Roche @ 2023-11-10 19:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, lizhijian, pbonzini, quintela, leobras,
	joao.m.martins, lidongchen

[-- Attachment #1: Type: text/plain, Size: 2765 bytes --]

On 11/8/23 22:45, Peter Xu wrote:
> On Mon, Nov 06, 2023 at 10:38:14PM +0100, William Roche wrote:
>> But it implies a lot of other changes:
>>      - The source has to flag the error pages to indicate a poison
>>        (new flag in the exchange protocole)
>>      - The destination has to be able to deal with the new protocole
> IIUC these two can be simply implemented by migrating hwpoison_page_list
> over to dest.  You need to have a compat bit for doing this, ignoring the
> list on old machine types, because old QEMUs will not recognize this vmsd.
>
> QEMU should even support migrating a list object in VMSD, feel free to have
> a look at VMSTATE_QLIST_V().

This is another area that I'll need to learn about.

>>      - The destination has to be able to mark the pages as poisoned
>>        (authorized to use userfaultfd)
> Note: userfaultfd is actually available without any privilege if to use
> UFFDIO_POISON only, as long as to open the uffd (either via syscall or
> /dev/userfaultfd) using UFFD_FLAG_USER_ONLY.
>
> A trick is we can register with UFFD_WP mode (not MISSING; because when a
> kernel accesses a missing page it'll cause SIGBUS then with USER_ONLY),
> then inject whatever POISON we want.  As long as UFFDIO_WRITEPROTECT is not
> invoked, UFFD_WP does nothing (unlike MISSING).
>
>>      - So both source and destination have to be upgraded (of course
>>        qemu but also an appropriate kernel version providing
>>        UFFDIO_POISON on the destination)
> True.  Unfortunately this is not avoidable.
>
>>      - we may need to be able to negotiate a fall back solution
>>      - an indication of the method to use could belong to the
>>        migration capabilities and parameters
> For above two points: it's a common issue with migration compatibility.  As
> long as you can provide above VMSD to migrate hwpoison_page_list, marking
> all old QEMU machine types skipping that, then it should just work.
>
> You can have a closer look at anything in hw_compat_* as an example.

Yes, I'll do that.

>>      - etc...
> I think you did summarize mostly all the points I can think of; is there
> really anything more? :)

Probably some work to select the poison migration method, for example
allowing (or not) a migration transforming poison into zeros as a fallback
method when the poison migration itself, with UFFDIO_POISON, can't be
used.

> It'll be great if you can, or plan to, fix that for good.

Thanks for the offer ;)
I'd really like to implement that, but I currently have another pressing
issue to work on. I should be back on this topic within a few months.

I'm now waiting for some feedback from the ARM architecture reviewer(s).

Thanks a lot for all your suggestions.

[-- Attachment #2: Type: text/html, Size: 4777 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v1 0/1] Qemu crashes on VM migration after an handled memory error
  2023-11-08 21:49                                       ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error Peter Xu
@ 2024-01-30 19:06                                         ` “William Roche
  2024-01-30 19:06                                           ` [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory “William Roche
  0 siblings, 1 reply; 34+ messages in thread
From: “William Roche @ 2024-01-30 19:06 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: qemu-arm, lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

Problem:
--------
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page. This situation creates a hole in the VM memory address space (an
unreadable page or set of pages).

A migration request of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

New fix proposal:
-----------------
Let's prevent the migration when we know that there is a poison page in
the VM address space.


History:
--------
My first fix proposal for this crash condition (latest version:
https://lore.kernel.org/all/20231106220319.456765-1-william.roche@oracle.com/ )
relied on a well-behaved kernel to guarantee that a known poisoned page is
not accessed. It introduced an ARM platform specificity.
I haven't received any feedback about this ARM specificity, meant to avoid
a possible memory corruption after a migration transforming a poisoned
page into an all-zero page.

I also accept that when a memory error leads to memory poisoning, this
platform functionality has to be honored as long as a physical platform
would provide it.

Peter asked for a complete correction of this problem (transferring
the memory hole information with the migration and recreating these
holes on the destination platform).

In the meantime, this is a very small fix to avoid the current crash
situation when reading the poisoned memory pages.  I'm simply preventing
the migration when we know that it would crash, i.e. when there is a
poisoned page in the VM address space.

This is a generic protection code, avoiding a crash condition and
reporting the following error message:
"Error: Can't migrate this vm with hardware poisoned memory, please reboot the vm and try again"
instead of crashing the VM.

This fix is scripts/checkpatch.pl clean.
Unit tested on ARM and x86.


William Roche (1):
  migration: prevent migration when VM has poisoned memory

 accel/kvm/kvm-all.c    | 10 ++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   |  6 ++++++
 migration/migration.c  |  7 +++++++
 4 files changed, 28 insertions(+)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory
  2024-01-30 19:06                                         ` [PATCH v1 0/1] " “William Roche
@ 2024-01-30 19:06                                           ` “William Roche
  2024-01-31  1:48                                             ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: “William Roche @ 2024-01-30 19:06 UTC (permalink / raw)
  To: qemu-devel, peterx
  Cc: qemu-arm, lizhijian, pbonzini, quintela, leobras, joao.m.martins,
	lidongchen, william.roche

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page
#5  ram_save_target_page_legacy
#6  ram_save_host_page
#7  ram_find_and_save_block
#8  ram_save_iterate
#9  qemu_savevm_state_iterate
#10 migration_iteration_run
#11 migration_thread
#12 qemu_thread_start

To avoid this VM crash during the migration, prevent the migration
when a known hardware poison exists on the VM.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 10 ++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   |  6 ++++++
 migration/migration.c  |  7 +++++++
 4 files changed, 28 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 49e755ec4a..a8cecd040e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1119,6 +1119,11 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+/*
+ * We track the poisoned pages to be able to:
+ * - replace them on VM reset
+ * - block a migration for a VM with a poisoned page
+ */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
     QLIST_ENTRY(HWPoisonPage) list;
@@ -1152,6 +1157,11 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
+bool kvm_hwpoisoned_mem(void)
+{
+    return !QLIST_EMPTY(&hwpoison_page_list);
+}
+
 static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
 {
 #if HOST_BIG_ENDIAN != TARGET_BIG_ENDIAN
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 1b37d9a302..ca38172884 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -124,3 +124,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_mem(void)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index d614878164..fad9a7e8ff 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -538,4 +538,10 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_mem - indicate if there is any hwpoisoned page
+ * reported for the VM.
+ */
+bool kvm_hwpoisoned_mem(void);
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index d5f705ceef..b574e66f7b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -67,6 +67,7 @@
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "qemu/sockets.h"
+#include "sysemu/kvm.h"
 
 static NotifierList migration_state_notifiers =
     NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -1906,6 +1907,12 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
         return false;
     }
 
+    if (kvm_hwpoisoned_mem()) {
+        error_setg(errp, "Can't migrate this vm with hardware poisoned memory, "
+                   "please reboot the vm and try again");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory
  2024-01-30 19:06                                           ` [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory “William Roche
@ 2024-01-31  1:48                                             ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2024-01-31  1:48 UTC (permalink / raw)
  To: “William Roche
  Cc: qemu-devel, qemu-arm, lizhijian, pbonzini, leobras,
	joao.m.martins, lidongchen

On Tue, Jan 30, 2024 at 07:06:40PM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A memory page poisoned from the hypervisor level is no longer readable.
> The migration of a VM will crash Qemu when it tries to read the
> memory address space and stumbles on the poisoned page with a similar
> stack trace:
> 
> Program terminated with signal SIGBUS, Bus error.
> #0  _mm256_loadu_si256
> #1  buffer_zero_avx2
> #2  select_accel_fn
> #3  buffer_is_zero
> #4  save_zero_page
> #5  ram_save_target_page_legacy
> #6  ram_save_host_page
> #7  ram_find_and_save_block
> #8  ram_save_iterate
> #9  qemu_savevm_state_iterate
> #10 migration_iteration_run
> #11 migration_thread
> #12 qemu_thread_start
> 
> To avoid this VM crash during the migration, prevent the migration
> when a known hardware poison exists on the VM.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>

I queued it for now, while it'll always good to get feedback from either
Paolo or anyone else, as the pull won't happen in one week.  If no
objection it'll be included the next migration pull.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-01-31  1:49 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-06 13:59 [PATCH 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
2023-09-06 13:59 ` [PATCH 1/1] migration: skip poisoned memory pages on "ram saving" phase “William Roche
2023-09-06 14:19   ` Joao Martins
2023-09-06 15:16     ` Peter Xu
2023-09-06 21:29       ` William Roche
2023-09-09 14:57         ` Joao Martins
2023-09-11 19:48           ` Peter Xu
2023-09-12 18:44             ` Peter Xu
2023-09-14 20:20               ` [PATCH v2 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
2023-09-14 20:20                 ` [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase “William Roche
2023-09-15  3:13                   ` Zhijian Li (Fujitsu)
2023-09-15 11:31                     ` William Roche
2023-09-18  3:47                       ` Zhijian Li (Fujitsu)
2023-09-20 10:04                       ` Zhijian Li (Fujitsu)
2023-09-20 12:11                         ` William Roche
2023-09-20 23:53                         ` [PATCH v3 0/1] Qemu crashes on VM migration after an handled memory error “William Roche
2023-09-20 23:53                           ` [PATCH v3 1/1] migration: skip poisoned memory pages on "ram saving" phase “William Roche
2023-10-13 15:08                           ` [PATCH v4 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
2023-10-13 15:08                             ` [PATCH v4 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
2023-10-13 15:08                             ` [PATCH v4 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
2023-10-16 16:48                               ` Peter Xu
2023-10-17  0:38                                 ` William Roche
2023-10-17 15:13                                   ` Peter Xu
2023-11-06 21:38                                     ` William Roche
2023-11-08 21:45                                       ` Peter Xu
2023-11-10 19:22                                         ` William Roche
2023-11-06 22:03                                     ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error “William Roche
2023-11-06 22:03                                       ` [PATCH v5 1/2] migration: skip poisoned memory pages on "ram saving" phase “William Roche
2023-11-06 22:03                                       ` [PATCH v5 2/2] migration: prevent migration when a poisoned page is unknown from the VM “William Roche
2023-11-08 21:49                                       ` [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error Peter Xu
2024-01-30 19:06                                         ` [PATCH v1 0/1] " “William Roche
2024-01-30 19:06                                           ` [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory “William Roche
2024-01-31  1:48                                             ` Peter Xu
2023-09-14 21:50                 ` [PATCH v2 0/1] Qemu crashes on VM migration after an handled memory error Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).