qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
@ 2015-12-03  7:32 Li Zhijian
  2015-12-03  9:24 ` Dr. David Alan Gilbert
  2015-12-03 11:20 ` Juan Quintela
  0 siblings, 2 replies; 9+ messages in thread
From: Li Zhijian @ 2015-12-03  7:32 UTC (permalink / raw)
  To: qemu-devel, Juan Quintela, Amit Shah,
	Dr. David Alan Gilbert (git), david
  Cc: Li Zhijian, zhanghailiang

Hi all,

Does anyboday remember the similar issue post by hailiang months ago
  http://patchwork.ozlabs.org/patch/454322/
At least tow bugs about migration had been fixed since that.

And now we found the same issue at the tcg vm(kvm is fine), after 
migration, the content VM's memory is inconsistent.

we add a patch to check memory content, you can find it from affix

steps to reporduce:
1) apply the patch and re-build qemu
2) prepare the ubuntu guest and run memtest in grub.
soruce side:
x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device 
e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive 
if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp 
tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine 
pc-i440fx-2.3,accel=tcg,usb=off

destination side:
x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device 
e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive 
if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp 
tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine 
pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881

3) start migration
with 1000M NIC, migration will finish within 3 min.

at source:
(qemu) migrate tcp:192.168.2.66:8881
after saving ram complete
e9e725df678d392b1a83b3a917f332bb
qemu-system-x86_64: end ram md5
(qemu)

at destination:
...skip...
Completed load of VM with exit code 0 seq iteration 1264
Completed load of VM with exit code 0 seq iteration 1265
Completed load of VM with exit code 0 seq iteration 1266
qemu-system-x86_64: after loading state section id 2(ram)
49c2dac7bde0e5e22db7280dcb3824f9
qemu-system-x86_64: end ram md5
qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init

49c2dac7bde0e5e22db7280dcb3824f9
qemu-system-x86_64: end ram md5

This occurs occasionally and only at tcg machine. It seems that
some pages dirtied in source side don't transferred to destination.
This problem can be reproduced even if we disable virtio.

Is it OK for some pages that not transferred to destination when do 
migration ? Or is it a bug?

Any idea...

=================md5 check patch=============================

diff --git a/Makefile.target b/Makefile.target
index 962d004..e2cb8e9 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
  obj-y += memory_mapping.o
  obj-y += dump.o
  obj-y += migration/ram.o migration/savevm.o
-LIBS := $(libs_softmmu) $(LIBS)
+LIBS := $(libs_softmmu) $(LIBS) -lplumb

  # xen support
  obj-$(CONFIG_XEN) += xen-common.o
diff --git a/migration/ram.c b/migration/ram.c
index 1eb155a..3b7a09d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
      }

      rcu_read_unlock();
-    DPRINTF("Completed load of VM with exit code %d seq iteration "
+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
              "%" PRIu64 "\n", ret, seq_iter);
      return ret;
  }
diff --git a/migration/savevm.c b/migration/savevm.c
index 0ad1b93..3feaa61 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)

  }

+#include "exec/ram_addr.h"
+#include "qemu/rcu_queue.h"
+#include <clplumbing/md5.h>
+#ifndef MD5_DIGEST_LENGTH
+#define MD5_DIGEST_LENGTH 16
+#endif
+
+static void check_host_md5(void)
+{
+    int i;
+    unsigned char md[MD5_DIGEST_LENGTH];
+    rcu_read_lock();
+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check 
'pc.ram' block */
+    rcu_read_unlock();
+
+    MD5(block->host, block->used_length, md);
+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
+        fprintf(stderr, "%02x", md[i]);
+    }
+    fprintf(stderr, "\n");
+    error_report("end ram md5");
+}
+
  void qemu_savevm_state_begin(QEMUFile *f,
                               const MigrationParams *params)
  {
@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile 
*f, bool iterable_only)
          save_section_header(f, se, QEMU_VM_SECTION_END);

          ret = se->ops->save_live_complete_precopy(f, se->opaque);
+
+        fprintf(stderr, "after saving %s complete\n", se->idstr);
+        check_host_md5();
+
          trace_savevm_section_end(se->idstr, se->section_id, ret);
          save_section_footer(f, se);
          if (ret < 0) {
@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, 
MigrationIncomingState *mis)
                               section_id, le->se->idstr);
                  return ret;
              }
+            if (section_type == QEMU_VM_SECTION_END) {
+                error_report("after loading state section id %d(%s)",
+                             section_id, le->se->idstr);
+                check_host_md5();
+            }
              if (!check_section_footer(f, le)) {
                  return -EINVAL;
              }
@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
      }

      cpu_synchronize_all_post_init();
+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
+    check_host_md5();

      return ret;
  }

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03  7:32 [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration Li Zhijian
@ 2015-12-03  9:24 ` Dr. David Alan Gilbert
  2015-12-03  9:37   ` Hailiang Zhang
  2015-12-03 10:23   ` Li Zhijian
  2015-12-03 11:20 ` Juan Quintela
  1 sibling, 2 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-03  9:24 UTC (permalink / raw)
  To: Li Zhijian; +Cc: zhanghailiang, Juan Quintela, qemu-devel, Amit Shah, david

* Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
> Hi all,
> 
> Does anyboday remember the similar issue post by hailiang months ago
>  http://patchwork.ozlabs.org/patch/454322/
> At least tow bugs about migration had been fixed since that.

Yes, I wondered what happened to that.

> And now we found the same issue at the tcg vm(kvm is fine), after migration,
> the content VM's memory is inconsistent.

Hmm, TCG only - I don't know much about that; but I guess something must
be accessing memory without using the proper macros/functions so
it doesn't mark it as dirty.

> we add a patch to check memory content, you can find it from affix
> 
> steps to reporduce:
> 1) apply the patch and re-build qemu
> 2) prepare the ubuntu guest and run memtest in grub.
> soruce side:
> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> pc-i440fx-2.3,accel=tcg,usb=off
> 
> destination side:
> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
> 
> 3) start migration
> with 1000M NIC, migration will finish within 3 min.
> 
> at source:
> (qemu) migrate tcp:192.168.2.66:8881
> after saving ram complete
> e9e725df678d392b1a83b3a917f332bb
> qemu-system-x86_64: end ram md5
> (qemu)
> 
> at destination:
> ...skip...
> Completed load of VM with exit code 0 seq iteration 1264
> Completed load of VM with exit code 0 seq iteration 1265
> Completed load of VM with exit code 0 seq iteration 1266
> qemu-system-x86_64: after loading state section id 2(ram)
> 49c2dac7bde0e5e22db7280dcb3824f9
> qemu-system-x86_64: end ram md5
> qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
> 
> 49c2dac7bde0e5e22db7280dcb3824f9
> qemu-system-x86_64: end ram md5
> 
> This occurs occasionally and only at tcg machine. It seems that
> some pages dirtied in source side don't transferred to destination.
> This problem can be reproduced even if we disable virtio.
> 
> Is it OK for some pages that not transferred to destination when do
> migration ? Or is it a bug?

I'm pretty sure that means it's a bug.  Hard to find though, I guess
at least memtest is smaller than a big OS.  I think I'd dump the whole
of memory on both sides, hexdump and diff them  - I'd guess it would
just be one byte/word different, maybe that would offer some idea what
wrote it.

Dave

> Any idea...
> 
> =================md5 check patch=============================
> 
> diff --git a/Makefile.target b/Makefile.target
> index 962d004..e2cb8e9 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
>  obj-y += memory_mapping.o
>  obj-y += dump.o
>  obj-y += migration/ram.o migration/savevm.o
> -LIBS := $(libs_softmmu) $(LIBS)
> +LIBS := $(libs_softmmu) $(LIBS) -lplumb
> 
>  # xen support
>  obj-$(CONFIG_XEN) += xen-common.o
> diff --git a/migration/ram.c b/migration/ram.c
> index 1eb155a..3b7a09d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
> version_id)
>      }
> 
>      rcu_read_unlock();
> -    DPRINTF("Completed load of VM with exit code %d seq iteration "
> +    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
>              "%" PRIu64 "\n", ret, seq_iter);
>      return ret;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0ad1b93..3feaa61 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
> 
>  }
> 
> +#include "exec/ram_addr.h"
> +#include "qemu/rcu_queue.h"
> +#include <clplumbing/md5.h>
> +#ifndef MD5_DIGEST_LENGTH
> +#define MD5_DIGEST_LENGTH 16
> +#endif
> +
> +static void check_host_md5(void)
> +{
> +    int i;
> +    unsigned char md[MD5_DIGEST_LENGTH];
> +    rcu_read_lock();
> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
> 'pc.ram' block */
> +    rcu_read_unlock();
> +
> +    MD5(block->host, block->used_length, md);
> +    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
> +        fprintf(stderr, "%02x", md[i]);
> +    }
> +    fprintf(stderr, "\n");
> +    error_report("end ram md5");
> +}
> +
>  void qemu_savevm_state_begin(QEMUFile *f,
>                               const MigrationParams *params)
>  {
> @@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
> bool iterable_only)
>          save_section_header(f, se, QEMU_VM_SECTION_END);
> 
>          ret = se->ops->save_live_complete_precopy(f, se->opaque);
> +
> +        fprintf(stderr, "after saving %s complete\n", se->idstr);
> +        check_host_md5();
> +
>          trace_savevm_section_end(se->idstr, se->section_id, ret);
>          save_section_footer(f, se);
>          if (ret < 0) {
> @@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
> MigrationIncomingState *mis)
>                               section_id, le->se->idstr);
>                  return ret;
>              }
> +            if (section_type == QEMU_VM_SECTION_END) {
> +                error_report("after loading state section id %d(%s)",
> +                             section_id, le->se->idstr);
> +                check_host_md5();
> +            }
>              if (!check_section_footer(f, le)) {
>                  return -EINVAL;
>              }
> @@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
>      }
> 
>      cpu_synchronize_all_post_init();
> +    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
> +    check_host_md5();
> 
>      return ret;
>  }
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03  9:24 ` Dr. David Alan Gilbert
@ 2015-12-03  9:37   ` Hailiang Zhang
  2015-12-03 10:23     ` Li Zhijian
  2015-12-03 10:23   ` Li Zhijian
  1 sibling, 1 reply; 9+ messages in thread
From: Hailiang Zhang @ 2015-12-03  9:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Li Zhijian
  Cc: Juan Quintela, peter.huangpeng, qemu-devel, Amit Shah, david

On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
>> Hi all,
>>
>> Does anyboday remember the similar issue post by hailiang months ago
>>   http://patchwork.ozlabs.org/patch/454322/
>> At least tow bugs about migration had been fixed since that.
>
> Yes, I wondered what happened to that.
>
>> And now we found the same issue at the tcg vm(kvm is fine), after migration,
>> the content VM's memory is inconsistent.
>
> Hmm, TCG only - I don't know much about that; but I guess something must
> be accessing memory without using the proper macros/functions so
> it doesn't mark it as dirty.
>
>> we add a patch to check memory content, you can find it from affix
>>
>> steps to reporduce:
>> 1) apply the patch and re-build qemu
>> 2) prepare the ubuntu guest and run memtest in grub.
>> soruce side:
>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>> pc-i440fx-2.3,accel=tcg,usb=off
>>
>> destination side:
>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
>>
>> 3) start migration
>> with 1000M NIC, migration will finish within 3 min.
>>
>> at source:
>> (qemu) migrate tcp:192.168.2.66:8881
>> after saving ram complete
>> e9e725df678d392b1a83b3a917f332bb
>> qemu-system-x86_64: end ram md5
>> (qemu)
>>
>> at destination:
>> ...skip...
>> Completed load of VM with exit code 0 seq iteration 1264
>> Completed load of VM with exit code 0 seq iteration 1265
>> Completed load of VM with exit code 0 seq iteration 1266
>> qemu-system-x86_64: after loading state section id 2(ram)
>> 49c2dac7bde0e5e22db7280dcb3824f9
>> qemu-system-x86_64: end ram md5
>> qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
>>
>> 49c2dac7bde0e5e22db7280dcb3824f9
>> qemu-system-x86_64: end ram md5
>>
>> This occurs occasionally and only at tcg machine. It seems that
>> some pages dirtied in source side don't transferred to destination.
>> This problem can be reproduced even if we disable virtio.
>>
>> Is it OK for some pages that not transferred to destination when do
>> migration ? Or is it a bug?
>
> I'm pretty sure that means it's a bug.  Hard to find though, I guess
> at least memtest is smaller than a big OS.  I think I'd dump the whole
> of memory on both sides, hexdump and diff them  - I'd guess it would
> just be one byte/word different, maybe that would offer some idea what
> wrote it.
>

Maybe one better way to do that is with the help of userfaultfd's write-protect
capability. It is still in the development by Andrea Arcangeli, but there
is a RFC version available, please refer to http://www.spinics.net/lists/linux-mm/msg97422.html
(I'm developing live memory snapshot which based on it, maybe this is another scene where we
can use userfaultfd's WP ;) ).


> Dave
>
>> Any idea...
>>
>> =================md5 check patch=============================
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 962d004..e2cb8e9 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
>>   obj-y += memory_mapping.o
>>   obj-y += dump.o
>>   obj-y += migration/ram.o migration/savevm.o
>> -LIBS := $(libs_softmmu) $(LIBS)
>> +LIBS := $(libs_softmmu) $(LIBS) -lplumb
>>
>>   # xen support
>>   obj-$(CONFIG_XEN) += xen-common.o
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 1eb155a..3b7a09d 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
>> version_id)
>>       }
>>
>>       rcu_read_unlock();
>> -    DPRINTF("Completed load of VM with exit code %d seq iteration "
>> +    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
>>               "%" PRIu64 "\n", ret, seq_iter);
>>       return ret;
>>   }
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 0ad1b93..3feaa61 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
>>
>>   }
>>
>> +#include "exec/ram_addr.h"
>> +#include "qemu/rcu_queue.h"
>> +#include <clplumbing/md5.h>
>> +#ifndef MD5_DIGEST_LENGTH
>> +#define MD5_DIGEST_LENGTH 16
>> +#endif
>> +
>> +static void check_host_md5(void)
>> +{
>> +    int i;
>> +    unsigned char md[MD5_DIGEST_LENGTH];
>> +    rcu_read_lock();
>> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
>> 'pc.ram' block */
>> +    rcu_read_unlock();
>> +
>> +    MD5(block->host, block->used_length, md);
>> +    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
>> +        fprintf(stderr, "%02x", md[i]);
>> +    }
>> +    fprintf(stderr, "\n");
>> +    error_report("end ram md5");
>> +}
>> +
>>   void qemu_savevm_state_begin(QEMUFile *f,
>>                                const MigrationParams *params)
>>   {
>> @@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
>> bool iterable_only)
>>           save_section_header(f, se, QEMU_VM_SECTION_END);
>>
>>           ret = se->ops->save_live_complete_precopy(f, se->opaque);
>> +
>> +        fprintf(stderr, "after saving %s complete\n", se->idstr);
>> +        check_host_md5();
>> +
>>           trace_savevm_section_end(se->idstr, se->section_id, ret);
>>           save_section_footer(f, se);
>>           if (ret < 0) {
>> @@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
>> MigrationIncomingState *mis)
>>                                section_id, le->se->idstr);
>>                   return ret;
>>               }
>> +            if (section_type == QEMU_VM_SECTION_END) {
>> +                error_report("after loading state section id %d(%s)",
>> +                             section_id, le->se->idstr);
>> +                check_host_md5();
>> +            }
>>               if (!check_section_footer(f, le)) {
>>                   return -EINVAL;
>>               }
>> @@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
>>       }
>>
>>       cpu_synchronize_all_post_init();
>> +    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
>> +    check_host_md5();
>>
>>       return ret;
>>   }
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03  9:24 ` Dr. David Alan Gilbert
  2015-12-03  9:37   ` Hailiang Zhang
@ 2015-12-03 10:23   ` Li Zhijian
  2015-12-03 11:22     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 9+ messages in thread
From: Li Zhijian @ 2015-12-03 10:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: zhanghailiang, Juan Quintela, qemu-devel, Amit Shah, david



On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
>> Hi all,
>>
>> Does anyboday remember the similar issue post by hailiang months ago
>>   http://patchwork.ozlabs.org/patch/454322/
>> At least tow bugs about migration had been fixed since that.
>
> Yes, I wondered what happened to that.
>
>> And now we found the same issue at the tcg vm(kvm is fine), after migration,
>> the content VM's memory is inconsistent.
>
> Hmm, TCG only - I don't know much about that; but I guess something must
> be accessing memory without using the proper macros/functions so
> it doesn't mark it as dirty.
>
>> we add a patch to check memory content, you can find it from affix
>>
>> steps to reporduce:
>> 1) apply the patch and re-build qemu
>> 2) prepare the ubuntu guest and run memtest in grub.
>> soruce side:
>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>> pc-i440fx-2.3,accel=tcg,usb=off
>>
>> destination side:
>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
>>
>> 3) start migration
>> with 1000M NIC, migration will finish within 3 min.
>>
>> at source:
>> (qemu) migrate tcp:192.168.2.66:8881
>> after saving ram complete
>> e9e725df678d392b1a83b3a917f332bb
>> qemu-system-x86_64: end ram md5
>> (qemu)
>>
>> at destination:
>> ...skip...
>> Completed load of VM with exit code 0 seq iteration 1264
>> Completed load of VM with exit code 0 seq iteration 1265
>> Completed load of VM with exit code 0 seq iteration 1266
>> qemu-system-x86_64: after loading state section id 2(ram)
>> 49c2dac7bde0e5e22db7280dcb3824f9
>> qemu-system-x86_64: end ram md5
>> qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
>>
>> 49c2dac7bde0e5e22db7280dcb3824f9
>> qemu-system-x86_64: end ram md5
>>
>> This occurs occasionally and only at tcg machine. It seems that
>> some pages dirtied in source side don't transferred to destination.
>> This problem can be reproduced even if we disable virtio.
>>
>> Is it OK for some pages that not transferred to destination when do
>> migration ? Or is it a bug?
>
> I'm pretty sure that means it's a bug.  Hard to find though, I guess
> at least memtest is smaller than a big OS.  I think I'd dump the whole
> of memory on both sides, hexdump and diff them  - I'd guess it would
> just be one byte/word different, maybe that would offer some idea what
> wrote it.

I try to dump and compare them, more than 10 pages are different.
in source side, they are random value rather than always 'FF' 'FB' 'EF' 
'BF'... in destination.

and not all of the different pages are continuous.

thanks
Li


>
> Dave
>
>> Any idea...
>>
>> =================md5 check patch=============================
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 962d004..e2cb8e9 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
>>   obj-y += memory_mapping.o
>>   obj-y += dump.o
>>   obj-y += migration/ram.o migration/savevm.o
>> -LIBS := $(libs_softmmu) $(LIBS)
>> +LIBS := $(libs_softmmu) $(LIBS) -lplumb
>>
>>   # xen support
>>   obj-$(CONFIG_XEN) += xen-common.o
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 1eb155a..3b7a09d 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
>> version_id)
>>       }
>>
>>       rcu_read_unlock();
>> -    DPRINTF("Completed load of VM with exit code %d seq iteration "
>> +    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
>>               "%" PRIu64 "\n", ret, seq_iter);
>>       return ret;
>>   }
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 0ad1b93..3feaa61 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
>>
>>   }
>>
>> +#include "exec/ram_addr.h"
>> +#include "qemu/rcu_queue.h"
>> +#include <clplumbing/md5.h>
>> +#ifndef MD5_DIGEST_LENGTH
>> +#define MD5_DIGEST_LENGTH 16
>> +#endif
>> +
>> +static void check_host_md5(void)
>> +{
>> +    int i;
>> +    unsigned char md[MD5_DIGEST_LENGTH];
>> +    rcu_read_lock();
>> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
>> 'pc.ram' block */
>> +    rcu_read_unlock();
>> +
>> +    MD5(block->host, block->used_length, md);
>> +    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
>> +        fprintf(stderr, "%02x", md[i]);
>> +    }
>> +    fprintf(stderr, "\n");
>> +    error_report("end ram md5");
>> +}
>> +
>>   void qemu_savevm_state_begin(QEMUFile *f,
>>                                const MigrationParams *params)
>>   {
>> @@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
>> bool iterable_only)
>>           save_section_header(f, se, QEMU_VM_SECTION_END);
>>
>>           ret = se->ops->save_live_complete_precopy(f, se->opaque);
>> +
>> +        fprintf(stderr, "after saving %s complete\n", se->idstr);
>> +        check_host_md5();
>> +
>>           trace_savevm_section_end(se->idstr, se->section_id, ret);
>>           save_section_footer(f, se);
>>           if (ret < 0) {
>> @@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
>> MigrationIncomingState *mis)
>>                                section_id, le->se->idstr);
>>                   return ret;
>>               }
>> +            if (section_type == QEMU_VM_SECTION_END) {
>> +                error_report("after loading state section id %d(%s)",
>> +                             section_id, le->se->idstr);
>> +                check_host_md5();
>> +            }
>>               if (!check_section_footer(f, le)) {
>>                   return -EINVAL;
>>               }
>> @@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
>>       }
>>
>>       cpu_synchronize_all_post_init();
>> +    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
>> +    check_host_md5();
>>
>>       return ret;
>>   }
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
> .
>

-- 
Best regards.
Li Zhijian (8555)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03  9:37   ` Hailiang Zhang
@ 2015-12-03 10:23     ` Li Zhijian
  0 siblings, 0 replies; 9+ messages in thread
From: Li Zhijian @ 2015-12-03 10:23 UTC (permalink / raw)
  To: Hailiang Zhang, Dr. David Alan Gilbert
  Cc: Juan Quintela, peter.huangpeng, qemu-devel, Amit Shah, david



On 12/03/2015 05:37 PM, Hailiang Zhang wrote:
> On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
>> * Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
>>> Hi all,
>>>
>>> Does anyboday remember the similar issue post by hailiang months ago
>>>   http://patchwork.ozlabs.org/patch/454322/
>>> At least tow bugs about migration had been fixed since that.
>>
>> Yes, I wondered what happened to that.
>>
>>> And now we found the same issue at the tcg vm(kvm is fine), after
>>> migration,
>>> the content VM's memory is inconsistent.
>>
>> Hmm, TCG only - I don't know much about that; but I guess something must
>> be accessing memory without using the proper macros/functions so
>> it doesn't mark it as dirty.
>>
>>> we add a patch to check memory content, you can find it from affix
>>>
>>> steps to reporduce:
>>> 1) apply the patch and re-build qemu
>>> 2) prepare the ubuntu guest and run memtest in grub.
>>> soruce side:
>>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>
>>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>>> pc-i440fx-2.3,accel=tcg,usb=off
>>>
>>> destination side:
>>> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
>>> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
>>> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>
>>> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
>>> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
>>> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
>>>
>>> 3) start migration
>>> with 1000M NIC, migration will finish within 3 min.
>>>
>>> at source:
>>> (qemu) migrate tcp:192.168.2.66:8881
>>> after saving ram complete
>>> e9e725df678d392b1a83b3a917f332bb
>>> qemu-system-x86_64: end ram md5
>>> (qemu)
>>>
>>> at destination:
>>> ...skip...
>>> Completed load of VM with exit code 0 seq iteration 1264
>>> Completed load of VM with exit code 0 seq iteration 1265
>>> Completed load of VM with exit code 0 seq iteration 1266
>>> qemu-system-x86_64: after loading state section id 2(ram)
>>> 49c2dac7bde0e5e22db7280dcb3824f9
>>> qemu-system-x86_64: end ram md5
>>> qemu-system-x86_64: qemu_loadvm_state: after
>>> cpu_synchronize_all_post_init
>>>
>>> 49c2dac7bde0e5e22db7280dcb3824f9
>>> qemu-system-x86_64: end ram md5
>>>
>>> This occurs occasionally and only at tcg machine. It seems that
>>> some pages dirtied in source side don't transferred to destination.
>>> This problem can be reproduced even if we disable virtio.
>>>
>>> Is it OK for some pages that not transferred to destination when do
>>> migration ? Or is it a bug?
>>
>> I'm pretty sure that means it's a bug.  Hard to find though, I guess
>> at least memtest is smaller than a big OS.  I think I'd dump the whole
>> of memory on both sides, hexdump and diff them  - I'd guess it would
>> just be one byte/word different, maybe that would offer some idea what
>> wrote it.
>>
>
> Maybe one better way to do that is with the help of userfaultfd's
> write-protect
> capability. It is still in the development by Andrea Arcangeli, but there
> is a RFC version available, please refer to
> http://www.spinics.net/lists/linux-mm/msg97422.html
> (I'm developing live memory snapshot which based on it, maybe this is
> another scene where we
> can use userfaultfd's WP ;) ).

sounds good.

thanks
Li

>
>
>> Dave
>>
>>> Any idea...
>>>
>>> =================md5 check patch=============================
>>>
>>> diff --git a/Makefile.target b/Makefile.target
>>> index 962d004..e2cb8e9 100644
>>> --- a/Makefile.target
>>> +++ b/Makefile.target
>>> @@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
>>>   obj-y += memory_mapping.o
>>>   obj-y += dump.o
>>>   obj-y += migration/ram.o migration/savevm.o
>>> -LIBS := $(libs_softmmu) $(LIBS)
>>> +LIBS := $(libs_softmmu) $(LIBS) -lplumb
>>>
>>>   # xen support
>>>   obj-$(CONFIG_XEN) += xen-common.o
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 1eb155a..3b7a09d 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
>>> version_id)
>>>       }
>>>
>>>       rcu_read_unlock();
>>> -    DPRINTF("Completed load of VM with exit code %d seq iteration "
>>> +    fprintf(stderr, "Completed load of VM with exit code %d seq
>>> iteration "
>>>               "%" PRIu64 "\n", ret, seq_iter);
>>>       return ret;
>>>   }
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index 0ad1b93..3feaa61 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
>>>
>>>   }
>>>
>>> +#include "exec/ram_addr.h"
>>> +#include "qemu/rcu_queue.h"
>>> +#include <clplumbing/md5.h>
>>> +#ifndef MD5_DIGEST_LENGTH
>>> +#define MD5_DIGEST_LENGTH 16
>>> +#endif
>>> +
>>> +static void check_host_md5(void)
>>> +{
>>> +    int i;
>>> +    unsigned char md[MD5_DIGEST_LENGTH];
>>> +    rcu_read_lock();
>>> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
>>> 'pc.ram' block */
>>> +    rcu_read_unlock();
>>> +
>>> +    MD5(block->host, block->used_length, md);
>>> +    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
>>> +        fprintf(stderr, "%02x", md[i]);
>>> +    }
>>> +    fprintf(stderr, "\n");
>>> +    error_report("end ram md5");
>>> +}
>>> +
>>>   void qemu_savevm_state_begin(QEMUFile *f,
>>>                                const MigrationParams *params)
>>>   {
>>> @@ -1056,6 +1079,10 @@ void
>>> qemu_savevm_state_complete_precopy(QEMUFile *f,
>>> bool iterable_only)
>>>           save_section_header(f, se, QEMU_VM_SECTION_END);
>>>
>>>           ret = se->ops->save_live_complete_precopy(f, se->opaque);
>>> +
>>> +        fprintf(stderr, "after saving %s complete\n", se->idstr);
>>> +        check_host_md5();
>>> +
>>>           trace_savevm_section_end(se->idstr, se->section_id, ret);
>>>           save_section_footer(f, se);
>>>           if (ret < 0) {
>>> @@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
>>> MigrationIncomingState *mis)
>>>                                section_id, le->se->idstr);
>>>                   return ret;
>>>               }
>>> +            if (section_type == QEMU_VM_SECTION_END) {
>>> +                error_report("after loading state section id %d(%s)",
>>> +                             section_id, le->se->idstr);
>>> +                check_host_md5();
>>> +            }
>>>               if (!check_section_footer(f, le)) {
>>>                   return -EINVAL;
>>>               }
>>> @@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
>>>       }
>>>
>>>       cpu_synchronize_all_post_init();
>>> +    error_report("%s: after cpu_synchronize_all_post_init\n",
>>> __func__);
>>> +    check_host_md5();
>>>
>>>       return ret;
>>>   }
>>>
>>>
>>>
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
>> .
>>
>
>
>
>
> .
>

-- 
Best regards.
Li Zhijian (8555)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03  7:32 [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration Li Zhijian
  2015-12-03  9:24 ` Dr. David Alan Gilbert
@ 2015-12-03 11:20 ` Juan Quintela
  2015-12-04  1:43   ` Li, Liang Z
  1 sibling, 1 reply; 9+ messages in thread
From: Juan Quintela @ 2015-12-03 11:20 UTC (permalink / raw)
  To: Li Zhijian
  Cc: zhanghailiang, qemu-devel, Dr. David Alan Gilbert (git),
	Amit Shah, david

Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
> Hi all,
>
> Does anyboday remember the similar issue post by hailiang months ago
>  http://patchwork.ozlabs.org/patch/454322/
> At least tow bugs about migration had been fixed since that.
>
> And now we found the same issue at the tcg vm(kvm is fine), after
> migration, the content VM's memory is inconsistent.
>
> we add a patch to check memory content, you can find it from affix
>
> steps to reporduce:
> 1) apply the patch and re-build qemu
> 2) prepare the ubuntu guest and run memtest in grub.
> soruce side:
> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> pc-i440fx-2.3,accel=tcg,usb=off
>
> destination side:
> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
>
> 3) start migration
> with 1000M NIC, migration will finish within 3 min.
>
> at source:
> (qemu) migrate tcp:192.168.2.66:8881
> after saving ram complete
> e9e725df678d392b1a83b3a917f332bb
> qemu-system-x86_64: end ram md5
> (qemu)
>
> at destination:
> ...skip...
> Completed load of VM with exit code 0 seq iteration 1264
> Completed load of VM with exit code 0 seq iteration 1265
> Completed load of VM with exit code 0 seq iteration 1266
> qemu-system-x86_64: after loading state section id 2(ram)
> 49c2dac7bde0e5e22db7280dcb3824f9
> qemu-system-x86_64: end ram md5
> qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
>
> 49c2dac7bde0e5e22db7280dcb3824f9
> qemu-system-x86_64: end ram md5
>
> This occurs occasionally and only at tcg machine. It seems that
> some pages dirtied in source side don't transferred to destination.
> This problem can be reproduced even if we disable virtio.
>
> Is it OK for some pages that not transferred to destination when do
> migration ? Or is it a bug?
>
> Any idea...

Thanks for describing how to reproduce the bug.
If some pages are not transferred to destination then it is a bug, so we
need to know what the problem is, notice that the problem can be that
TCG is not marking dirty some page, that Migration code "forgets" about
that page, or anything eles altogether, that is what we need to find.

There are more posibilities, I am not sure that memtest is on 32bit
mode, and it is inside posibility that we are missing some state when we
are on real mode.

Will try to take a look at this.

THanks, again.


>
> =================md5 check patch=============================
>
> diff --git a/Makefile.target b/Makefile.target
> index 962d004..e2cb8e9 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
>  obj-y += memory_mapping.o
>  obj-y += dump.o
>  obj-y += migration/ram.o migration/savevm.o
> -LIBS := $(libs_softmmu) $(LIBS)
> +LIBS := $(libs_softmmu) $(LIBS) -lplumb
>
>  # xen support
>  obj-$(CONFIG_XEN) += xen-common.o
> diff --git a/migration/ram.c b/migration/ram.c
> index 1eb155a..3b7a09d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque,
> int version_id)
>      }
>
>      rcu_read_unlock();
> -    DPRINTF("Completed load of VM with exit code %d seq iteration "
> +    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
>              "%" PRIu64 "\n", ret, seq_iter);
>      return ret;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0ad1b93..3feaa61 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
>
>  }
>
> +#include "exec/ram_addr.h"
> +#include "qemu/rcu_queue.h"
> +#include <clplumbing/md5.h>
> +#ifndef MD5_DIGEST_LENGTH
> +#define MD5_DIGEST_LENGTH 16
> +#endif
> +
> +static void check_host_md5(void)
> +{
> +    int i;
> +    unsigned char md[MD5_DIGEST_LENGTH];
> +    rcu_read_lock();
> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
> 'pc.ram' block */
> +    rcu_read_unlock();
> +
> +    MD5(block->host, block->used_length, md);
> +    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
> +        fprintf(stderr, "%02x", md[i]);
> +    }
> +    fprintf(stderr, "\n");
> +    error_report("end ram md5");
> +}
> +
>  void qemu_savevm_state_begin(QEMUFile *f,
>                               const MigrationParams *params)
>  {
> @@ -1056,6 +1079,10 @@ void
> qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
>          save_section_header(f, se, QEMU_VM_SECTION_END);
>
>          ret = se->ops->save_live_complete_precopy(f, se->opaque);
> +
> +        fprintf(stderr, "after saving %s complete\n", se->idstr);
> +        check_host_md5();
> +
>          trace_savevm_section_end(se->idstr, se->section_id, ret);
>          save_section_footer(f, se);
>          if (ret < 0) {
> @@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
> MigrationIncomingState *mis)
>                               section_id, le->se->idstr);
>                  return ret;
>              }
> +            if (section_type == QEMU_VM_SECTION_END) {
> +                error_report("after loading state section id %d(%s)",
> +                             section_id, le->se->idstr);
> +                check_host_md5();
> +            }
>              if (!check_section_footer(f, le)) {
>                  return -EINVAL;
>              }
> @@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
>      }
>
>      cpu_synchronize_all_post_init();
> +    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
> +    check_host_md5();
>
>      return ret;
>  }

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03 10:23   ` Li Zhijian
@ 2015-12-03 11:22     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-03 11:22 UTC (permalink / raw)
  To: Li Zhijian; +Cc: zhanghailiang, Juan Quintela, qemu-devel, Amit Shah, david

* Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
> 
> 
> On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
> >* Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
> >>Hi all,
> >>
> >>Does anyboday remember the similar issue post by hailiang months ago
> >>  http://patchwork.ozlabs.org/patch/454322/
> >>At least tow bugs about migration had been fixed since that.
> >
> >Yes, I wondered what happened to that.
> >
> >>And now we found the same issue at the tcg vm(kvm is fine), after migration,
> >>the content VM's memory is inconsistent.
> >
> >Hmm, TCG only - I don't know much about that; but I guess something must
> >be accessing memory without using the proper macros/functions so
> >it doesn't mark it as dirty.
> >
> >>we add a patch to check memory content, you can find it from affix
> >>
> >>steps to reporduce:
> >>1) apply the patch and re-build qemu
> >>2) prepare the ubuntu guest and run memtest in grub.
> >>soruce side:
> >>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> >>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> >>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> >>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> >>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> >>pc-i440fx-2.3,accel=tcg,usb=off
> >>
> >>destination side:
> >>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
> >>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
> >>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
> >>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
> >>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
> >>pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
> >>
> >>3) start migration
> >>with 1000M NIC, migration will finish within 3 min.
> >>
> >>at source:
> >>(qemu) migrate tcp:192.168.2.66:8881
> >>after saving ram complete
> >>e9e725df678d392b1a83b3a917f332bb
> >>qemu-system-x86_64: end ram md5
> >>(qemu)
> >>
> >>at destination:
> >>...skip...
> >>Completed load of VM with exit code 0 seq iteration 1264
> >>Completed load of VM with exit code 0 seq iteration 1265
> >>Completed load of VM with exit code 0 seq iteration 1266
> >>qemu-system-x86_64: after loading state section id 2(ram)
> >>49c2dac7bde0e5e22db7280dcb3824f9
> >>qemu-system-x86_64: end ram md5
> >>qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
> >>
> >>49c2dac7bde0e5e22db7280dcb3824f9
> >>qemu-system-x86_64: end ram md5
> >>
> >>This occurs occasionally and only at tcg machine. It seems that
> >>some pages dirtied in source side don't transferred to destination.
> >>This problem can be reproduced even if we disable virtio.
> >>
> >>Is it OK for some pages that not transferred to destination when do
> >>migration ? Or is it a bug?
> >
> >I'm pretty sure that means it's a bug.  Hard to find though, I guess
> >at least memtest is smaller than a big OS.  I think I'd dump the whole
> >of memory on both sides, hexdump and diff them  - I'd guess it would
> >just be one byte/word different, maybe that would offer some idea what
> >wrote it.
> 
> I try to dump and compare them, more than 10 pages are different.
> in source side, they are random value rather than always 'FF' 'FB' 'EF'
> 'BF'... in destination.
> 
> and not all of the different pages are continuous.

I wonder if it happens on all of memtest's different test patterns,
perhaps it might be possible to narrow it down if you tell memtest
to only run one test at a time.

Dave

> 
> thanks
> Li
> 
> 
> >
> >Dave
> >
> >>Any idea...
> >>
> >>=================md5 check patch=============================
> >>
> >>diff --git a/Makefile.target b/Makefile.target
> >>index 962d004..e2cb8e9 100644
> >>--- a/Makefile.target
> >>+++ b/Makefile.target
> >>@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
> >>  obj-y += memory_mapping.o
> >>  obj-y += dump.o
> >>  obj-y += migration/ram.o migration/savevm.o
> >>-LIBS := $(libs_softmmu) $(LIBS)
> >>+LIBS := $(libs_softmmu) $(LIBS) -lplumb
> >>
> >>  # xen support
> >>  obj-$(CONFIG_XEN) += xen-common.o
> >>diff --git a/migration/ram.c b/migration/ram.c
> >>index 1eb155a..3b7a09d 100644
> >>--- a/migration/ram.c
> >>+++ b/migration/ram.c
> >>@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
> >>version_id)
> >>      }
> >>
> >>      rcu_read_unlock();
> >>-    DPRINTF("Completed load of VM with exit code %d seq iteration "
> >>+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
> >>              "%" PRIu64 "\n", ret, seq_iter);
> >>      return ret;
> >>  }
> >>diff --git a/migration/savevm.c b/migration/savevm.c
> >>index 0ad1b93..3feaa61 100644
> >>--- a/migration/savevm.c
> >>+++ b/migration/savevm.c
> >>@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
> >>
> >>  }
> >>
> >>+#include "exec/ram_addr.h"
> >>+#include "qemu/rcu_queue.h"
> >>+#include <clplumbing/md5.h>
> >>+#ifndef MD5_DIGEST_LENGTH
> >>+#define MD5_DIGEST_LENGTH 16
> >>+#endif
> >>+
> >>+static void check_host_md5(void)
> >>+{
> >>+    int i;
> >>+    unsigned char md[MD5_DIGEST_LENGTH];
> >>+    rcu_read_lock();
> >>+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
> >>'pc.ram' block */
> >>+    rcu_read_unlock();
> >>+
> >>+    MD5(block->host, block->used_length, md);
> >>+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
> >>+        fprintf(stderr, "%02x", md[i]);
> >>+    }
> >>+    fprintf(stderr, "\n");
> >>+    error_report("end ram md5");
> >>+}
> >>+
> >>  void qemu_savevm_state_begin(QEMUFile *f,
> >>                               const MigrationParams *params)
> >>  {
> >>@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
> >>bool iterable_only)
> >>          save_section_header(f, se, QEMU_VM_SECTION_END);
> >>
> >>          ret = se->ops->save_live_complete_precopy(f, se->opaque);
> >>+
> >>+        fprintf(stderr, "after saving %s complete\n", se->idstr);
> >>+        check_host_md5();
> >>+
> >>          trace_savevm_section_end(se->idstr, se->section_id, ret);
> >>          save_section_footer(f, se);
> >>          if (ret < 0) {
> >>@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
> >>MigrationIncomingState *mis)
> >>                               section_id, le->se->idstr);
> >>                  return ret;
> >>              }
> >>+            if (section_type == QEMU_VM_SECTION_END) {
> >>+                error_report("after loading state section id %d(%s)",
> >>+                             section_id, le->se->idstr);
> >>+                check_host_md5();
> >>+            }
> >>              if (!check_section_footer(f, le)) {
> >>                  return -EINVAL;
> >>              }
> >>@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
> >>      }
> >>
> >>      cpu_synchronize_all_post_init();
> >>+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
> >>+    check_host_md5();
> >>
> >>      return ret;
> >>  }
> >>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >
> >.
> >
> 
> -- 
> Best regards.
> Li Zhijian (8555)
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-03 11:20 ` Juan Quintela
@ 2015-12-04  1:43   ` Li, Liang Z
  2015-12-17  6:07     ` Amit Shah
  0 siblings, 1 reply; 9+ messages in thread
From: Li, Liang Z @ 2015-12-04  1:43 UTC (permalink / raw)
  To: quintela@redhat.com, Li Zhijian
  Cc: Amit Shah, david@gibson.dropbear.id.au, zhanghailiang,
	Dr. David Alan Gilbert (git), qemu-devel@nongnu.org

> 
> Thanks for describing how to reproduce the bug.
> If some pages are not transferred to destination then it is a bug, so we need
> to know what the problem is, notice that the problem can be that TCG is not
> marking dirty some page, that Migration code "forgets" about that page, or
> anything eles altogether, that is what we need to find.
> 
> There are more posibilities, I am not sure that memtest is on 32bit mode, and
> it is inside posibility that we are missing some state when we are on real
> mode.
> 
> Will try to take a look at this.
> 
> THanks, again.
> 

Hi Juan & Amit

 Do you think we should add a mechanism to check the data integrity during LM like Zhijian's patch did?  it may be very helpful for developers. 
 Actually, I did the similar thing before in order to make sure that I did the right thing we I change the code related to LM.

Liang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
  2015-12-04  1:43   ` Li, Liang Z
@ 2015-12-17  6:07     ` Amit Shah
  0 siblings, 0 replies; 9+ messages in thread
From: Amit Shah @ 2015-12-17  6:07 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: zhanghailiang, Li Zhijian, quintela@redhat.com,
	qemu-devel@nongnu.org, Dr. David Alan Gilbert (git),
	david@gibson.dropbear.id.au

On (Fri) 04 Dec 2015 [01:43:07], Li, Liang Z wrote:
> > 
> > Thanks for describing how to reproduce the bug.
> > If some pages are not transferred to destination then it is a bug, so we need
> > to know what the problem is, notice that the problem can be that TCG is not
> > marking dirty some page, that Migration code "forgets" about that page, or
> > anything eles altogether, that is what we need to find.
> > 
> > There are more posibilities, I am not sure that memtest is on 32bit mode, and
> > it is inside posibility that we are missing some state when we are on real
> > mode.
> > 
> > Will try to take a look at this.
> > 
> > THanks, again.
> > 
> 
> Hi Juan & Amit
> 
>  Do you think we should add a mechanism to check the data integrity during LM like Zhijian's patch did?  it may be very helpful for developers. 
>  Actually, I did the similar thing before in order to make sure that I did the right thing we I change the code related to LM.

If you mean for debugging, something that's not always on, then I'm
fine with it.

A script that goes along that shows the result of comparison of the
diff will be helpful too, something that shows how many pages are
differnt, how many bytes in a page on average, and so on.

		Amit

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-12-17  6:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-12-03  7:32 [Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration Li Zhijian
2015-12-03  9:24 ` Dr. David Alan Gilbert
2015-12-03  9:37   ` Hailiang Zhang
2015-12-03 10:23     ` Li Zhijian
2015-12-03 10:23   ` Li Zhijian
2015-12-03 11:22     ` Dr. David Alan Gilbert
2015-12-03 11:20 ` Juan Quintela
2015-12-04  1:43   ` Li, Liang Z
2015-12-17  6:07     ` Amit Shah

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).