From: zhanghailiang <zhang.zhanghailiang@huawei.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: hangaohuai@huawei.com, Li Zhijian <lizhijian@cn.fujitsu.com>,
quintela@redhat.com, qemu-devel@nongnu.org,
peter.huangpeng@huawei.com,
"Gonglei (Arei)" <arei.gonglei@huawei.com>,
Amit Shah <amit.shah@redhat.com>,
david@gibson.dropbear.id.au
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Thu, 2 Apr 2015 19:52:25 +0800 [thread overview]
Message-ID: <551D2D79.90905@huawei.com> (raw)
In-Reply-To: <20150331190634.GC7687@work-vm>
On 2015/4/1 3:06, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> On 2015/3/30 15:59, Dr. David Alan Gilbert wrote:
>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>>> On 2015/3/27 18:18, Dr. David Alan Gilbert wrote:
>>>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>>>>> On 2015/3/26 11:52, Li Zhijian wrote:
>>>>>>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>>>>>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>>>>>>> zhanghailiang<zhang.zhanghailiang@huawei.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We found that, sometimes, the content of VM's memory is inconsistent between Source side and Destination side
>>>>>>>>>> when we check it just after finishing migration but before VM continue to Run.
>>>>>>>>>>
>>>>>>>>>> We use a patch like bellow to find this issue, you can find it from affix,
>>>>>>>>>> and Steps to reprduce:
>>>>>>>>>>
>>>>>>>>>> (1) Compile QEMU:
>>>>>>>>>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>>>>>>>
>>>>>>>>>> (2) Command and output:
>>>>>>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0-device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio
>>>>>>>>> Could you try to reproduce:
>>>>>>>>> - without vhost
>>>>>>>>> - without virtio-net
>>>>>>>>> - cache=unsafe is going to give you trouble, but trouble should only
>>>>>>>>> happen after migration of pages have finished.
>>>>>>>> If I use ide disk, it doesn't happen.
>>>>>>>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>>>>>>>> it is because I migrate the guest when it is booting. The virtio net
>>>>>>>> device is not used in this case.
>>>>>>> Er??????
>>>>>>> it reproduces in my ide disk
>>>>>>> there is no any virtio device, my command line like below
>>>>>>>
>>>>>>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
>>>>>>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
>>>>>>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>>>>>>
>>>>>>> it seems easily to reproduce this issue by following steps in _ubuntu_ guest
>>>>>>> 1. in source side, choose memtest in grub
>>>>>>> 2. do live migration
>>>>>>> 3. exit memtest(type Esc in when memory testing)
>>>>>>> 4. wait migration complete
>>>>>>>
>>>>>>
>>>>>> Yes???it is a thorny problem. It is indeed easy to reproduce, just as
>>>>>> your steps in the above.
>>>>>>
>>>>>> This is my test result: (I also test accel=tcg, it can be reproduced also.)
>>>>>> Source side:
>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu qemu64,-kvmclock -boot c -drive file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
>>>>>> (qemu) ACPI_BUILD: init ACPI tables
>>>>>> ACPI_BUILD: init ACPI tables
>>>>>> migrate tcp:9.61.1.8:3004
>>>>>> ACPI_BUILD: init ACPI tables
>>>>>> before cpu_synchronize_all_states
>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>> md_host : before saving ram complete
>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>> md_host : after saving ram complete
>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>> (qemu)
>>>>>>
>>>>>> Destination side:
>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu qemu64,-kvmclock -boot c -drive file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio -incoming tcp:0:3004
>>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>> md_host : after loading all vmstate
>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>> md_host : after cpu_synchronize_all_post_init
>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>
>>>>> Hmm, that's not good. I suggest you md5 each of the RAMBlock's individually;
>>>>> to see if it's main RAM that's different or something more subtle like
>>>>> video RAM.
>>>>>
>>>>
>>>> Er, all my previous tests are md5 'pc.ram' block only.
>>>>
>>>>> But then maybe it's easier just to dump the whole of RAM to file
>>>>> and byte compare it (hexdump the two dumps and diff ?)
>>>>
>>>> Hmm, we also used memcmp function to compare every page, but the addresses
>>>> seem to be random.
>>>>
>>>> Besides, in our previous test, we found it seems to be more easy to reproduce
>>>> when migration occurs during VM's start-up or reboot process.
>>>>
>>>> Is there any possible that some devices have special treatment when VM start-up
>>>> which may miss setting dirty-bitmap ?
>>>
>>> I don't think there should be, but the code paths used during startup are
>>> probably much less tested with migration. I'm sure the startup code
>>> uses different part of device emulation. I do know we have some bugs
>>
>> Er, Maybe there is a special case:
>>
>> During VM's start-up, i found that the KVMslot changed many times, it was a process of
>> smashing total memory space into small slot.
>>
>> If some pages was dirtied and its dirty-bitmap has been set in KVM module,
>> but we didn't sync the bitmaps to QEMU user-space before this slot was smashed,
>> with its previous bitmap been destroyed.
>> The bitmap of dirty pages in the previous KVMslot maybe be missed.
>>
>> What's your opinion? Can this situation i described in the above happen?
>>
>> The bellow log was grabbed, when i tried to figure out a quite same question (some pages miss dirty-bitmap setting) we found in COLO:
>> Occasionally, there will be an error report in SLAVE side:
>>
>> qemu: warning: error while loading state for instance 0x0 of device
>> 'kvm-tpr-opt' '
>> qemu-system-x86_64: loadvm failed
>>
>> We found that it related to three address (gpa: 0xca000,0xcb000,0xcc000, which are the address of 'kvmvapic.rom ?'), and sometimes
>> their corresponding dirty-map will be missed in Master side, because their KVMSlot is destroyed before we sync its dirty-bitmap to qemu.
>>
>> (I'm still not quite sure if this can also happen in common migration, i will try to test it in normal migration)
>
Hi,
We have found two bugs (places) that miss setting migration-bitmap of dirty pages,
The virtio-blk related can be fixed by patch of Wen Congyang, you can find his reply in the list.
And the 'kvm-tpr-opt' related can be fixed by the follow patch.
Thanks,
zhang
>From 0c63687d0f14f928d6eb4903022a7981db6ba59f Mon Sep 17 00:00:00 2001
From: zhanghailiang <zhang.zhanghailiang@huawei.com>
Date: Thu, 2 Apr 2015 19:26:31 +0000
Subject: [PATCH] kvm-all: Sync dirty-bitmap from kvm before kvm destroy the
corresponding dirty_bitmap
Sometimes, we destroy the dirty_bitmap in kvm_memory_slot before any sync action
occur, this bit in dirty_bitmap will be missed, and which will lead the corresponding
dirty pages to be missed in migration.
This usually happens when do migration during VM's Start-up or Reboot.
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
exec.c | 2 +-
kvm-all.c | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/exec.c b/exec.c
index 874ecfc..4b1b39b 100644
--- a/exec.c
+++ b/exec.c
@@ -59,7 +59,7 @@
//#define DEBUG_SUBPAGE
#if !defined(CONFIG_USER_ONLY)
-static bool in_migration;
+bool in_migration;
/* ram_list is read under rcu_read_lock()/rcu_read_unlock(). Writes
* are protected by the ramlist lock.
diff --git a/kvm-all.c b/kvm-all.c
index 335438a..dd75eff 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -128,6 +128,8 @@ bool kvm_allowed;
bool kvm_readonly_mem_allowed;
bool kvm_vm_attributes_allowed;
+extern bool in_migration;
+
static const KVMCapabilityInfo kvm_required_capabilites[] = {
KVM_CAP_INFO(USER_MEMORY),
KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
@@ -715,7 +717,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add)
old = *mem;
- if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES || in_migration) {
kvm_physical_sync_dirty_bitmap(section);
}
--
> Ah, kvm-tpr-opt.
> I also hit this today with colo; it didn't happen on a new machine, but it happens on an older machine, or in
> this case a nested setup.
>
> -global kvm-apic.vapic=false
>
> Turns the vapic off, and seems to have got me further.
>
> I don't understand the details of kvm-tpr-opt; but it's a hack for speeding up access
> to the 'tpr' (Task priority register) which apparently Windows does a lot and used
> to slow things down; the hack does involve modifying some stuff on the fly, and it
> doesn't happen if the host has some new features (tpr shadow I think also look for flexpriority).
>
> Dave
>>
>> Thanks,
>> zhanghailiang
>>
>> /var/log/message:
>> Mar 31 18:32:53 master kernel: [15908.524721] memslot dirty bitmap of '4' was destroyed, base_gfn 0x100, npages 524032
>> Mar 31 18:32:53 master kernel: [15908.657853] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.665105] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.672360] +++ gfn=0xca is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.682058] memslot dirty bitmap of '4' was destroyed, base_gfn 0xfebc0, npages 16
>> Mar 31 18:32:53 master kernel: [15908.849527] +++ gfn=0xca is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.856845] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.864161] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.872676] memslot dirty bitmap of '4' was destroyed, base_gfn 0xfeb80, npages 64
>> Mar 31 18:32:53 master kernel: [15908.882694] +++ gfn=0xca is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.889948] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.897202] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.911652] +++ gfn=0xca is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.919543] +++ gfn=0xca is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.927419] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.935317] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.943209] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.951083] +++ gfn=0xcb is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.958971] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.966837] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.974707] +++ gfn=0xcc is marked dirty, id=2, base_gfn=0xc0, npages=524096
>> Mar 31 18:32:53 master kernel: [15908.988470] memslot dirty bitmap of '2' was destroyed, base_gfn 0xc0, npages 524096
>> Mar 31 18:32:53 master kernel: [15909.002403] kvm: zapping shadow pages for mmio generation wraparound
>> Mar 31 18:32:53 master kernel: [15909.002523] memslot dirty bitmap of '2' was destroyed, base_gfn 0xc0, npages 8
>> Mar 31 18:32:53 master kernel: [15909.010110] memslot dirty bitmap of '4' was destroyed, base_gfn 0xc8, npages 524088
>> Mar 31 18:32:53 master kernel: [15909.023988] memslot dirty bitmap of '5' was destroyed, base_gfn 0xcd, npages 3
>> Mar 31 18:32:53 master kernel: [15909.031594] memslot dirty bitmap of '6' was destroyed, base_gfn 0xd0, npages 524080
>> Mar 31 18:32:53 master kernel: [15909.044708] memslot dirty bitmap of '5' was destroyed, base_gfn 0xcd, npages 11
>> Mar 31 18:32:53 master kernel: [15909.052392] memslot dirty bitmap of '6' was destroyed, base_gfn 0xd8, npages 524072
>> Mar 31 18:32:53 master kernel: [15909.065651] memslot dirty bitmap of '5' was destroyed, base_gfn 0xcd, npages 19
>> Mar 31 18:32:53 master kernel: [15909.073329] memslot dirty bitmap of '6' was destroyed, base_gfn 0xe0, npages 524064
>> Mar 31 18:32:53 master kernel: [15909.086379] memslot dirty bitmap of '5' was destroyed, base_gfn 0xcd, npages 27
>> Mar 31 18:32:53 master kernel: [15909.094084] memslot dirty bitmap of '6' was destroyed, base_gfn 0xe8, npages 524056
>> Mar 31 18:32:53 master kernel: [15909.107354] memslot dirty bitmap of '6' was destroyed, base_gfn 0xec, npages 524052
>> Mar 31 18:32:54 master dhcpcd[5408]: eth0: timed out
>> Mar 31 18:32:54 master dhcpcd[5408]: eth0: trying to use old lease in `/var/lib/dhcpcd/dhcpcd-eth0.info'
>> Mar 31 18:32:54 master dhcpcd[5408]: eth0: lease expired 30265 seconds ago
>> Mar 31 18:32:54 master dhcpcd[5408]: eth0: broadcasting for a lease
>> Mar 31 18:33:02 master qemu-system-x86_64: ====qemu: The 0 times do checkpoing===
>> Mar 31 18:33:02 master qemu-system-x86_64: == migration_bitmap_sync ==
>> Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xca000 (hva=0x7f4fa32ca000), who's bitmap not set ---
>> Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xcb000 (hva=0x7f4fa32cb000), who's bitmap not set ---
>> Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xcc000 (hva=0x7f4fa32cc000), who's bitmap not set ---
>> Mar 31 18:33:05 master kernel: [15921.057246] device eth2 left promiscuous mode
>> Mar 31 18:33:07 master avahi-daemon[5773]: Withdrawing address record for fe80::90be:cfff:fe9e:f03e on tap0.
>> Mar 31 18:33:07 master kernel: [15922.513313] br0: port 2(tap0) entered disabled state
>> Mar 31 18:33:07 master kernel: [15922.513480] device tap0 left promiscuous mode
>> Mar 31 18:33:07 master kernel: [15922.513508] br0: port 2(tap0) entered disabled state
>> Mar 31 18:33:07 master kernel: [15922.562591] memslot dirty bitmap of '8' was destroyed, base_gfn 0x100, npages 524032
>> Mar 31 18:33:07 master kernel: [15922.570948] memslot dirty bitmap of '3' was destroyed, base_gfn 0xfd000, npages 4096
>> Mar 31 18:33:07 master kernel: [15922.578967] memslot dirty bitmap of '0' was destroyed, base_gfn 0x0, npages 160
>> Mar 31 18:33:07 master kernel: [15922.586556] memslot dirty bitmap of '1' was destroyed, base_gfn 0xfffc0, npages 64
>> Mar 31 18:33:07 master kernel: [15922.586574] memslot dirty bitmap of '5' was destroyed, base_gfn 0xcd, npages 31
>> Mar 31 18:33:07 master kernel: [15922.586575] memslot dirty bitmap of '7' was destroyed, base_gfn 0xf0, npages 16
>> Mar 31 18:33:07 master kernel: [15922.586577] memslot dirty bitmap of '2' was destroyed, base_gfn 0xc0, npages 10
>> Mar 31 18:33:07 master kernel: [15922.586578] memslot dirty bitmap of '6' was destroyed, base_gfn 0xec, npages 4
>> Mar 31 18:33:07 master kernel: [15922.586579] memslot dirty bitmap of '4' was destroyed, base_gfn 0xca, npages 3
>>
>> PS:
>> QEMU:
>> static void migration_bitmap_sync(void)
>> {
>> trace_migration_bitmap_sync_end(migration_dirty_pages
>> - num_dirty_pages_init);
>> num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
>> end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> + syslog(LOG_INFO, "== migration_bitmap_sync ==");
>> + check_bitmap(();
>>
>> +void check_bitmap(void)
>> +{
>> + int i;
>> + char *host;
>> + ram_addr_t addr[3] = { 0xca000, 0xcb000, 0xcc000 };
>> + RAMBlock *block = NULL;
>> + int ret;
>> +
>> + block = QLIST_FIRST_RCU(&ram_list.blocks);
>> + for (i = 0; i < 3; i++) {
>> + unsigned long nr = block->mr->ram_addr + (addr[i] >> TARGET_PAGE_BITS);
>> + host = memory_region_get_ram_ptr(block->mr) + addr[i];
>> + ret = test_bit(nr, migration_bitmap);
>> + if (ret == 0) {
>> + syslog(LOG_INFO, "qemu: --- addr=0x%llx (hva=%p), who's bitmap not set ---\n",
>> + addr[i], host);
>> + } else {
>> + syslog(LOG_INFO, "qemu: +++ OK, addr=0x%llx (hva=%p) , who's bitap is set +++\n",
>> + addr[i], host);;
>> + }
>> + }
>> +}
>>
>> KVM???
>> static void mark_page_dirty_in_slot(struct kvm *kvm,
>> struct kvm_memory_slot *memslot,
>> gfn_t gfn)
>> {
>> + if ((gfn == 0xca || gfn == 0xcb || gfn == 0xcc) && !memslot->dirty_bitmap) {
>> + printk("oops, dirty_bitmap is null, gfn=0x%llx\n",gfn);
>> + }
>> if (memslot && memslot->dirty_bitmap) {
>> unsigned long rel_gfn = gfn - memslot->base_gfn;
>> + if (gfn == 0xca || gfn == 0xcb || gfn == 0xcc) {
>> + printk("+++ gfn=0x%llx is marked dirty, id=%d, base_gfn=0x%llx, npages=%d\n",
>> + gfn, memslot->id, memslot->base_gfn, memslot->npages);
>> + }
>> set_bit_le(rel_gfn, memslot->dirty_bitmap);
>> }
>> }
>>
>> static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>> {
>> if (!memslot->dirty_bitmap)
>> return;
>> + if (memslot->id != 9) {
>> + printk("memslot dirty bitmap of '%d' was destroyed, base_gfn 0x%llx, npages %lld\n",
>> + memslot->id, memslot->base_gfn, memslot->npages);
>> + }
>> kvm_kvfree(memslot->dirty_bitmap);
>> memslot->dirty_bitmap = NULL;
>> }
>>
>>
>>> filed against migration during windows boot, I'd not considered that it might
>>> be devices not updating the bitmap.
>>>
>>> Dave
>>>
>>>>
>>>>
>>>> Thanks,
>>>> zhanghailiang
>>>>
>>>>
>>>>>>>>
>>>>>>>>> What kind of load were you having when reproducing this issue?
>>>>>>>>> Just to confirm, you have been able to reproduce this without COLO
>>>>>>>>> patches, right?
>>>>>>>>>
>>>>>>>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>>>>>>>> before saving ram complete
>>>>>>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>>>>>> md_host : after saving ram complete
>>>>>>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>>>>>>
>>>>>>>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -incoming tcp:0:3004
>>>>>>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>>>>>> md_host : after loading all vmstate
>>>>>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>>>>>> md_host : after cpu_synchronize_all_post_init
>>>>>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>>>>>>
>>>>>>>>>> This happens occasionally, and it is more easy to reproduce when issue migration command during VM's startup time.
>>>>>>>>> OK, a couple of things. Memory don't have to be exactly identical.
>>>>>>>>> Virtio devices in particular do funny things on "post-load". There
>>>>>>>>> aren't warantees for that as far as I know, we should end with an
>>>>>>>>> equivalent device state in memory.
>>>>>>>>>
>>>>>>>>>> We have done further test and found that some pages has been dirtied but its corresponding migration_bitmap is not set.
>>>>>>>>>> We can't figure out which modules of QEMU has missed setting bitmap when dirty page of VM,
>>>>>>>>>> it is very difficult for us to trace all the actions of dirtying VM's pages.
>>>>>>>>> This seems to point to a bug in one of the devices.
>>>>>>>>>
>>>>>>>>>> Actually, the first time we found this problem was in the COLO FT development, and it triggered some strange issues in
>>>>>>>>>> VM which all pointed to the issue of inconsistent of VM's memory. (We have try to save all memory of VM to slave side every time
>>>>>>>>>> when do checkpoint in COLO FT, and everything will be OK.)
>>>>>>>>>>
>>>>>>>>>> Is it OK for some pages that not transferred to destination when do migration ? Or is it a bug?
>>>>>>>>> Pages transferred should be the same, after device state transmission is
>>>>>>>>> when things could change.
>>>>>>>>>
>>>>>>>>>> This issue has blocked our COLO development... :(
>>>>>>>>>>
>>>>>>>>>> Any help will be greatly appreciated!
>>>>>>>>> Later, Juan.
>>>>>>>>>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>> .
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>
next prev parent reply other threads:[~2015-04-02 11:53 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-25 9:31 [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration zhanghailiang
2015-03-25 9:46 ` Dr. David Alan Gilbert
2015-03-25 11:28 ` zhanghailiang
2015-03-25 11:36 ` Dr. David Alan Gilbert
2015-03-25 11:48 ` zhanghailiang
2015-03-25 9:50 ` Juan Quintela
2015-03-25 10:21 ` Wen Congyang
2015-03-25 13:12 ` Paolo Bonzini
2015-03-26 1:43 ` Wen Congyang
2015-03-25 11:32 ` zhanghailiang
2015-03-26 3:12 ` Wen Congyang
2015-03-26 3:52 ` Li Zhijian
2015-03-27 10:13 ` zhanghailiang
2015-03-27 10:18 ` Dr. David Alan Gilbert
2015-03-28 9:54 ` zhanghailiang
2015-03-30 7:59 ` Dr. David Alan Gilbert
2015-03-31 11:48 ` zhanghailiang
2015-03-31 19:06 ` Dr. David Alan Gilbert
2015-04-02 11:52 ` zhanghailiang [this message]
2015-04-02 13:00 ` Paolo Bonzini
2015-04-03 8:51 ` Jason Wang
2015-04-03 9:08 ` Wen Congyang
2015-04-03 9:20 ` zhanghailiang
2015-04-08 8:08 ` Jason Wang
2015-03-27 10:51 ` Juan Quintela
2015-03-28 1:08 ` zhanghailiang
2015-03-26 10:29 ` Juan Quintela
2015-03-26 11:57 ` Michael S. Tsirkin
2015-03-27 8:56 ` Stefan Hajnoczi
2015-03-27 9:14 ` Wen Congyang
2015-03-27 9:57 ` Stefan Hajnoczi
2015-03-27 10:05 ` Wen Congyang
2015-03-27 10:11 ` Stefan Hajnoczi
2015-03-27 10:36 ` Juan Quintela
2015-03-27 10:34 ` Juan Quintela
2015-03-31 7:54 ` Wen Congyang
2015-03-31 14:16 ` Stefan Hajnoczi
2015-04-02 9:14 ` Wen Congyang
2015-04-02 13:17 ` Paolo Bonzini
2015-04-03 1:29 ` Wen Congyang
2015-04-03 10:56 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=551D2D79.90905@huawei.com \
--to=zhang.zhanghailiang@huawei.com \
--cc=amit.shah@redhat.com \
--cc=arei.gonglei@huawei.com \
--cc=david@gibson.dropbear.id.au \
--cc=dgilbert@redhat.com \
--cc=hangaohuai@huawei.com \
--cc=lizhijian@cn.fujitsu.com \
--cc=peter.huangpeng@huawei.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).