Message-ID: <5515FF22.8040608@huawei.com>
Date: Sat, 28 Mar 2015 09:08:50 +0800
From: zhanghailiang
To: quintela@redhat.com
Cc: hangaohuai@huawei.com, Li Zhijian, peter.huangpeng@huawei.com,
    qemu-devel@nongnu.org, "Gonglei (Arei)", Amit Shah,
    "Dr. David Alan Gilbert (git)", david@gibson.dropbear.id.au
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
In-Reply-To: <87k2y2vhlf.fsf@neno.neno>
References: <55128084.2040304@huawei.com> <87a8z12yot.fsf@neno.neno>
    <5513793B.6020909@cn.fujitsu.com> <5513826D.2010505@cn.fujitsu.com>
    <55152D5B.1090906@huawei.com> <87k2y2vhlf.fsf@neno.neno>

On 2015/3/27 18:51, Juan Quintela wrote:
> zhanghailiang wrote:
>> On 2015/3/26 11:52, Li Zhijian wrote:
>>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>>> zhanghailiang wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We have found that, sometimes, the content of the VM's memory is
>>>>>> inconsistent between the source side and the destination side
>>>>>> when we check it just after migration finishes but before the VM continues to run.
>>>>>>
>>>>>> We used a patch like the one below to find this issue; you can find it in the attachment.
>>>>>> Steps to reproduce:
>>>>>>
>>>>>> (1) Compile QEMU:
>>>>>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>>>
>>>>>> (2) Command and output:
>>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>>> -device
>>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>>> -monitor stdio
>>>>> Could you try to reproduce:
>>>>> - without vhost
>>>>> - without virtio-net
>>>>> - cache=unsafe is going to give you trouble, but trouble should only
>>>>>   happen after the migration of pages has finished.
>>>> If I use an IDE disk, it doesn't happen.
>>>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>>>> that is because I migrate the guest while it is booting, so the virtio-net
>>>> device is not used in this case.
>>> Er~~
>>> it does reproduce here with an IDE disk;
>>> there is no virtio device at all, my command line is like below:
>>>
>>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
>>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
>>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>>
>>> it seems easy to reproduce this issue by following these steps in an _ubuntu_ guest:
>>> 1. on the source side, choose memtest in grub
>>> 2. do live migration
>>> 3. exit memtest (press Esc while the memory test is running)
>>> 4. wait for migration to complete
>>>
>>
>> Yes, it is a thorny problem. It is indeed easy to reproduce, just as
>> your steps above describe.
>
> Thanks for the test case. I will give it a try on Monday. Now that
> we have a test case, it should be possible to instrument things. As the
> problem shows up in memtest, it clearly can't be the disk :p

OK, thanks.

>
>>
>> This is my test result: (I also tested with accel=tcg; it can be reproduced there as well.)
>> Source side:
>> # x86_64-softmmu/qemu-system-x86_64 -machine
>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
>> qemu64,-kvmclock -boot c -drive
>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
>> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
>> (qemu) ACPI_BUILD: init ACPI tables
>> ACPI_BUILD: init ACPI tables
>> migrate tcp:9.61.1.8:3004
>> ACPI_BUILD: init ACPI tables
>> before cpu_synchronize_all_states
>> 5a8f72d66732cac80d6a0d5713654c0e
>> md_host : before saving ram complete
>> 5a8f72d66732cac80d6a0d5713654c0e
>> md_host : after saving ram complete
>> 5a8f72d66732cac80d6a0d5713654c0e
>> (qemu)
>>
>> Destination side:
>> # x86_64-softmmu/qemu-system-x86_64 -machine
>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
>> qemu64,-kvmclock -boot c -drive
>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
>> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
>> -incoming tcp:0:3004
>> (qemu) QEMU_VM_SECTION_END, after loading ram
>> d7cb0d8a4bdd1557fb0e78baee50c986
>> md_host : after loading all vmstate
>> d7cb0d8a4bdd1557fb0e78baee50c986
>> md_host : after cpu_synchronize_all_post_init
>> d7cb0d8a4bdd1557fb0e78baee50c986
>>
>>
>> Thanks,
>> zhang
>>
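The "md_host :" lines above come from the debug patch mentioned at the top of the thread, which was sent as an attachment rather than inline. A minimal standalone sketch of that kind of instrumentation is below; the names (md_host_dump, RamRegion) are hypothetical, a fake buffer stands in for QEMU's real RAM block list, and MD5 comes from OpenSSL, which the --extra-ldflags="-lssl" configure flag above hints at.

/*
 * Sketch only: hash a set of memory regions and print the digest at a
 * named checkpoint. In the real patch the regions would be the guest
 * RAM blocks (block->host / block length) instead of fake_ram.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <openssl/md5.h>

typedef struct {
    const void *host;   /* host address of the region */
    size_t      len;    /* region length in bytes */
} RamRegion;

static void md_host_dump(const char *checkpoint,
                         const RamRegion *regions, size_t nregions)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_CTX ctx;
    size_t i;

    MD5_Init(&ctx);
    for (i = 0; i < nregions; i++) {
        MD5_Update(&ctx, regions[i].host, regions[i].len);
    }
    MD5_Final(digest, &ctx);

    printf("md_host : %s\n", checkpoint);
    for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
        printf("%02x", digest[i]);
    }
    printf("\n");
}

int main(void)
{
    static uint8_t fake_ram[4096];      /* stand-in for guest RAM */
    RamRegion regions[] = { { fake_ram, sizeof(fake_ram) } };

    md_host_dump("before saving ram complete", regions, 1);
    fake_ram[42] = 1;                   /* dirty one byte */
    md_host_dump("after saving ram complete", regions, 1);
    return 0;
}

Something like "gcc md_host_sketch.c -lcrypto" builds it. In the actual patch the helper would be called at the points named in the log above: before/after saving RAM on the source, and after loading the vmstate and after cpu_synchronize_all_post_init on the destination.
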
>>>>
>>>>> What kind of load were you running when reproducing this issue?
>>>>> Just to confirm, you have been able to reproduce this without COLO
>>>>> patches, right?
>>>>>
>>>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>>>> before saving ram complete
>>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>> md_host : after saving ram complete
>>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>>
>>>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
>>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>>> -device
>>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>>> -monitor stdio -incoming tcp:0:3004
>>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>> md_host : after loading all vmstate
>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>> md_host : after cpu_synchronize_all_post_init
>>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>>
>>>>>> This happens occasionally, and it is easier to reproduce when the
>>>>>> migrate command is issued during the VM's startup.
>>>>> OK, a couple of things. Memory doesn't have to be exactly identical.
>>>>> Virtio devices in particular do funny things on "post-load". There
>>>>> are no guarantees for that as far as I know; we should end up with an
>>>>> equivalent device state in memory.
>>>>>
>>>>>> We have done further tests and found that some pages have been
>>>>>> dirtied but their corresponding migration_bitmap bits are not set.
>>>>>> We can't figure out which module of QEMU misses setting the
>>>>>> bitmap when dirtying the VM's pages;
>>>>>> it is very difficult for us to trace all the actions that dirty the VM's pages.
>>>>> This seems to point to a bug in one of the devices.
>>>>>
>>>>>> Actually, the first time we found this problem was during COLO FT
>>>>>> development, where it triggered some strange issues in the
>>>>>> VM which all pointed to inconsistency of the VM's
>>>>>> memory. (We have tried saving all of the VM's memory to the slave side every
>>>>>> time we do a checkpoint in COLO FT, and then everything is OK.)
>>>>>>
>>>>>> Is it OK for some pages not to be transferred to the destination
>>>>>> during migration? Or is it a bug?
>>>>> Pages transferred should be the same; it is after device state transmission
>>>>> that things could change.
>>>>>
>>>>>> This issue has blocked our COLO development... :(
>>>>>>
>>>>>> Any help will be greatly appreciated!
>>>>> Later, Juan.
>>>>>
>>>> .
>>>>
>>>
>>>
>
> .
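
Following up on the point about instrumenting things: one way to catch pages that are written without their migration_bitmap bit being set would be to snapshot a per-page hash when the dirty bitmap is synchronized and re-hash at migration completion; any page whose content changed while its bit stayed clear is a candidate for the missing dirty tracking. The sketch below is self-contained and illustrative only: a plain array stands in for guest RAM and for the bitmap, and the simple FNV-1a hash is just a placeholder.

/*
 * Sketch only: detect pages that changed while their "dirty" bit stayed
 * clear. The real check would hook into QEMU's RAM migration code and
 * compare against the actual migration bitmap.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define NR_PAGES  8

static uint64_t page_hash(const uint8_t *page)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit */
    size_t i;
    for (i = 0; i < PAGE_SIZE; i++) {
        h = (h ^ page[i]) * 0x100000001b3ULL;
    }
    return h;
}

int main(void)
{
    static uint8_t ram[NR_PAGES * PAGE_SIZE]; /* stand-in for guest RAM */
    uint64_t snap[NR_PAGES];                  /* per-page hash at bitmap sync */
    bool dirty[NR_PAGES] = { false };         /* stand-in for migration_bitmap */
    int p;

    /* 1. When the dirty bitmap is synchronized, snapshot every page's hash. */
    for (p = 0; p < NR_PAGES; p++) {
        snap[p] = page_hash(&ram[p * PAGE_SIZE]);
    }

    /* ... guest keeps running; suppose page 3 is written but never marked dirty ... */
    ram[3 * PAGE_SIZE + 100] = 0xff;

    /* 2. At migration completion, a changed page with a clear bit means the
     *    write was never tracked. */
    for (p = 0; p < NR_PAGES; p++) {
        if (!dirty[p] && page_hash(&ram[p * PAGE_SIZE]) != snap[p]) {
            printf("page %d changed but its migration bitmap bit is clear\n", p);
        }
    }
    return 0;
}

Hooked into QEMU itself, the same comparison would run over the RAM block list near the end of the RAM save path, against the real migration bitmap, which should help narrow down which device or code path dirties memory without marking it.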