From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5513793B.6020909@cn.fujitsu.com>
Date: Thu, 26 Mar 2015 11:12:59 +0800
From: Wen Congyang
To: quintela@redhat.com, zhanghailiang
Cc: hangaohuai@huawei.com, Li Zhijian, qemu-devel@nongnu.org, peter.huangpeng@huawei.com, "Gonglei (Arei)", Amit Shah, "Dr. David Alan Gilbert (git)", david@gibson.dropbear.id.au
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
References: <55128084.2040304@huawei.com> <87a8z12yot.fsf@neno.neno>
In-Reply-To: <87a8z12yot.fsf@neno.neno>
MIME-Version: 1.0
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit

On 03/25/2015 05:50 PM, Juan Quintela wrote:
> zhanghailiang wrote:
>> Hi all,
>>
>> We found that, sometimes, the content of the VM's memory is inconsistent between the source and destination sides
>> when we check it just after finishing migration but before the VM continues to run.
>>
>> We use a patch like the one below to find this issue (you can find it in the attachment). Steps to reproduce:
>>
>> (1) Compile QEMU:
>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>
>> (2) Command and output:
>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0 -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio
>
> Could you try to reproduce:
> - without vhost
> - without virtio-net
> - cache=unsafe is going to give you trouble, but trouble should only
>   happen after migration of pages has finished.

If I use an IDE disk, it doesn't happen.
Even if I use virtio-net with vhost=on, it still doesn't happen. I guess that is because I migrate the guest while it is booting, so the virtio-net device is not used in this case.

Thanks
Wen Congyang

> What kind of load were you having when reproducing this issue?
> Just to confirm, you have been able to reproduce this without COLO
> patches, right?
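For context, the md_host digest lines in the quoted output below come from a debugging hook of roughly the following shape. This is only a minimal standalone sketch, not the actual patch from the attachment: it assumes OpenSSL's MD5 API (hence the --extra-ldflags="-lssl" in the configure line above), and the block array passed in is just an illustrative stand-in for QEMU's real list of RAM blocks.

/*
 * Hash all guest RAM and print the digest at a known point in the
 * migration, e.g. before/after saving RAM on the source and after
 * loading RAM / device state on the destination.  Matching digests on
 * both sides mean the RAM contents were identical at that point.
 */
#include <stdio.h>
#include <openssl/md5.h>

static void md_host(const char *tag, int nblocks,
                    unsigned char *host[], size_t len[])
{
    MD5_CTX ctx;
    unsigned char digest[MD5_DIGEST_LENGTH];
    int i, j;

    MD5_Init(&ctx);
    for (i = 0; i < nblocks; i++) {
        MD5_Update(&ctx, host[i], len[i]);      /* hash each RAM block */
    }
    MD5_Final(digest, &ctx);

    printf("md_host : %s\n", tag);
    for (j = 0; j < MD5_DIGEST_LENGTH; j++) {
        printf("%02x", digest[j]);              /* 32 hex chars, as in the log */
    }
    printf("\n");
}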
>
>> (qemu) migrate tcp:192.168.3.8:3004
>> before saving ram complete
>> ff703f6889ab8701e4e040872d079a28
>> md_host : after saving ram complete
>> ff703f6889ab8701e4e040872d079a28
>>
>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -incoming tcp:0:3004
>> (qemu) QEMU_VM_SECTION_END, after loading ram
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>> md_host : after loading all vmstate
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>> md_host : after cpu_synchronize_all_post_init
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>
>> This happens occasionally, and it is easier to reproduce when the migration command is issued during the VM's startup.
>
> OK, a couple of things. Memory doesn't have to be exactly identical.
> Virtio devices in particular do funny things on "post-load". There
> are no guarantees for that as far as I know; we should end up with an
> equivalent device state in memory.
>
>> We have done further tests and found that some pages have been dirtied but their corresponding migration_bitmap bits are not set.
>> We can't figure out which module of QEMU misses setting the bitmap when dirtying the VM's pages;
>> it is very difficult for us to trace all the actions that dirty the VM's pages.
>
> This seems to point to a bug in one of the devices.
>
>> Actually, the first time we found this problem was during COLO FT development, and it triggered some strange issues in the
>> VM which all pointed to inconsistent VM memory. (If we save all of the VM's memory to the slave side every time
>> we do a checkpoint in COLO FT, everything is OK.)
>>
>> Is it OK for some pages not to be transferred to the destination during migration? Or is it a bug?
>
> Pages transferred should be the same; after device state transmission is
> when things could change.
>
>> This issue has blocked our COLO development... :(
>>
>> Any help will be greatly appreciated!
>
> Later, Juan.
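To make the dirty-bitmap part of the discussion concrete: conceptually, the migration bitmap works like the sketch below. This is a simplified illustration, not QEMU's actual migration_bitmap code; the page size, guest size and helper names are made up for the example. There is one bit per guest page, set by whatever code writes guest memory, then tested and cleared by the migration thread when it re-sends the page. A path that writes guest memory without setting its bit would leave a page dirty on the source but never re-transferred, which matches the symptom described above.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT     12                   /* 4 KiB pages */
#define GUEST_RAM      (2048ULL << 20)      /* 2 GB guest, as in the test */
#define GUEST_PAGES    (GUEST_RAM >> PAGE_SHIFT)
#define BITS_PER_LONG  (8 * sizeof(unsigned long))

static unsigned long dirty_bitmap[GUEST_PAGES / BITS_PER_LONG + 1];

/* Every writer of guest memory is supposed to mark the page dirty... */
static void mark_dirty(uint64_t addr)
{
    uint64_t page = addr >> PAGE_SHIFT;

    dirty_bitmap[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

/* ...and the migration thread re-sends a page iff its bit was set. */
static bool test_and_clear_dirty(uint64_t page)
{
    unsigned long mask = 1UL << (page % BITS_PER_LONG);
    bool was_dirty = dirty_bitmap[page / BITS_PER_LONG] & mask;

    dirty_bitmap[page / BITS_PER_LONG] &= ~mask;
    return was_dirty;
}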