Date: Thu, 26 Mar 2015 12:57:38 +0100
From: "Michael S. Tsirkin"
Message-ID: <20150326125708-mutt-send-email-mst@redhat.com>
References: <55128084.2040304@huawei.com> <87a8z12yot.fsf@neno.neno>
 <5513793B.6020909@cn.fujitsu.com> <874mp8xd9k.fsf@neno.neno>
In-Reply-To: <874mp8xd9k.fsf@neno.neno>
Subject: Re: [Qemu-devel] [Migration Bug?] Occasionally, the content of VM's
 memory is inconsistent between Source and Destination of migration
To: Juan Quintela
Cc: Kevin Wolf, hangaohuai@huawei.com, zhanghailiang, Li Zhijian,
 qemu-devel@nongnu.org, "Dr. David Alan Gilbert (git)", "Gonglei (Arei)",
 Stefan Hajnoczi, Amit Shah, peter.huangpeng@huawei.com,
 david@gibson.dropbear.id.au

On Thu, Mar 26, 2015 at 11:29:43AM +0100, Juan Quintela wrote:
> Wen Congyang wrote:
> > On 03/25/2015 05:50 PM, Juan Quintela wrote:
> >> zhanghailiang wrote:
> >>> Hi all,
> >>>
> >>> We found that, sometimes, the content of the VM's memory is
> >>> inconsistent between the Source side and the Destination side when we
> >>> check it just after finishing migration but before the VM continues
> >>> to run.
> >>>
> >>> We use a patch like the one below to find this issue; you can find it
> >>> in the attachment. Steps to reproduce:
> >>>
> >>> (1) Compile QEMU:
> >>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
> >>>
> >>> (2) Command and output:
> >>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0 -device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio
> >>
> >> Could you try to reproduce:
> >> - without vhost
> >> - without virtio-net
> >> - cache=unsafe is going to give you trouble, but trouble should only
> >>   happen after the migration of pages has finished.
> >
> > If I use an IDE disk, it doesn't happen.
> > Even if I use virtio-net with vhost=on, it still doesn't happen. I
> > guess that is because I migrate the guest while it is booting; the
> > virtio net device is not used in this case.
>
> Kevin, Stefan, Michael, any great ideas?
>
> Thanks, Juan.

If this is during boot from disk, we can more or less rule out
virtio-net/vhost-net.

> >
> > Thanks
> > Wen Congyang
> >
> >>
> >> What kind of load were you running when reproducing this issue?
> >> Just to confirm, you have been able to reproduce this without the
> >> COLO patches, right?
> >>
> >>> (qemu) migrate tcp:192.168.3.8:3004
> >>> before saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>> md_host : after saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>>
> >>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio -incoming tcp:0:3004
> >>> (qemu) QEMU_VM_SECTION_END, after loading ram
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after loading all vmstate
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after cpu_synchronize_all_post_init
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>>
> >>> This happens occasionally, and it is easier to reproduce when the
> >>> migration command is issued during the VM's startup.
> >>
> >> OK, a couple of things. Memory doesn't have to be exactly identical.
> >> Virtio devices in particular do funny things on "post-load". There
> >> are no guarantees for that as far as I know; we should end up with an
> >> equivalent device state in memory.
> >>
> >>> We have done further tests and found that some pages have been
> >>> dirtied but their corresponding migration_bitmap bits are not set.
> >>> We can't figure out which module of QEMU misses setting the bitmap
> >>> when a page of the VM is dirtied; it is very difficult for us to
> >>> trace all the actions that dirty the VM's pages.
> >>
> >> This seems to point to a bug in one of the devices.
> >>
> >>> Actually, the first time we found this problem was during COLO FT
> >>> development, and it triggered some strange issues in the VM which
> >>> all pointed to inconsistency of the VM's memory. (We have tried
> >>> saving all of the VM's memory to the slave side every time we do a
> >>> checkpoint in COLO FT, and then everything is OK.)
> >>>
> >>> Is it OK for some pages not to be transferred to the destination
> >>> when doing migration? Or is it a bug?
> >>
> >> Pages transferred should be the same; it is after device state
> >> transmission that things could change.
> >>
> >>> This issue has blocked our COLO development... :(
> >>>
> >>> Any help will be greatly appreciated!
> >>
> >> Later, Juan.
> >>
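
For reference, a minimal, self-contained sketch of the RAM-checksum idea
zhanghailiang describes above: hash guest memory at fixed points on both
sides and compare the digests. This is not the attached patch; the buffer
and function names below are illustrative stand-ins rather than QEMU's
actual migration hooks, and the only dependency is glib, which QEMU already
links against.

/*
 * ram-md5-demo.c - illustrative only, not the patch from the attachment.
 * Build: gcc ram-md5-demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>
#include <string.h>

/* Return the MD5 digest of a memory range as a newly allocated hex string. */
static gchar *ram_md5(const void *host_addr, gsize len)
{
    GChecksum *cs = g_checksum_new(G_CHECKSUM_MD5);
    g_checksum_update(cs, (const guchar *)host_addr, len);
    gchar *hex = g_strdup(g_checksum_get_string(cs));
    g_checksum_free(cs);
    return hex;
}

int main(void)
{
    /* Stand-in for guest RAM: two 4 KiB "pages". */
    guchar ram[2 * 4096];
    memset(ram, 0xab, sizeof(ram));

    /* Digest at the first checkpoint ("before saving ram complete"). */
    gchar *before = ram_md5(ram, sizeof(ram));

    /* A late guest write; if its dirty bit were lost, this byte would
     * never be re-sent, and the destination's digest would differ. */
    ram[4096 + 42] ^= 0xff;

    /* Digest at the second checkpoint ("after saving ram complete"). */
    gchar *after = ram_md5(ram, sizeof(ram));

    printf("md_host : before %s\n", before);
    printf("md_host : after  %s\n", after);
    printf("%s\n", strcmp(before, after) == 0 ? "consistent" : "INCONSISTENT");

    g_free(before);
    g_free(after);
    return 0;
}

Judging by the output above, the attached patch computes such a digest
before and after saving RAM on the source, and after loading RAM and after
cpu_synchronize_all_post_init on the destination; matching digests mean the
transferred pages agree, while a mismatch points at a page that changed
without being re-sent.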
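
To make the "dirtied but bitmap not set" hypothesis concrete: migration
tracks one dirty bit per guest page, so any code path that writes guest
memory without also setting that page's bit produces exactly the symptom
seen here, a page whose new contents are never re-sent. Schematically
(simplified; names and layout are illustrative, not QEMU's actual bitmap
code):

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS     12   /* 4 KiB target pages */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Mark the page containing guest address 'addr' as dirty.  Every path
 * that modifies guest memory must reach this (or its equivalent);
 * a path that skips it is the suspected bug. */
static void dirty_bitmap_set_page(unsigned long *bitmap, uint64_t addr)
{
    uint64_t page = addr >> PAGE_BITS;
    bitmap[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

/* The migration loop only re-sends pages whose bit tests true. */
static bool dirty_bitmap_test_page(const unsigned long *bitmap, uint64_t addr)
{
    uint64_t page = addr >> PAGE_BITS;
    return bitmap[page / BITS_PER_LONG] & (1UL << (page % BITS_PER_LONG));
}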