From: Peter Lieven
Date: Thu, 09 Apr 2015 16:54:09 +0200
Subject: Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
To: "Dr. David Alan Gilbert"
Cc: Paolo Bonzini, John Snow, qemu-devel@nongnu.org, qemu-block@nongnu.org
Message-ID: <55269291.2000805@kamp.de>
In-Reply-To: <20150409134339.GE2292@work-vm>
References: <5522D4BD.7080805@kamp.de> <5522D57B.3000203@redhat.com>
 <5522D85D.20907@kamp.de> <5522DA21.1010702@kamp.de>
 <20150407084303.GA2298@work-vm> <5523F38A.4090305@kamp.de>
 <20150407152957.GB2287@work-vm> <55242597.1000700@kamp.de>
 <20150407190112.GD2287@work-vm> <55267553.5000506@kamp.de>
 <20150409134339.GE2292@work-vm>

On 09.04.2015 at 15:43, Dr. David Alan Gilbert wrote:
> * Peter Lieven (pl@kamp.de) wrote:
>> On 07.04.2015 at 21:01, Dr. David Alan Gilbert wrote:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> On 07.04.2015 at 17:29, Dr. David Alan Gilbert wrote:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 07.04.2015 at 10:43, Dr. David Alan Gilbert wrote:
>>>>>>>>>> Any particular workload or reproducer?
>>>>>>>>> Workload is almost zero. I am trying to figure out if there is a way to trigger it.
>>>>>>>>>
>>>>>>>>> Maybe playing a role: Machine type is -M pc-1.2 and we set -kvmclock as
>>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>>
>>>>>>>>> Exact cmdline is:
>>>>>>>>> /usr/bin/qemu-2.2.1 -enable-kvm -M pc-1.2 -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2 -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native -serial null -parallel null -m 1024 -smp 2,sockets=1,cores=2,threads=1 -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on -k de -incoming tcp:0:5003 -pidfile /var/run/qemu/vm-146.pid -mem-path /hugepages -mem-prealloc -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus -cpu qemu64,-kvmclock
>>>>>>>>>
>>>>>>>>> Exact kernel is:
>>>>>>>>> 2.6.16.46-0.12-smp (I think this is SLES10 or something similar)
>>>>>>>>>
>>>>>>>>> The machine does not hang. It seems only I/O is hanging. So you can type at the console or ping the system, but can no longer log in.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Peter
>>>>>>>> Interesting observation: Migrating the vServer again seems to fix the problem (at least in one case I could test just now).
>>>>>>>>
>>>>>>>> 2.6.8-24-smp is also affected.
>>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>>> It's more often than 1/10 I would say.
>>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>>> In your setup, how easy would it be for you to try :
>>>>>    with either 2.1 or current head?
>>>>>    with a newer machine-type?
>>>>>    without the cdrom?
>>>> It's all possible. I can clone the system and try everything on my test systems. I hope
>>>> it reproduces there.
>>> Great.
>>> I think the order I would go would be:
>>>    Try head - if it works we know we've already got the fix somewhere
>>>    Try 2.1 - if it works we know it's something we introduced between
>>>              2.1 and 2.2.1
>>>    Try a newer machine type - because pc-1.2 probably isn't tested much
>>>    CDROM at the end.
>> Update:
>>  - head -> not working
>>  - 2.1.3 -> not working
>>  - without CDROM -> not working
>>  - with head and no machine type specified -> not working
>>  - with -device isa-ide -> BIOS not booting harddisk
> Well, at least it's consistent....
>
>> Will now try 1.3.1 just to be sure.
>>
>> Any ideas how to debug the IDE state after migration and/or check if the issue is similar to the ATAPI IDE
>> problem?
> It's unlikely to be quite the same - most of the ATAPI problems were related to ATAPI
> being quite separate and not saving much state.
>
> The way I found the CDROM problems was to turn on most of the debugging in the ide and bmdma code
> and on a failed migrate try and see what the state of any IO was at the point it migrated.

That's tough. I enabled DEBUG_IDE and DEBUG_AIO at first. But I have never debugged IDE before,
so I first have to understand how that works...

What debugging confirms is that the IDE interface indeed stalls completely.

One thing I found curious in pci.c:

#define BM_MIGRATION_COMPAT_STATUS_BITS \
        (IDE_RETRY_DMA | IDE_RETRY_PIO | \
        IDE_RETRY_READ | IDE_RETRY_FLUSH)

Why is there no IDE_RETRY_WRITE?

Honestly, I have not yet understood what BM_MIGRATION_COMPAT_STATUS_BITS is for.

>
> One other thing to check; I found the newer kernel code recovers better after
> IDE problems; so on a newer guest kernel are there any log warnings about IDE problems,
> even if the guests are otherwise apparently happy?

I will check for that.

Thanks,
Peter