From: Peter Lieven
Subject: Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1
Date: Thu, 28 Jun 2012 12:13:20 +0200
Message-ID: <4FEC2E40.5090400@dlhnet.de>
References: <4FEB2945.1030607@dlhnet.de> <4FEB3AC6.6010206@web.de> <4FEC1FC9.7050103@dlhnet.de> <4FEC2210.1030005@siemens.com> <4FEC2475.4030202@dlhnet.de> <4FEC263C.5030504@siemens.com>
In-Reply-To: <4FEC263C.5030504@siemens.com>
To: Jan Kiszka
Cc: "qemu-devel@nongnu.org", "kvm@vger.kernel.org", Gleb Natapov

On 28.06.2012 11:39, Jan Kiszka wrote:
> On 2012-06-28 11:31, Peter Lieven wrote:
>> On 28.06.2012 11:21, Jan Kiszka wrote:
>>> On 2012-06-28 11:11, Peter Lieven wrote:
>>>> On 27.06.2012 18:54, Jan Kiszka wrote:
>>>>> On 2012-06-27 17:39, Peter Lieven wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> i debugged this further and found out that kvm-kmod-3.0 is working with
>>>>>> qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
>>>>>> working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
>>>>>> Does anyone have a clue which new KVM feature could cause this if a vcpu
>>>>>> is in an infinite loop?
>>>>> Before accusing kvm-kmod ;), can you check if the effect is visible with
>>>>> an original Linux 3.3.x or 3.4.x kernel as well?
>>>> sorry, i should have been more specific. maybe I also misunderstood sth.
>>>> I believed that kvm-kmod-3.0 is basically what is in the vanilla 3.0
>>>> kernel. If I use the Ubuntu kernel from Ubuntu Oneiric (3.0.0) it works;
>>>> if I use a self-compiled kvm-kmod-3.3/3.4 with that kernel, it doesn't.
>>>> however, maybe we don't have to dig too deep - see below.
>>> kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
>>> work on an older kernel. This step may introduce bugs of its own.
>>> Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
>>> first of all.
>>>
>>>>> Then, bisecting the change in qemu-kvm that apparently resolved the
>>>>> issue would be interesting.
>>>>>
>>>>> If we have to dig deeper, tracing [1] the lockup would likely be helpful
>>>>> (all events of the qemu process, not just KVM related ones: trace-cmd
>>>>> record -e all qemu-system-x86_64 ...).
>>>> here is basically what's going on:
>>>>
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio read len 3 gpa 0xa0000 val 0x10ff
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: vcpu_match_mmio: gva 0xa0000 gpa 0xa0000 Read GPA
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio unsatisfied-read len 1 gpa 0xa0000 val 0x0
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio read len 3 gpa 0xa0000 val 0x10ff
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: vcpu_match_mmio: gva 0xa0000 gpa 0xa0000 Read GPA
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio unsatisfied-read len 1 gpa 0xa0000 val 0x0
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio read len 3 gpa 0xa0000 val 0x10ff
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: vcpu_match_mmio: gva 0xa0000 gpa 0xa0000 Read GPA
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio unsatisfied-read len 1 gpa 0xa0000 val 0x0
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio read len 3 gpa 0xa0000 val 0x10ff
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: vcpu_match_mmio: gva 0xa0000 gpa 0xa0000 Read GPA
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_mmio: mmio unsatisfied-read len 1 gpa 0xa0000 val 0x0
>>>> qemu-kvm-1.0-2506 [010] 60996.908000: kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
>>>>
>>>> it's doing that forever. this is tracing the kvm module. doing the
>>>> qemu-system-x86_64 trace is a bit complicated, but maybe this is
>>>> already sufficient. otherwise i will of course gather this info as well.
>>> That's only tracing KVM events, and it's tracing when things went wrong
>>> already. We may need a full trace (-e all) specifically for the period
>>> when this pattern above started.
>> i will do that. maybe i should explain that the vcpu is executing
>> garbage when this above starts. it's basically booting from an empty
>> hard disk.
>>
>> if i understand correctly, qemu-kvm loops in kvm_cpu_exec(CPUState *env);
>>
>> maybe the time to handle the monitor/qmp connection is just too short.
>> if i further understand correctly, it can only handle monitor connections
>> while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
>> wrong here? the time spent in this state might be rather short.
> Unless you played with priorities and affinities, the Linux scheduler
> should provide the required time to the iothread.

I have a 1.1GB (85MB compressed) trace file. If you have time to look at
it, I could drop it somewhere.

We currently run all VMs with nice 1 because we observed that this improves
the controllability of the node in case all VMs have excessive CPU load.
Running the VM un-niced does not change the behaviour, unfortunately.

Peter

>> my concern is not that the machine hangs, just that the hypervisor is
>> unresponsive and it's impossible to reset or quit gracefully. the only
>> way to get the hypervisor ended is via SIGKILL.
> Right. Even if the guest runs wild, you must be able to control the vm
> via the monitor etc. If not, that's a bug.
>
> Jan
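The question of whether monitor handling can only happen while the vCPU thread is inside KVM_RUN can be illustrated with a small model. As far as I understand the QEMU code of this era, the vCPU thread releases the global mutex around the KVM_RUN ioctl and re-takes it to handle the exit, so the iothread (monitor/QMP) can only grab that mutex in those windows. This is a hypothetical sketch, not QEMU code: big_lock, EXIT_MMIO, fake_kvm_run(), vcpu_thread(), monitor_thread() and run_quit_demo() are invented names, and a roughly 1 microsecond sleep stands in for time spent in the guest.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins, not QEMU code: big_lock models the global
 * "iothread" mutex; EXIT_MMIO mirrors the value of KVM_EXIT_MMIO (6). */
#define EXIT_MMIO 6

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool quit_requested;
static atomic_long mmio_exits;

/* Models ioctl(vcpu_fd, KVM_RUN, 0): the guest "runs" for about a
 * microsecond, then exits on an unsatisfied MMIO read. */
static int fake_kvm_run(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };
    nanosleep(&ts, NULL);
    return EXIT_MMIO;
}

/* vCPU thread: drops the lock only while "inside KVM_RUN" and handles
 * every exit with the lock held. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    while (!atomic_load(&quit_requested)) {
        pthread_mutex_unlock(&big_lock);
        int reason = fake_kvm_run();
        pthread_mutex_lock(&big_lock);
        if (reason == EXIT_MMIO)
            atomic_fetch_add(&mmio_exits, 1); /* "emulate" the MMIO read */
    }
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Monitor thread: processing "quit" needs the same lock, so it can only
 * make progress while the vCPU thread has dropped it. */
static void *monitor_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    atomic_store(&quit_requested, true);
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Starts both threads; returns how many MMIO exits the vCPU handled
 * before the quit request took effect, or -1 if a thread failed. */
long run_quit_demo(void)
{
    pthread_t vcpu, mon;
    atomic_store(&quit_requested, false);
    atomic_store(&mmio_exits, 0);
    if (pthread_create(&vcpu, NULL, vcpu_thread, NULL) != 0)
        return -1;
    if (pthread_create(&mon, NULL, monitor_thread, NULL) != 0) {
        atomic_store(&quit_requested, true); /* let the vCPU thread exit */
        pthread_join(vcpu, NULL);
        return -1;
    }
    pthread_join(mon, NULL);
    pthread_join(vcpu, NULL);
    assert(atomic_load(&quit_requested)); /* the "quit" did get through */
    return atomic_load(&mmio_exits);
}
```

run_quit_demo() returns once the monitor thread has taken the lock and set the flag; the exit count varies from run to run. If the nanosleep() is removed, the lock is free for only a few instructions per iteration, and because a default pthread mutex makes no fairness guarantee, the vCPU thread can re-acquire it ahead of the waiting monitor thread. That is the kind of starvation suspected in the thread; Jan's reply suggests the Linux scheduler should normally prevent it unless priorities or affinities were changed.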