References: <56B2754B.7030809@redhat.com> <56B28B1C.7060202@redhat.com>
From: Hannes Reinecke
Message-ID: <56B2F6EE.6030205@suse.de>
Date: Thu, 4 Feb 2016 07:59:58 +0100
In-Reply-To: <56B28B1C.7060202@redhat.com>
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
To: Paolo Bonzini, Jim Minter, qemu-devel

On 02/04/2016 12:19 AM, Paolo Bonzini wrote:
>
>
> On 03/02/2016 22:46, Jim Minter wrote:
>> I am hitting the following VM lockup issue running a VM with latest
>> RHEL7 kernel on a host also running latest RHEL7 kernel. FWIW I'm using
>> virtio-scsi because I want to use discard=unmap.
>> I ran the VM as follows:
>>
>> /usr/libexec/qemu-kvm -nodefaults \
>>   -cpu host \
>>   -smp 4 \
>>   -m 8192 \
>>   -drive discard=unmap,file=vm.qcow2,id=disk1,if=none,cache=unsafe \
>>   -device virtio-scsi-pci \
>>   -device scsi-disk,drive=disk1 \
>>   -netdev bridge,id=net0,br=br0 \
>>   -device virtio-net-pci,netdev=net0,mac=$(utils/random-mac.py) \
>>   -chardev socket,id=chan0,path=/tmp/rhev.sock,server,nowait \
>>   -chardev socket,id=chan1,path=/tmp/qemu.sock,server,nowait \
>>   -monitor unix:tmp/vm.sock,server,nowait \
>>   -device virtio-serial-pci \
>>   -device virtserialport,chardev=chan0,name=com.redhat.rhevm.vdsm \
>>   -device virtserialport,chardev=chan1,name=org.qemu.guest_agent.0 \
>>   -device cirrus-vga \
>>   -vnc none \
>>   -usbdevice tablet
>>
>> The host was busyish at the time, but not excessively (IMO). Nothing
>> untoward in the host's kernel log; host storage subsystem is fine. I
>> didn't get any qemu logs this time around, but I will when the issue
>> next recurs. The VM's full kernel log is attached; here are the
>> highlights:
>
> Hannes, were you going to send a patch to disable time outs?
>

Rah. Didn't I do it already?
Seems like I didn't; will be doing so shortly.
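Until such a patch exists, a common stopgap is to raise the guest's per-device SCSI command timeout through sysfs. The sketch below is only an assumption about how that could be scripted, and it is not the patch mentioned above. It reads, and optionally rewrites, `/sys/block/<dev>/device/timeout` (value in seconds). The helper name `scsi_timeout` and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def scsi_timeout(dev: str, new_seconds: Optional[int] = None,
                 sysfs_root: str = "/sys") -> int:
    """Return the SCSI command timeout (in seconds) of a block device.

    If new_seconds is given, it is written first (this needs root on a
    real system). sysfs_root is a hypothetical parameter so the helper
    can be pointed at a scratch directory instead of the live /sys.
    """
    attr = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    if new_seconds is not None:
        if new_seconds <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        # The kernel attribute takes a plain decimal number of seconds.
        attr.write_text(f"{new_seconds}\n")
    # Read the value back so the caller sees what is actually in effect.
    return int(attr.read_text().strip())
```

Inside the guest, `scsi_timeout("sda")` would report the current value and `scsi_timeout("sda", 180)` would raise it to an arbitrary example value of 180 seconds. A longer timeout only delays the abort; it does nothing about slow host storage.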
>>
>> INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 2, t=60002 jiffies, g=5253, c=5252, q=0)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 1
>> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-327.4.5.el7.x86_64 #1
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> task: ffff88023417d080 ti: ffff8802341a4000 task.ti: ffff8802341a4000
>> RIP: 0010:[] [] native_safe_halt+0x6/0x10
>> RSP: 0018:ffff8802341a7e98 EFLAGS: 00000286
>> RAX: 00000000ffffffed RBX: ffff8802341a4000 RCX: 0100000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
>> RBP: ffff8802341a7e98 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
>> R13: ffff8802341a4000 R14: ffff8802341a4000 R15: 0000000000000000
>> FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f4978587008 CR3: 000000003645e000 CR4: 00000000003407e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>>  ffff8802341a7eb8 ffffffff8101dbcf ffff8802341a4000 ffffffff81a68260
>>  ffff8802341a7ec8 ffffffff8101e4d6 ffff8802341a7f20 ffffffff810d62e5
>>  ffff8802341a7fd8 ffff8802341a4000 2581685d70de192c 7ba58fdb3a3bc8d4
>> Call Trace:
>>  [] default_idle+0x1f/0xc0
>>  [] arch_cpu_idle+0x26/0x30
>>  [] cpu_startup_entry+0x245/0x290
>>  [] start_secondary+0x1ba/0x230
>> Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
>> NMI backtrace for cpu 0
>
> This is the NMI watchdog firing; the CPU got stuck for 20 seconds. The
> issue was not a busy host, but a busy storage (could it be a network
> partition if the disk was hosted on NFS???)
>
> Firing the NMI watchdog is fixed in more recent QEMU, which has
> asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
> (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).
>
Actually, you still cannot do _real_ async cancellation of I/O; the
linux aio subsystem implements io_cancel(), but the cancellation
just aborts the (internal) waitqueue element, not the I/O itself.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@suse.de                                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
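
The point about io_cancel() above, that giving up the wait is not the same as stopping the I/O, can be illustrated with a user-space analogy. This is only an analogy in Python and does not call the kernel AIO interface. `abandon_slow_write` and its events are hypothetical names. A worker thread plays the in-flight request; the caller stops waiting and asks for cancellation, yet the write still completes once the simulated storage responds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def abandon_slow_write(wait_seconds: float = 0.05) -> dict:
    """Start a slow 'write', stop waiting for it, and report what happened."""
    started = threading.Event()   # set once the worker has begun the "I/O"
    release = threading.Event()   # stands in for the storage finally responding
    log = []

    def slow_write() -> str:
        started.set()
        release.wait()            # the simulated device is stuck here
        log.append("write hit the disk")
        return "done"

    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(slow_write)
        started.wait()            # make sure the request is really in flight
        try:
            fut.result(timeout=wait_seconds)
            waiter_timed_out = False
        except FutureTimeout:
            waiter_timed_out = True   # only the waiter gave up
        # cancel() returns False for work that is already running:
        # the request itself is not stopped.
        cancelled = fut.cancel()
        release.set()             # the storage recovers
        result = fut.result()     # the write completes anyway

    return {
        "waiter_timed_out": waiter_timed_out,
        "cancelled": cancelled,
        "result": result,
        "log": log,
    }
```

In this sketch the caller sees a timeout and a failed cancellation, while the log still records the write. That matches the statement above: an aborted wait leaves the underlying request in flight.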