From: Paolo Bonzini
To: Hannes Reinecke, Jim Minter, qemu-devel
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
Date: Thu, 4 Feb 2016 12:27:00 +0100
Message-ID: <56B33584.6070405@redhat.com>
In-Reply-To: <56B2F6EE.6030205@suse.de>
References: <56B2754B.7030809@redhat.com> <56B28B1C.7060202@redhat.com> <56B2F6EE.6030205@suse.de>

On 04/02/2016 07:59, Hannes Reinecke wrote:
> On 02/04/2016 12:19 AM, Paolo Bonzini wrote:
>>
>>
>> On 03/02/2016 22:46, Jim Minter wrote:
>>> I am hitting the following VM lockup issue running a VM with latest
>>> RHEL7 kernel on a host also running latest RHEL7 kernel. FWIW I'm using
>>> virtio-scsi because I want to use discard=unmap. I ran the VM as follows:
>>>
>>> /usr/libexec/qemu-kvm -nodefaults \
>>> -cpu host \
>>> -smp 4 \
>>> -m 8192 \
>>> -drive discard=unmap,file=vm.qcow2,id=disk1,if=none,cache=unsafe \
>>> -device virtio-scsi-pci \
>>> -device scsi-disk,drive=disk1 \
>>> -netdev bridge,id=net0,br=br0 \
>>> -device virtio-net-pci,netdev=net0,mac=$(utils/random-mac.py) \
>>> -chardev socket,id=chan0,path=/tmp/rhev.sock,server,nowait \
>>> -chardev socket,id=chan1,path=/tmp/qemu.sock,server,nowait \
>>> -monitor unix:tmp/vm.sock,server,nowait \
>>> -device virtio-serial-pci \
>>> -device virtserialport,chardev=chan0,name=com.redhat.rhevm.vdsm \
>>> -device virtserialport,chardev=chan1,name=org.qemu.guest_agent.0 \
>>> -device cirrus-vga \
>>> -vnc none \
>>> -usbdevice tablet
>>>
>>> The host was busyish at the time, but not excessively (IMO). Nothing
>>> untoward in the host's kernel log; host storage subsystem is fine. I
>>> didn't get any qemu logs this time around, but I will when the issue
>>> next recurs. The VM's full kernel log is attached; here are the
>>> highlights:
>>
>> Hannes, were you going to send a patch to disable time outs?
>>
> Rah. Didn't I do it already?
> Seems like I didn't; will be doing so shortly.
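
Until such a patch lands, one guest-side stopgap (purely illustrative, and
not the patch Hannes is referring to; device name and value below are just
examples) is to raise the per-device SCSI command timeout so the guest
waits longer before it starts sending aborts while host storage is stalled:

  cat /sys/block/sda/device/timeout         # value in seconds, default 30
  echo 180 > /sys/block/sda/device/timeout  # as root; give stalled storage more slack

That only delays the abort rather than removing it, of course, so it is no
substitute for a real fix.
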
>
>>>
>>> INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 2, t=60002 jiffies, g=5253, c=5252, q=0)
>>> sending NMI to all CPUs:
>>> NMI backtrace for cpu 1
>>> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-327.4.5.el7.x86_64 #1
>>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>>> task: ffff88023417d080 ti: ffff8802341a4000 task.ti: ffff8802341a4000
>>> RIP: 0010:[] [] native_safe_halt+0x6/0x10
>>> RSP: 0018:ffff8802341a7e98 EFLAGS: 00000286
>>> RAX: 00000000ffffffed RBX: ffff8802341a4000 RCX: 0100000000000000
>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
>>> RBP: ffff8802341a7e98 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
>>> R13: ffff8802341a4000 R14: ffff8802341a4000 R15: 0000000000000000
>>> FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007f4978587008 CR3: 000000003645e000 CR4: 00000000003407e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Stack:
>>> ffff8802341a7eb8 ffffffff8101dbcf ffff8802341a4000 ffffffff81a68260
>>> ffff8802341a7ec8 ffffffff8101e4d6 ffff8802341a7f20 ffffffff810d62e5
>>> ffff8802341a7fd8 ffff8802341a4000 2581685d70de192c 7ba58fdb3a3bc8d4
>>> Call Trace:
>>> [] default_idle+0x1f/0xc0
>>> [] arch_cpu_idle+0x26/0x30
>>> [] cpu_startup_entry+0x245/0x290
>>> [] start_secondary+0x1ba/0x230
>>> Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
>>> NMI backtrace for cpu 0
>>
>> This is the NMI watchdog firing; the CPU got stuck for 20 seconds. The
>> issue was not a busy host, but a busy storage (could it be a network
>> partition if the disk was hosted on NFS???)
>>
>> Firing the NMI watchdog is fixed in more recent QEMU, which has
>> asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
>> (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).
>>
> Actually, you still cannot do _real_ async cancellation of I/O; the
> linux aio subsystem implements io_cancel(), but the cancellation
> just aborts the (internal) waitqueue element, not the I/O itself.

Right, but at least the TMF is asynchronous.  Synchronous TMFs keep the
VCPUs in QEMU for many seconds and cause the watchdog to fire.

Paolo
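
P.S. Since an RPM update alone does not change what an already-running
guest uses, it is worth checking both the installed version and the live
process before drawing conclusions. Purely as an illustration (paths per
the RHEL layout above; the pgrep pattern is only an example):

  /usr/libexec/qemu-kvm --version
  rpm -qf /usr/libexec/qemu-kvm
  # a running guest keeps the binary it was started with; "(deleted)" in
  # the output means the on-disk binary was replaced after the guest started
  ls -l /proc/$(pgrep -of qemu-kvm)/exe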