Date: Fri, 23 May 2014 14:29:06 +0200
From: Kevin Wolf
Message-ID: <20140523122906.GB5254@noname.redhat.com>
Subject: Re: [Qemu-devel] qemu 2.0, deadlock in block-commit
To: Marcin Gibuła
Cc: Paolo Bonzini, qemu-devel@nongnu.org

On 23.05.2014 at 11:25, Marcin Gibuła wrote:
> On 23.05.2014 10:19, Paolo Bonzini wrote:
> >On 22/05/2014 23:05, Marcin Gibuła wrote:
> >>Some more info.
> >>VM was doing a lot of write IO during this test.
> >
> >QEMU is waiting for librados to complete I/O. Can you reproduce it with
> >a different driver?
>
> Hi,
>
> I've reproduced it without RBD. Backtrace below:
> [...]

I see that you have a mix of aio=native and aio=threads disks. I can't say much about the aio=native disks (perhaps try to reproduce without them?), but for the other disks there are definitely no worker threads that bdrv_drain_all() would have to wait for.
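To illustrate why a single disk that reports pending work, with nothing left that could ever complete it, is enough to hang the whole drain, here is a minimal model of a drain loop that keeps polling while any disk reports requests in flight. This is a hypothetical sketch, not QEMU's bdrv_drain_all(): `struct disk`, the function names and the `max_iter` cap are invented for illustration, and the cap exists only so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified per-disk state (not QEMU's BlockDriverState). */
struct disk {
    int in_flight;          /* requests the disk believes are still pending */
    int completes_per_poll; /* requests one event-loop iteration completes */
};

/* Model of a per-disk "are requests still pending?" check. */
static bool disk_requests_pending(const struct disk *d)
{
    return d->in_flight > 0;
}

/* True if any disk reports pending requests. */
static bool requests_pending_all(const struct disk *disks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (disk_requests_pending(&disks[i])) {
            return true;
        }
    }
    return false;
}

/* One simulated event-loop iteration: each disk completes some requests. */
static void poll_once(struct disk *disks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int done = disks[i].completes_per_poll;
        if (done > disks[i].in_flight) {
            done = disks[i].in_flight;
        }
        disks[i].in_flight -= done;
    }
}

/*
 * Keep polling while any disk reports pending requests.
 * Returns the number of poll iterations needed, or -1 once max_iter is hit.
 * The cap only keeps this model finite; an uncapped loop would wait forever
 * for a disk whose pending state never changes.
 */
static int drain_all_model(struct disk *disks, size_t n, int max_iter)
{
    assert(max_iter >= 0);
    int iter = 0;
    while (requests_pending_all(disks, n)) {
        if (iter == max_iter) {
            return -1;
        }
        poll_once(disks, n);
        iter++;
    }
    return iter;
}
```

For example, two disks with 3 and 0 requests in flight, each completing one request per poll, drain after 3 iterations. If one disk's state says 1 request is pending but no poll ever completes it, `drain_all_model()` gives up with -1, where an uncapped loop would never return.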
bdrv_requests_pending(), which is called by bdrv_requests_pending_all(), is the function that determines, for each disk in your VM, whether it still has requests in flight that need to complete. It must have returned true here even though there is nothing to wait for. Can you check which of its conditions caused this, and for which disk? You can either set a breakpoint there and single-step through the function the next time it is called (if the poll has a timeout at all, that is), or inspect the conditions manually in gdb.

Kevin
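As an illustration of the check suggested above, the sketch below evaluates a "requests pending?" test condition by condition and reports which one fired for which disk, instead of collapsing everything into a single bool. It is a self-contained model, not QEMU code: the node type, the enum and the function names are hypothetical, and the four conditions (tracked requests in flight, throttled requests, pending work on the protocol child, pending work on the backing file) are an assumption about what such a check looks at. Compare them against bdrv_requests_pending() in the QEMU tree you actually run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reasons a node can report pending work (not a QEMU enum). */
enum pending_reason {
    PENDING_NONE = 0,
    PENDING_TRACKED,   /* requests in flight on this node */
    PENDING_THROTTLED, /* requests held back by I/O throttling */
    PENDING_FILE,      /* the protocol child reports pending work */
    PENDING_BACKING,   /* the backing file reports pending work */
};

/* Hypothetical, simplified node of a disk's image chain. */
struct node {
    const char *name;
    int tracked_requests;
    int throttled_requests;
    const struct node *file;    /* may be NULL */
    const struct node *backing; /* may be NULL */
};

/* Evaluate the conditions one by one and say which one fired. */
static enum pending_reason why_pending(const struct node *n)
{
    assert(n != NULL);
    if (n->tracked_requests > 0) {
        return PENDING_TRACKED;
    }
    if (n->throttled_requests > 0) {
        return PENDING_THROTTLED;
    }
    if (n->file != NULL && why_pending(n->file) != PENDING_NONE) {
        return PENDING_FILE;
    }
    if (n->backing != NULL && why_pending(n->backing) != PENDING_NONE) {
        return PENDING_BACKING;
    }
    return PENDING_NONE;
}

static const char *reason_str(enum pending_reason r)
{
    switch (r) {
    case PENDING_TRACKED:   return "tracked requests in flight";
    case PENDING_THROTTLED: return "throttled requests queued";
    case PENDING_FILE:      return "protocol child has pending work";
    case PENDING_BACKING:   return "backing file has pending work";
    default:                return "nothing pending";
    }
}

/* Print each top-level disk that still reports pending work, and why.
 * Returns the number of disks reported. */
static size_t report_pending(const struct node *const disks[], size_t n, FILE *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        enum pending_reason r = why_pending(disks[i]);
        if (r != PENDING_NONE) {
            fprintf(out, "%s: pending (%s)\n", disks[i]->name, reason_str(r));
            count++;
        }
    }
    return count;
}
```

For a top-level disk whose only stale state is one tracked request on its protocol child, `why_pending()` returns `PENDING_FILE`, and `report_pending()` prints a single line, `drive0: pending (protocol child has pending work)`, while an idle disk is not listed. In gdb, the same idea amounts to stepping through the real function and noting which branch returns true.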