From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5204B4B8.3080302@filoo.de>
Date: Fri, 09 Aug 2013 11:22:00 +0200
From: Oliver Francke
References: <51FB887F.5070908@filoo.de> <51FC2903.3030802@cloudapt.com> <5739DFCB-21A5-4AED-82BF-6B58D3E1502A@filoo.de> <20130805074835.GA12658@stefanha-thinkpad.muc.redhat.com> <520391D1.7070704@filoo.de> <5203CEE4.7040901@inktank.com>
In-Reply-To: <5203CEE4.7040901@inktank.com>
Subject: Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]
To: Josh Durgin
Cc: ceph-users@lists.ceph.com, Mike Dawson, Stefan Hajnoczi, "qemu-devel@nongnu.org"

Hi Josh,

just opened

http://tracker.ceph.com/issues/5919

with all collected information incl. debug-log.

Hope it helps,

Oliver.

On 08/08/2013 07:01 PM, Josh Durgin wrote:
> On 08/08/2013 05:40 AM, Oliver Francke wrote:
>> Hi Josh,
>>
>> I have a session logged with:
>>
>> debug_ms=1:debug_rbd=20:debug_objectcacher=30
>>
>> as you requested from Mike, even if I think, we do have another story
>> here, anyway.
>>
>> Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is
>> 3.2.0-51-amd...
>>
>> Do you want me to open a ticket for that stuff? I have about 5MB
>> compressed logfile waiting for you ;)
>
> Yes, that'd be great. If you could include the time when you saw the
> guest hang that'd be ideal. I'm not sure if this is one or two bugs,
> but it seems likely it's a bug in rbd and not qemu.
>
> Thanks!
> Josh
>
>> Thnx in advance,
>>
>> Oliver.
>>
>> On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:
>>> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
>>>> On 02.08.2013 at 23:47, Mike Dawson wrote:
>>>>> We can "un-wedge" the guest by opening a NoVNC session or running a
>>>>> 'virsh screenshot' command. After that, the guest resumes and runs
>>>>> as expected. At that point we can examine the guest. Each time we'll
>>>>> see:
>>> If virsh screenshot works then this confirms that QEMU itself is still
>>> responding. Its main loop cannot be blocked since it was able to
>>> process the screendump command.
>>>
>>> This supports Josh's theory that a callback is not being invoked. The
>>> virtio-blk I/O request would be left in a pending state.
>>>
>>> Now here is where the behavior varies between configurations:
>>>
>>> On a Windows guest with 1 vCPU, you may see the symptom that the guest no
>>> longer responds to ping.
>>>
>>> On a Linux guest with multiple vCPUs, you may see the hung task message
>>> from the guest kernel because other vCPUs are still making progress.
>>> Just the vCPU that issued the I/O request and whose task is in
>>> UNINTERRUPTIBLE state would really be stuck.
>>>
>>> Basically, the symptoms depend not just on how QEMU is behaving but also
>>> on the guest kernel and how many vCPUs you have configured.
>>>
>>> I think this can explain how both problems you are observing, Oliver and
>>> Mike, are a result of the same bug. At least I hope they are :).
>>>
>>> Stefan
>>
>>
>

-- 

Oliver Francke

filoo GmbH
Moltkestraße 25a
33330 Gütersloh
HRB4355 AG Gütersloh

Managing directors: J.Rehpöhler | C.Kunz

Follow us on Twitter: http://twitter.com/filoogmbh
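
For anyone reading this thread from the archive: the debug settings quoted above (debug_ms, debug_rbd, debug_objectcacher) were given as one colon-separated string of key=value pairs. Below is a minimal Python sketch that splits such a string into individual settings. It assumes only the plain format shown in the mail; the function name parse_debug_options is made up for illustration and is not part of qemu or ceph, and escaped colons are not handled.

```python
def parse_debug_options(spec):
    """Split a colon-separated "key=value" string into a dict.

    Hypothetical helper for illustration only; it handles the plain
    format quoted in the mail and nothing more (no escaping).
    """
    options = {}
    for part in spec.split(":"):
        if not part:
            # Tolerate empty segments such as a trailing colon.
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % part)
        options[key] = value
    return options


if __name__ == "__main__":
    # The exact string from the quoted mail.
    opts = parse_debug_options("debug_ms=1:debug_rbd=20:debug_objectcacher=30")
    for name, level in opts.items():
        print(name, "=", level)
```

The values are kept as strings on purpose, since the sketch makes no assumption about which settings are numeric.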