From: Andrei Mikhailovsky
Date: Fri, 9 Aug 2013 15:05:22 +0100 (BST)
Subject: Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]
To: Oliver Francke <Oliver.Francke@filoo.de>
Cc: Josh Durgin <josh.durgin@inktank.com>, ceph-users@lists.ceph.com, Mike Dawson <mike.dawson@cloudapt.com>, Stefan Hajnoczi <stefanha@redhat.com>, qemu-devel@nongnu.org

I can confirm that I am having similar issues with Ubuntu VM guests, using fio with bs=4k direct=1 numjobs=4 iodepth=16. Occasionally I see hung tasks, occasionally the guest VM stops responding without leaving anything in the logs, and sometimes I see a kernel panic on the console.
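For anyone trying to reproduce this, the settings above translate into roughly the following fio job file. This is only a sketch: bs, direct, numjobs and iodepth come from the report; the job name, rw pattern, ioengine, size and runtime are my assumptions.

```ini
; Hypothetical job file approximating the reported workload.
; bs/direct/numjobs/iodepth are taken from the report above;
; rw, ioengine, size and runtime are assumed for illustration.
[guest-hang-repro]
bs=4k
direct=1
numjobs=4
iodepth=16
rw=randwrite
ioengine=libaio
size=1g
runtime=3600
time_based=1
```

Run inside the guest with `fio guest-hang-repro.fio` and watch the console for hung-task messages.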
I typically set the runtime of the fio test to 60 minutes, and the guest tends to stop responding after about 10-30 mins.

I am on Ubuntu 12.04 with the 3.5 backport kernel, using Ceph 0.61.7 with qemu 1.5.0 and libvirt 1.0.2.

Andrei

----- Original Message -----

From: "Oliver Francke" <Oliver.Francke@filoo.de>
To: "Josh Durgin" <josh.durgin@inktank.com>
Cc: ceph-users@lists.ceph.com, "Mike Dawson" <mike.dawson@cloudapt.com>, "Stefan Hajnoczi" <stefanha@redhat.com>, qemu-devel@nongnu.org
Sent: Friday, 9 August, 2013 10:22:00 AM
Subject: Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

Hi Josh,

just opened

http://tracker.ceph.com/issues/5919

with all collected information incl. debug-log.

Hope it helps,

Oliver.

On 08/08/2013 07:01 PM, Josh Durgin wrote:
> On 08/08/2013 05:40 AM, Oliver Francke wrote:
>> Hi Josh,
>>
>> I have a session logged with:
>>
>>     debug_ms=1:debug_rbd=20:debug_objectcacher=30
>>
>> as you requested from Mike, even if I think we have another story
>> here, anyway.
>>
>> Host kernel is 3.10.0-rc7, qemu-client is 1.6.0-rc2, client kernel is
>> 3.2.0-51-amd...
>>
>> Do you want me to open a ticket for that stuff? I have about 5MB of
>> compressed logfile waiting for you ;)
>
> Yes, that'd be great. If you could include the time when you saw the
> guest hang, that'd be ideal. I'm not sure if this is one or two bugs,
> but it seems likely it's a bug in rbd and not qemu.
>
> Thanks!
> Josh
>
>> Thnx in advance,
>>
>> Oliver.
>>
>> On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:
>>> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
>>>> On 02.08.2013 at 23:47, Mike Dawson <mike.dawson@cloudapt.com> wrote:
>>>>> We can "un-wedge" the guest by opening a NoVNC session or running a
>>>>> 'virsh screenshot' command. After that, the guest resumes and runs
>>>>> as expected.
>>>>> At that point we can examine the guest. Each time we'll
>>>>> see:
>>> If virsh screenshot works, then this confirms that QEMU itself is still
>>> responding. Its main loop cannot be blocked, since it was able to
>>> process the screendump command.
>>>
>>> This supports Josh's theory that a callback is not being invoked. The
>>> virtio-blk I/O request would be left in a pending state.
>>>
>>> Now here is where the behavior varies between configurations:
>>>
>>> On a Windows guest with 1 vCPU, you may see the symptom that the guest
>>> no longer responds to ping.
>>>
>>> On a Linux guest with multiple vCPUs, you may see the hung task message
>>> from the guest kernel because the other vCPUs are still making progress.
>>> Just the vCPU that issued the I/O request, and whose task is in the
>>> UNINTERRUPTIBLE state, would really be stuck.
>>>
>>> Basically, the symptoms depend not just on how QEMU is behaving but also
>>> on the guest kernel and how many vCPUs you have configured.
>>>
>>> I think this can explain how both problems you are observing, Oliver and
>>> Mike, are a result of the same bug.
>>> At least I hope they are :).
>>>
>>> Stefan
>>
>>
>

-- 

Oliver Francke

filoo GmbH
Moltkestraße 25a
33330 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer (managing directors): J. Rehpöhler | C. Kunz

Follow us on Twitter: http://twitter.com/filoogmbh

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
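As a side note on Stefan's diagnosis: the hung-task message fires for tasks stuck in uninterruptible sleep, and you can look for such tasks from inside an affected guest while it is wedged. A minimal sketch using standard procps tools (the sysctl file may be absent if the kernel was built without the hung-task watchdog):

```shell
# List tasks in uninterruptible sleep ("D" state) -- the state that
# triggers the guest kernel's hung-task message once a task exceeds
# hung_task_timeout_secs (120 s by default). The wchan column shows
# the kernel function the task is blocked in.
ps -eo pid,stat,wchan:20,comm | awk 'NR == 1 || $2 ~ /^D/'

# Show the watchdog timeout, where the kernel exposes it.
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || true
```

On a healthy guest this usually prints only the header line; during one of the stalls described above, the vCPU task blocked on the pending virtio-blk request should show up in D state.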
