From: Andrei Mikhailovsky
Date: Fri, 9 Aug 2013 15:05:22 +0100 (BST)
Subject: Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]
To: Oliver Francke <Oliver.Francke@filoo.de>
Cc: Josh Durgin <josh.durgin@inktank.com>, ceph-users@lists.ceph.com, Mike Dawson <mike.dawson@cloudapt.com>, Stefan Hajnoczi <stefanha@redhat.com>, qemu-devel@nongnu.org

I can confirm that I am having similar issues with Ubuntu VM guests, using fio with bs=4k direct=1 numjobs=4 iodepth=16. Occasionally I see hung tasks, occasionally the guest VM stops responding without leaving anything in the logs, and sometimes I see a kernel panic on the console.
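For anyone trying to reproduce this, the settings above translate into roughly the following fio job file. This is only a sketch: bs, direct, numjobs and iodepth come from the report; the job name, rw pattern, ioengine, size and runtime are my assumptions.

```ini
; Hypothetical job file approximating the reported workload.
; bs/direct/numjobs/iodepth are taken from the report above;
; rw, ioengine, size and runtime are assumed for illustration.
[guest-hang-repro]
bs=4k
direct=1
numjobs=4
iodepth=16
rw=randwrite
ioengine=libaio
size=1g
runtime=3600
time_based=1
```

Run inside the guest with `fio guest-hang-repro.fio` and watch the console for hung-task messages.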
I typically set the runtime of the fio test to 60 minutes, and the guest tends to stop responding after about 10-30 mins.

I am on Ubuntu 12.04 with the 3.5 backport kernel, using Ceph 0.61.7 with qemu 1.5.0 and libvirt 1.0.2.

Andrei

----- Original Message -----

From: "Oliver Francke" <Oliver.Francke@filoo.de>
To: "Josh Durgin" <josh.durgin@inktank.com>
Cc: ceph-users@lists.ceph.com, "Mike Dawson" <mike.dawson@cloudapt.com>, "Stefan Hajnoczi" <stefanha@redhat.com>, qemu-devel@nongnu.org
Sent: Friday, 9 August, 2013 10:22:00 AM
Subject: Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

Hi Josh,

just opened

http://tracker.ceph.com/issues/5919

with all collected information incl. debug-log.

Hope it helps,

Oliver.

On 08/08/2013 07:01 PM, Josh Durgin wrote:
> On 08/08/2013 05:40 AM, Oliver Francke wrote:
>> Hi Josh,
>>
>> I have a session logged with:
>>
>>     debug_ms=1:debug_rbd=20:debug_objectcacher=30
>>
>> as you requested from Mike, even if I think we have another story
>> here, anyway.
>>
>> Host kernel is 3.10.0-rc7, qemu-client is 1.6.0-rc2, client kernel is
>> 3.2.0-51-amd...
>>
>> Do you want me to open a ticket for that stuff? I have about 5MB of
>> compressed logfile waiting for you ;)
>
> Yes, that'd be great. If you could include the time when you saw the
> guest hang, that'd be ideal. I'm not sure if this is one or two bugs,
> but it seems likely it's a bug in rbd and not qemu.
>
> Thanks!
> Josh
>
>> Thnx in advance,
>>
>> Oliver.
>>
>> On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:
>>> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
>>>> On 02.08.2013 at 23:47, Mike Dawson <mike.dawson@cloudapt.com> wrote:
>>>>> We can "un-wedge" the guest by opening a NoVNC session or running a
>>>>> 'virsh screenshot' command. After that, the guest resumes and runs
>>>>> as expected.
>>>>> At that point we can examine the guest. Each time we'll
>>>>> see:
>>> If virsh screenshot works, then this confirms that QEMU itself is still
>>> responding. Its main loop cannot be blocked, since it was able to
>>> process the screendump command.
>>>
>>> This supports Josh's theory that a callback is not being invoked. The
>>> virtio-blk I/O request would be left in a pending state.
>>>
>>> Now here is where the behavior varies between configurations:
>>>
>>> On a Windows guest with 1 vCPU, you may see the symptom that the guest
>>> no longer responds to ping.
>>>
>>> On a Linux guest with multiple vCPUs, you may see the hung task message
>>> from the guest kernel because the other vCPUs are still making progress.
>>> Just the vCPU that issued the I/O request, and whose task is in the
>>> UNINTERRUPTIBLE state, would really be stuck.
>>>
>>> Basically, the symptoms depend not just on how QEMU is behaving but also
>>> on the guest kernel and how many vCPUs you have configured.
>>>
>>> I think this can explain how both problems you are observing, Oliver and
>>> Mike, are a result of the same bug.
>>> At least I hope they are :).
>>>
>>> Stefan
>>
>>
>

-- 

Oliver Francke

filoo GmbH
Moltkestraße 25a
33330 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer (managing directors): J. Rehpöhler | C. Kunz

Follow us on Twitter: http://twitter.com/filoogmbh

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
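As a side note on Stefan's diagnosis: the hung-task message fires for tasks stuck in uninterruptible sleep, and you can look for such tasks from inside an affected guest while it is wedged. A minimal sketch using standard procps tools (the sysctl file may be absent if the kernel was built without the hung-task watchdog):

```shell
# List tasks in uninterruptible sleep ("D" state) -- the state that
# triggers the guest kernel's hung-task message once a task exceeds
# hung_task_timeout_secs (120 s by default). The wchan column shows
# the kernel function the task is blocked in.
ps -eo pid,stat,wchan:20,comm | awk 'NR == 1 || $2 ~ /^D/'

# Show the watchdog timeout, where the kernel exposes it.
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || true
```

On a healthy guest this usually prints only the header line; during one of the stalls described above, the vCPU task blocked on the pending virtio-blk request should show up in D state.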
