From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 05 Aug 2013 16:08:47 -0400
From: Mike Dawson
To: Stefan Hajnoczi
Cc: Josh Durgin, ceph-users@lists.ceph.com, Oliver Francke, "qemu-devel@nongnu.org"
Subject: Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]
Message-ID: <5200064F.5090103@cloudapt.com>
In-Reply-To: <20130805074835.GA12658@stefanha-thinkpad.muc.redhat.com>
References: <51FB887F.5070908@filoo.de> <51FC2903.3030802@cloudapt.com> <5739DFCB-21A5-4AED-82BF-6B58D3E1502A@filoo.de> <20130805074835.GA12658@stefanha-thinkpad.muc.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Josh,

Logs are uploaded to cephdrop with the file name mikedawson-rbd-qemu-deadlock.

- At about 2013-08-05 19:46 or 19:47, we hit the issue and traffic went to 0
- At about 2013-08-05 19:53:51, I ran a 'virsh screenshot'

Environment is:

- Ceph 0.61.7 (client is co-mingled with three OSDs)
- rbd cache = true and cache=writeback
- qemu 1.4.0 (1.4.0+dfsg-1expubuntu4)
- Ubuntu Raring with 3.8.0-25-generic

This issue is reproducible in my environment, and I'm willing to run any wip branch you need. What else can I provide to help?

Thanks,
Mike Dawson

On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote:
> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
>> On 02.08.2013 at 23:47, Mike Dawson wrote:
>>> We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh screenshot' command. After that, the guest resumes and runs as expected. At that point we can examine the guest. Each time we'll see:
>
> If virsh screenshot works then this confirms that QEMU itself is still
> responding. Its main loop cannot be blocked, since it was able to
> process the screendump command.
>
> This supports Josh's theory that a callback is not being invoked. The
> virtio-blk I/O request would be left in a pending state.
>
> Now here is where the behavior varies between configurations:
>
> On a Windows guest with 1 vCPU, you may see the symptom that the guest no
> longer responds to ping.
>
> On a Linux guest with multiple vCPUs, you may see the hung task message
> from the guest kernel because other vCPUs are still making progress.
> Just the vCPU that issued the I/O request and whose task is in
> UNINTERRUPTIBLE state would really be stuck.
>
> Basically, the symptoms depend not just on how QEMU is behaving but also
> on the guest kernel and how many vCPUs you have configured.
>
> I think this can explain how both problems you are observing, Oliver and
> Mike, are a result of the same bug. At least I hope they are :).
>
> Stefan
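
For anyone trying to reproduce a similar setup, below is a minimal sketch of how the two cache settings listed in the environment above are usually wired together. This is a generic illustration, not the exact configuration from this cluster; the pool, image and monitor names are placeholders:

    # /etc/ceph/ceph.conf on the client side: enable the librbd cache
    [client]
        rbd cache = true

The matching libvirt disk definition requests writeback caching from QEMU so it cooperates with the RBD cache (again, 'rbd-pool/vm-image' and 'ceph-mon1' are made-up names):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='rbd-pool/vm-image'>
        <host name='ceph-mon1' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>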
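
The 'virsh screenshot' workaround mentioned in the thread is just the stock virsh command; taking a screendump makes QEMU's main loop service a monitor request, which appears to be enough to un-wedge the guest. The domain name, output path and guest address below are placeholders:

    # Force QEMU to process a screendump for the stuck domain
    virsh screenshot guest-vm /tmp/guest-vm.ppm

    # Then check whether the guest has resumed, e.g. by pinging it
    ping -c 3 192.0.2.10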
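
On the guest side, one way to see which task is stuck in the UNINTERRUPTIBLE (D) state Stefan describes is the standard sysrq blocked-tasks dump; nothing here is specific to this bug, and the timeout value is only an example:

    # Inside the guest: allow sysrq, then dump all tasks in D state to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 60

    # Optionally lower the hung-task timeout so the warning fires sooner while reproducing
    sysctl -w kernel.hung_task_timeout_secs=30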