From mboxrd@z Thu Jan  1 00:00:00 1970
From: Kevin Wolf <kwolf@redhat.com>
Subject: Re: qemu-kvm hangs if multipath device is queing
Date: Tue, 18 May 2010 15:22:36 +0200
Message-ID: <4BF2949C.8010108@redhat.com>
References: <4BDF3F94.1080608@dlh.net> <4BDFDC44.9030808@redhat.com>	<4BE00750.6040804@dlh.net> <4BE01120.30608@redhat.com>	<4BE02440.6010802@dlh.net> <4BE028BF.1000603@redhat.com> <4BEAB4B0.70803@dlh.net> <4BED1740.1080604@redhat.com> <4BF275B1.8030106@dlh.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>
To: Peter Lieven <pl@dlh.net>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:19306 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757269Ab0ERNXI (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 18 May 2010 09:23:08 -0400
In-Reply-To: <4BF275B1.8030106@dlh.net>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 18.05.2010 13:10, Peter Lieven wrote:
> hi kevin,
> 
> here is the backtrace of (hopefully) all threads:
> 
> ^C
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 0x7f39b72656f0 (LWP 10695)]
> 0x00007f39b6c3ea94 in __lll_lock_wait () from /lib/libpthread.so.0
> 
> (gdb) thread apply all bt
> 
> Thread 2 (Thread 0x7f39b57b8950 (LWP 10698)):
> #0  0x00007f39b6c3eedb in read () from /lib/libpthread.so.0
> #1  0x000000000049e723 in qemu_laio_completion_cb (opaque=0x22b4010) at linux-aio.c:125
> #2  0x000000000049e8ad in laio_cancel (blockacb=0x22ba310) at linux-aio.c:184

I think it's stuck here in an endless loop:

    while (laiocb->ret == -EINPROGRESS)
        qemu_laio_completion_cb(laiocb->ctx);

Can you verify this by single-stepping one or two loop iterations? ret
and errno after the read call could be interesting, too.

We'll be stuck in an endless loop if the request never completes, which
might well happen in your scenario. I'm not sure what the right thing to
do is. We probably need to make bdrv_aio_cancel fail to avoid blocking
the whole program, but I have no idea what device emulations should do
in that case.
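
To make the "fail the cancel" idea concrete, here is a rough sketch of a
cancel that gives up after a bounded number of polls instead of spinning
forever. Everything in it (fake_acb, fake_reap, try_cancel, the max_polls
limit, the choice of -EBUSY) is made up for illustration and is not QEMU
code; it only shows the shape of the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-flight AIO request (not QEMU's struct). */
struct fake_acb {
    int ret;         /* -EINPROGRESS until the request completes */
    int polls_left;  /* simulation: completes after this many polls; 0 = never */
};

/* Hypothetical stand-in for reaping completions from the AIO context. */
static void fake_reap(struct fake_acb *acb)
{
    if (acb->polls_left > 0 && --acb->polls_left == 0)
        acb->ret = 0;
}

/*
 * Poll at most max_polls times. Returns 0 if the request completed,
 * or -EBUSY if it is still in flight and the caller has to cope.
 */
static int try_cancel(struct fake_acb *acb, int max_polls)
{
    assert(acb != NULL);
    for (int i = 0; i < max_polls && acb->ret == -EINPROGRESS; i++)
        fake_reap(acb);
    return acb->ret == -EINPROGRESS ? -EBUSY : 0;
}
```

The open question stays the same: a caller that gets the failure back
still needs a sensible way to react to it.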

As long as we can't handle that condition correctly, leaving the hang in
place is probably the best option. Maybe add some sleep to avoid 100%
CPU consumption.
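
The sleep variant could look roughly like the sketch below. Again, the
names (fake_req, fake_poll_completions, wait_with_sleep) and the sleep
interval are assumptions for the example, not the actual linux-aio.c
code. The loop still never returns if the request never completes; it
just stops burning a full CPU while waiting.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the request state (not QEMU's struct). */
struct fake_req {
    int ret;         /* -EINPROGRESS until the request completes */
    int polls_left;  /* simulation: completes after this many polls */
};

/* Hypothetical stand-in for the completion callback that reaps events. */
static void fake_poll_completions(struct fake_req *req)
{
    if (req->polls_left > 0 && --req->polls_left == 0)
        req->ret = 0;
}

/*
 * Same loop as before, but sleep between polls while the request is
 * still in flight. Returns the number of polls that were needed.
 */
static int wait_with_sleep(struct fake_req *req, long sleep_ns)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = sleep_ns };
    int polls = 0;

    assert(req != NULL);
    while (req->ret == -EINPROGRESS) {
        fake_poll_completions(req);
        polls++;
        if (req->ret == -EINPROGRESS)
            nanosleep(&ts, NULL);  /* yield the CPU before polling again */
    }
    return polls;
}
```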

Kevin