Message-ID: <4B1D1882.7040404@redhat.com>
Date: Mon, 07 Dec 2009 16:00:18 +0100
From: Kevin Wolf
MIME-Version: 1.0
References: <4B0537EB.4000909@siemens.com> <4B055AEF.4030406@redhat.com> <4B055D32.3040601@siemens.com> <4B1D0E34.6070907@siemens.com>
In-Reply-To: <4B1D0E34.6070907@siemens.com>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
List-Id: qemu-devel.nongnu.org
To: Jan Kiszka
Cc: qemu-devel, kvm

On 07.12.2009 15:16, Jan Kiszka wrote:
>> Likely not. What I did was nothing special, and I did not notice such a
>> crash in the last months.
>
> And now it happened again (qemu-kvm head, during kernel installation
> from network onto local qcow2 disk). Any clever idea how to proceed with
> this?

I still haven't seen this myself, and I still have no theory about what could
be happening here. I'm just trying to write down what I think must happen to
get into this situation. Maybe you can point at something I'm missing, or
maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list. A
loop can only be created by inserting the same object twice.
The only insertion into this list happens in qcow2_alloc_cluster_offset
(though in an earlier call than the one in the stack trace). There is only
one relevant caller of this function, qcow_aio_write_cb. Part of it is a call
to run_dependent_requests(), which removes the request from
s->cluster_allocs. So after the QLIST_REMOVE in run_dependent_requests() the
request can't be contained in the list, but by the call to
qcow2_alloc_cluster_offset it must be contained again. It must be re-added
somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The only
thing that could somehow become dangerous is the qcow_aio_write_cb(req, 0);
for queued requests in run_dependent_requests().

> I could try to run the step in a loop, hopefully retriggering it once in
> a (likely longer) while. But then we need some good instrumentation first.

I can't explain what exactly would be going wrong there, but if my thoughts
are right so far, I think that moving this call into a Bottom Half would
help. So if you can reproduce it in a loop, this could be worth a try. I'd
certainly prefer to understand the problem first, but thinking about AIO is
the perfect way to make your brain hurt...

Kevin