From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: Endless loop in qcow2_alloc_cluster_offset
Date: Mon, 07 Dec 2009 17:09:45 +0100
Message-ID: <4B1D28C9.70201@siemens.com>
References: <4B0537EB.4000909@siemens.com> <4B055AEF.4030406@redhat.com> <4B055D32.3040601@siemens.com> <4B1D0E34.6070907@siemens.com> <4B1D1882.7040404@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: qemu-devel <qemu-devel@nongnu.org>, kvm <kvm@vger.kernel.org>
To: Kevin Wolf <kwolf@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from goliath.siemens.de ([192.35.17.28]:16331 "EHLO
	goliath.siemens.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S934347AbZLGQKA (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 7 Dec 2009 11:10:00 -0500
In-Reply-To: <4B1D1882.7040404@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Kevin Wolf wrote:
> On 07.12.2009 15:16, Jan Kiszka wrote:
>>> Likely not. What I did was nothing special, and I did not notice such a
>>> crash in the last months.
>> And now it happened again (qemu-kvm head, during kernel installation
>> from network onto local qcow2-disk). Any clever idea how to proceed with
>> this?
> 
> I still haven't seen this and I still have no theory on what could be
> happening here. I'm just trying to write down what I think must happen
> to get into this situation. Maybe you can point at something I'm missing
> or maybe it helps you to have a sudden inspiration.
> 
> The crash happens because we have a loop in the s->cluster_allocs list.
> A loop can only be created by inserting an object twice. The only insert
> to this list happens in qcow2_alloc_cluster_offset (though an earlier
> call than that of the stack trace).
> 
> There is only one relevant caller of this function, qcow_aio_write_cb.
> Part of it is a call to run_dependent_requests which removes the request
> from s->cluster_allocs. So after the QLIST_REMOVE in
> run_dependent_requests the request can't be contained in the list, but
> at the call of qcow2_alloc_cluster_offset it must be contained again. It
> must be added somewhere in between these two calls.
> 
> In qcow_aio_write_cb there isn't much happening between these calls. The
> only thing that could somehow become dangerous is the
> qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

If m->nb_clusters is 0, the entry won't be removed from the list. And
if something corrupted nb_clusters so that it became 0 while the entry is
still enqueued, we would see the deadly loop I faced, right?
Unfortunately, any memory corruption that generates such zeros
can cause this...
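
To make the scenario concrete, here is a minimal, self-contained sketch.
It is not the qcow2 code: struct meta, insert_head, remove_if_allocated
and simulate are hypothetical stand-ins, and the list is a simplified
head-insert list in the style of a QLIST. It only assumes what is said
above, namely that removal is skipped when nb_clusters is 0 and that the
same entry is then inserted a second time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-flight allocation entry. */
struct meta {
    int nb_clusters;
    struct meta *next;
    struct meta **prevp;   /* address of the pointer that points at us */
};

struct list {
    struct meta *first;
};

/* Head insertion; like a QLIST insert, it does not check for duplicates. */
static void insert_head(struct list *l, struct meta *m)
{
    assert(l != NULL && m != NULL);
    m->next = l->first;
    if (l->first != NULL)
        l->first->prevp = &m->next;
    l->first = m;
    m->prevp = &l->first;
}

/* Removal guarded by nb_clusters: a zero count skips the unlink. */
static void remove_if_allocated(struct meta *m)
{
    assert(m != NULL);
    if (m->nb_clusters != 0) {
        if (m->next != NULL)
            m->next->prevp = m->prevp;
        *m->prevp = m->next;
    }
}

/* Floyd cycle detection, so the demo terminates even on a looped list. */
static int has_cycle(const struct list *l)
{
    const struct meta *slow = l->first;
    const struct meta *fast = l->first;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}

/* Insert, optionally corrupt nb_clusters to 0, remove, insert again.
 * Returns 1 if the list ends up containing a loop. */
int simulate(int corrupt)
{
    struct list l = { NULL };
    struct meta m = { 1, NULL, NULL };

    insert_head(&l, &m);
    if (corrupt)
        m.nb_clusters = 0;      /* simulated memory corruption */
    remove_if_allocated(&m);    /* skipped when nb_clusters == 0 */
    m.nb_clusters = 1;          /* the next allocation reuses the entry */
    insert_head(&l, &m);        /* second insert of the same object */
    return has_cycle(&l);
}
```

Without the corruption, simulate(0) leaves a one-element list. With it,
simulate(1) leaves m.next pointing at m itself, so a plain traversal of
the list would never terminate.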

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux