From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: Endless loop in qcow2_alloc_cluster_offset
Date: Mon, 07 Dec 2009 15:16:20 +0100
Message-ID: <4B1D0E34.6070907@siemens.com>
References: <4B0537EB.4000909@siemens.com> <4B055AEF.4030406@redhat.com> <4B055D32.3040601@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel <qemu-devel@nongnu.org>,
	kvm <kvm@vger.kernel.org>
To: unlisted-recipients:; (no To-header on input)
Return-path: <kvm-owner@vger.kernel.org>
Received: from thoth.sbs.de ([192.35.17.2]:23220 "EHLO thoth.sbs.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S935126AbZLGOQx (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 7 Dec 2009 09:16:53 -0500
In-Reply-To: <4B055D32.3040601@siemens.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Jan Kiszka wrote:
> Kevin Wolf wrote:
>> Hi Jan,
>>
>> Am 19.11.2009 13:19, schrieb Jan Kiszka:
>>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>>> $5 = (struct QCowL2Meta *) 0xcb3568
>>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>>
>>> So next == first.
>> Oops. Doesn't sound quite right...
>>
>>> Is something fiddling with cluster_allocs concurrently, e.g. some signal
>>> handler? Or what could cause this list corruption? Would it be enough to
>>> move to QLIST_FOREACH_SAFE?
>> Are there any specific signals you're thinking of? Related to block code
> 
> No, was just blind guessing.
> 
>> I can only think of SIGUSR2 and this one shouldn't call any block driver
>> functions directly. You're using aio=threads, I assume? (It's the default)
> 
> Yes, all on defaults.
> 
>> QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
>> doesn't insert or remove any elements. If the list is corrupted now, I
>> think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
>> the endless loop would occur one call later.
>>
>> The only way I see to get such a loop in a list is to re-insert an
>> element that already is part of the list. The only insert is at
>> qcow2-cluster.c:777. Remains the question how we came there twice
>> without run_dependent_requests() removing the L2Meta from our list first
>> - because this is definitely wrong...
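
For reference, the double-insert theory can be checked in isolation. The following is only a minimal sketch, assuming the insert performs the usual BSD-style head-insert pointer updates; struct meta, struct meta_head and insert_head() are hypothetical stand-ins, not the actual qcow2 code. Inserting the same element twice without removing it leaves le_next pointing at the element itself and le_prev pointing at the list head, which matches the shape of the gdb dump above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QCowL2Meta; only the list linkage matters. */
struct meta {
    struct meta *le_next;   /* next element in the list */
    struct meta **le_prev;  /* address of the pointer that points at us */
};

struct meta_head {
    struct meta *lh_first;
};

/* Same pointer updates as a BSD-style insert at the list head. */
static void insert_head(struct meta_head *head, struct meta *elm)
{
    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        elm->le_next->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}

/* Returns 1 if inserting the same element twice makes it point at itself. */
static int double_insert_self_loops(void)
{
    struct meta_head head = { NULL };
    struct meta m = { NULL, NULL };

    insert_head(&head, &m);
    assert(m.le_next == NULL);   /* sane after the first insert */

    insert_head(&head, &m);      /* second insert, no remove in between */

    return head.lh_first == &m
        && m.le_next == &m
        && m.le_prev == &head.lh_first;
}
```

Any loop that follows le_next until it hits NULL will never terminate on such a list, independent of whether the plain or the _SAFE iteration variant is used.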
>>
>> Presumably, it's not reproducible?
> 
> Likely not. What I did was nothing special, and I did not notice such a
> crash in the last months.

And now it happened again (qemu-kvm head, during kernel installation
from network onto local qcow2-disk). Any clever ideas on how to proceed
with this?

I could try to run the step in a loop, hopefully retriggering it once in
a (likely longer) while. But then we need some good instrumentation first.
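
One possible shape for such instrumentation, purely as a sketch: check on every insert whether the element is already linked, and refuse (or abort) if it is, so that the offending caller shows up in a backtrace instead of the later endless loop. The names below (struct meta, list_contains(), checked_insert_head(), the max_steps cap) are assumptions made up for illustration and do not exist in qemu.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for QCowL2Meta; only the list linkage matters. */
struct meta {
    struct meta *le_next;
    struct meta **le_prev;
};

struct meta_head {
    struct meta *lh_first;
};

/*
 * Returns 1 if elm is already linked into the list, 0 if it is not.
 * The walk is capped at max_steps so that an already corrupted list
 * cannot hang the check itself; hitting the cap returns -1.
 */
static int list_contains(const struct meta_head *head,
                         const struct meta *elm, size_t max_steps)
{
    const struct meta *it = head->lh_first;
    size_t steps = 0;

    while (it != NULL) {
        if (it == elm) {
            return 1;
        }
        if (++steps >= max_steps) {
            return -1;
        }
        it = it->le_next;
    }
    return 0;
}

/*
 * Insert at the head only if elm is not linked yet.
 * Returns 0 on success, -1 if the insert was refused.
 */
static int checked_insert_head(struct meta_head *head, struct meta *elm,
                               size_t max_steps)
{
    int found = list_contains(head, elm, max_steps);

    if (found != 0) {
        fprintf(stderr, "list check failed for %p (%s)\n", (void *)elm,
                found == 1 ? "already in list" : "walk exceeded cap");
        return -1;
    }

    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        elm->le_next->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
    return 0;
}
```

In a debug build the refusal could be turned into an abort() to get a core dump at the point of the second insert. The linear walk costs something per request, but the in-flight list should be short.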

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
