From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kiszka
Subject: Re: Endless loop in qcow2_alloc_cluster_offset
Date: Mon, 07 Dec 2009 15:16:20 +0100
Message-ID: <4B1D0E34.6070907@siemens.com>
References: <4B0537EB.4000909@siemens.com> <4B055AEF.4030406@redhat.com> <4B055D32.3040601@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: Kevin Wolf, qemu-devel, kvm
To: unlisted-recipients:; (no To-header on input)
Return-path:
Received: from thoth.sbs.de ([192.35.17.2]:23220 "EHLO thoth.sbs.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935126AbZLGOQx (ORCPT ); Mon, 7 Dec 2009 09:16:53 -0500
In-Reply-To: <4B055D32.3040601@siemens.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

Jan Kiszka wrote:
> Kevin Wolf wrote:
>> Hi Jan,
>>
>> On 19.11.2009 13:19, Jan Kiszka wrote:
>>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
>>> $5 = (struct QCowL2Meta *) 0xcb3568
>>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
>>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>>
>>> So next == first.
>> Oops. Doesn't sound quite right...
>>
>>> Is something fiddling with cluster_allocs concurrently, e.g. some signal
>>> handler? Or what could cause this list corruption? Would it be enough to
>>> move to QLIST_FOREACH_SAFE?
>> Are there any specific signals you're thinking of? Related to block code
>
> No, was just blind guessing.
>
>> I can only think of SIGUSR2 and this one shouldn't call any block driver
>> functions directly. You're using aio=threads, I assume? (It's the default)
>
> Yes, all on defaults.
>
>> QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
>> doesn't insert or remove any elements.
>> If the list is corrupted now, I
>> think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
>> the endless loop would occur one call later.
>>
>> The only way I see to get such a loop in a list is to re-insert an
>> element that already is part of the list. The only insert is at
>> qcow2-cluster.c:777. Remains the question how we came there twice
>> without run_dependent_requests() removing the L2Meta from our list first
>> - because this is definitely wrong...
>>
>> Presumably, it's not reproducible?
>
> Likely not. What I did was nothing special, and I did not notice such a
> crash in the last months.

And now it happened again (qemu-kvm head, during kernel installation
from network onto local qcow2-disk). Any clever idea how to proceed with
this?

I could try to run the step in a loop, hopefully retriggering it once in
a (likely longer) while. But then we need some good instrumentation first.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
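
Kevin's hypothesis in the quoted text, that such a cycle comes from re-inserting an element that is already on the list, can be illustrated with a minimal sketch. It assumes QLIST_INSERT_HEAD performs the same pointer updates as a BSD-style LIST_INSERT_HEAD; the names struct meta, struct meta_head, meta_insert_head, meta_list_contains and meta_insert_head_checked are hypothetical stand-ins, not QEMU code. Under that assumption, a second insert of the same element leaves le_next pointing at the element itself while le_prev still points at the list head, which would be consistent with the gdb dump (le_next equal to the element's own address, le_prev different). The checked variant is one possible shape for the instrumentation asked about: a bounded walk before the insert that reports a duplicate instead of corrupting the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QCowL2Meta: only the list linkage is modelled. */
struct meta {
    struct meta *le_next;   /* next element, NULL at the tail */
    struct meta **le_prev;  /* address of the pointer that points at us */
};

/* Hypothetical stand-in for the cluster_allocs list head. */
struct meta_head {
    struct meta *lh_first;
};

/* Same pointer updates as a BSD-style LIST_INSERT_HEAD (assumption:
 * QLIST_INSERT_HEAD behaves identically). No duplicate check. */
void meta_insert_head(struct meta_head *head, struct meta *elm)
{
    assert(head != NULL && elm != NULL);
    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        head->lh_first->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}

/* Bounded walk, so it terminates even on an already cyclic list.
 * Returns true if elm is reached within `limit` steps. */
bool meta_list_contains(const struct meta_head *head,
                        const struct meta *elm, size_t limit)
{
    const struct meta *cur = head->lh_first;
    for (size_t i = 0; cur != NULL && i < limit; i++) {
        if (cur == elm) {
            return true;
        }
        cur = cur->le_next;
    }
    return false;
}

/* Instrumented insert: refuses a duplicate and returns false so the
 * caller can log or abort at the point of the second insert. */
bool meta_insert_head_checked(struct meta_head *head, struct meta *elm,
                              size_t limit)
{
    if (meta_list_contains(head, elm, limit)) {
        return false;
    }
    meta_insert_head(head, elm);
    return true;
}
```

Calling meta_insert_head() twice with the same element on an initially empty list reproduces the "next == first" state; calling meta_insert_head_checked() instead returns false on the second call and leaves the list intact, which is where a backtrace or log line could be emitted.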