From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kevin Wolf
Subject: Re: Endless loop in qcow2_alloc_cluster_offset
Date: Mon, 07 Dec 2009 16:00:18 +0100
Message-ID: <4B1D1882.7040404@redhat.com>
References: <4B0537EB.4000909@siemens.com> <4B055AEF.4030406@redhat.com> <4B055D32.3040601@siemens.com> <4B1D0E34.6070907@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: qemu-devel, kvm
To: Jan Kiszka
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:52564 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757918AbZLGPBV (ORCPT ); Mon, 7 Dec 2009 10:01:21 -0500
In-Reply-To: <4B1D0E34.6070907@siemens.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 07.12.2009 15:16, Jan Kiszka wrote:
>> Likely not. What I did was nothing special, and I did not noticed such a
>> crash in the last months.
>
> And now it happened again (qemu-kvm head, during kernel installation
> from network onto local qcow2-disk). Any clever idea how to proceed with
> this?

I still haven't seen this and I still have no theory on what could be
happening here. I'm just trying to write down what I think must happen to
get into this situation. Maybe you can point at something I'm missing, or
maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list.
A loop can only be created by inserting an object twice. The only insert
into this list happens in qcow2_alloc_cluster_offset (though in an earlier
call than the one in the stack trace).

There is only one relevant caller of this function, qcow_aio_write_cb.
Part of it is a call to run_dependent_requests, which removes the request
from s->cluster_allocs. So after the QLIST_REMOVE in
run_dependent_requests the request can't be contained in the list, but at
the call of qcow2_alloc_cluster_offset it must be contained again. It must
be added somewhere in between these two calls.

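To make the double-insert argument concrete, here is a minimal standalone
sketch. It does not use QEMU's QLIST macros: Req, ReqList and the list_*
helpers are hypothetical stand-ins that only mimic the pointer updates of a
BSD-style list insert and remove, which is what I assume QLIST does. It
shows that inserting an element that is still linked leaves a cycle behind,
while remove followed by insert does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request tracked in s->cluster_allocs. */
typedef struct Req {
    int id;
    struct Req *next;    /* next element, NULL at the tail */
    struct Req **pprev;  /* address of the pointer that points at us */
} Req;

typedef struct ReqList {
    Req *first;
} ReqList;

/* Insert at the head, with the pointer updates of a BSD-style list. */
static void list_insert_head(ReqList *l, Req *r)
{
    r->next = l->first;
    if (r->next != NULL) {
        r->next->pprev = &r->next;
    }
    l->first = r;
    r->pprev = &l->first;
}

/* Unlink an element; it must currently be linked into a list. */
static void list_remove(Req *r)
{
    assert(r->pprev != NULL);
    if (r->next != NULL) {
        r->next->pprev = r->pprev;
    }
    *r->pprev = r->next;
}

/* Tortoise-and-hare check: true if following next never reaches NULL. */
static bool list_has_cycle(const ReqList *l)
{
    const Req *slow = l->first;
    const Req *fast = l->first;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}
```

With two requests a and b inserted in that order, the list is b -> a. A
second list_insert_head() of a sets a->next to b while b->next still points
at a, so any traversal of the list spins forever. Calling list_remove() on a
before re-inserting it keeps the list acyclic.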
In qcow_aio_write_cb there isn't much happening between these calls. The
only thing that could somehow become dangerous is the
qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

> I could try to run the step in a loop, hopefully retriggering it once in
> a (likely longer) while. But then we need some good instrumentation first.

I can't explain what exactly would be going wrong there, but if my
thoughts are right so far, I think that moving this into a Bottom Half
would help. So if you can reproduce it in a loop this could be worth a try.

I'd certainly prefer to understand the problem first, but thinking about
AIO is the perfect way to make your brain hurt...

Kevin