From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4A01C0C6.7020902@redhat.com>
Date: Wed, 06 May 2009 19:54:30 +0300
From: Avi Kivity
References: <1241627950-22195-1-git-send-email-kwolf@redhat.com>
In-Reply-To: <1241627950-22195-1-git-send-email-kwolf@redhat.com>
Subject: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
To: Kevin Wolf
Cc: markmc@redhat.com, qemu-devel@nongnu.org

Kevin Wolf wrote:
> A write on a qcow2 image that results in the allocation of a new cluster can be
> divided into two parts: there is a part which happens before the actual data is
> written; this is about allocating clusters in the image file. The second part
> happens in the AIO callback handler and is about making L2 entries for the
> newly allocated clusters.
>
> When two AIO requests happen to touch the same free cluster, there is a chance
> that the second request still sees the cluster as free because the callback of
> the first request has not yet run. This means that it reserves another cluster
> for the same virtual hard disk offset and hooks it up in its own callback,
> overwriting what the first callback has done. Long story cut short: Bad Things
> happen.
>
> This patch maintains a list of in-flight requests that have allocated new
> clusters. A second request touching the same cluster doesn't find an entry yet
> in the L2 table, but can look it up in the list now. The second request is
> limited so that it either doesn't touch the allocation of the first request
> (so it can have a non-overlapping allocation) or doesn't exceed the
> end of the allocation of the first request (so it can reuse this allocation
> and doesn't need to do anything itself).
>
>
> +        } else {
> +            /* Reuse the previously allocated clusters */
> +            if (end_offset > old_end_offset) {
> +                nb_clusters = (old_end_offset - offset) >> s->cluster_bits;
> +            }
> +            cluster_offset = old_alloc->cluster_offset + (offset - old_offset);
> +            m->nb_clusters = 0;
> +            goto out;
> +        }
> +    }
> +
> +    LIST_INSERT_HEAD(&s->cluster_allocs, m, next);
>

What happens if the second request completes before the first? Then, when
the first request completes, alloc_cluster_link_l2() will call
copy_clusters() and overwrite the second request.

Also, the second request now depends on the first to update its metadata.
But if the first request fails, it will not update its metadata, and the
second request will complete without error but also without updating its
metadata.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
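
For reference, here is a rough, self-contained sketch of the overlap check the
patch introduces. The stand-in types and the helper name check_in_flight() are
illustrative only, not the actual qcow2 code; the arithmetic mirrors the quoted
hunk.

/* Illustrative sketch, not QEMU code: trim a cluster allocation request
 * against allocations that are still in flight, so two requests never
 * hook up L2 entries for the same guest offset independently. */
#include <stdint.h>
#include <sys/queue.h>

struct in_flight_alloc {                 /* stands in for QCowL2Meta */
    uint64_t offset;                     /* guest offset being allocated */
    uint64_t cluster_offset;             /* location in the image file */
    int nb_clusters;
    LIST_ENTRY(in_flight_alloc) next;
};
LIST_HEAD(alloc_list, in_flight_alloc);  /* stands in for s->cluster_allocs */

/* Returns a cluster_offset to reuse from an in-flight allocation, or 0 if
 * the (possibly trimmed) request should allocate fresh clusters itself. */
static uint64_t check_in_flight(struct alloc_list *allocs, uint64_t offset,
                                int *nb_clusters, unsigned cluster_bits)
{
    uint64_t cluster_size = 1ULL << cluster_bits;
    uint64_t end = offset + (uint64_t)*nb_clusters * cluster_size;
    struct in_flight_alloc *old;

    LIST_FOREACH(old, allocs, next) {
        uint64_t old_end = old->offset +
            (uint64_t)old->nb_clusters * cluster_size;

        if (end <= old->offset || offset >= old_end) {
            continue;                    /* no overlap with this allocation */
        }
        if (offset < old->offset) {
            /* Stop short of the in-flight allocation; allocate the rest */
            *nb_clusters = (old->offset - offset) >> cluster_bits;
            return 0;
        }
        /* Request starts inside the in-flight allocation: reuse it and
         * write no L2 entry of its own (m->nb_clusters = 0 in the patch) */
        if (end > old_end) {
            *nb_clusters = (old_end - offset) >> cluster_bits;
        }
        return old->cluster_offset + (offset - old->offset);
    }
    return 0;                            /* nothing in flight overlaps */
}

The reuse branch is what the question above is about: a request that takes this
path writes its data into clusters whose L2 entry will only be written by the
first request's callback, so its completion is tied to that earlier request.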