From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1LYWsj-00009q-R9
	for qemu-devel@nongnu.org; Sat, 14 Feb 2009 21:37:21 -0500
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1LYWsi-00009d-DG
	for qemu-devel@nongnu.org; Sat, 14 Feb 2009 21:37:21 -0500
Received: from [199.232.76.173] (port=53092 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1LYWsi-00009a-7S
	for qemu-devel@nongnu.org; Sat, 14 Feb 2009 21:37:20 -0500
Received: from mail2.shareable.org ([80.68.89.115]:39670)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from <jamie@shareable.org>) id 1LYWsh-0000WI-OA
	for qemu-devel@nongnu.org; Sat, 14 Feb 2009 21:37:20 -0500
Date: Sun, 15 Feb 2009 02:37:18 +0000
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: [Qemu-devel] Re: qcow2 corruption observed,
	fixed by reverting old change
Message-ID: <20090215023718.GD9281@shareable.org>
References: <20090211070049.GA27821@shareable.org>
	<loom.20090213T060937-534@post.gmane.org>
	<49955681.9070301@suse.de> <20090213162336.GI18471@shareable.org>
	<aaccfcb60902132231v53b54070sf7a0151ee565214@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <aaccfcb60902132231v53b54070sf7a0151ee565214@mail.gmail.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Marc Bevand <m.bevand@gmail.com>
Cc: qemu-devel@nongnu.org, Gleb Natapov <gleb@redhat.com>, kvm@vger.kernel.org

Marc Bevand wrote:
> On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier <jamie@shareable.org> wrote:
> >
> > Marc..  this is quite a serious bug you've reported.  Is there a
> > reason you didn't report it earlier?
> 
> Because I only started hitting that bug a couple weeks ago after
> having upgraded to a buggy kvm version.
> 
> > Is there a way to restructure the code and/or how it works so it's
> > more clearly correct?
> 
> I am seriously concerned about the general design of qcow2. The code
> base is more complex than it needs to be, the format itself is
> susceptible to race conditions causing cluster leaks when updating
> some internal datastructures, it gets easily fragmented, etc.

When I read it, I thought the code was remarkably compact for what it
does, although I agree that the leaks, fragmentation and inconsistency
on crashes are serious.  From elsewhere it sounds like the refcount
update cost is significant too.

> I am considering implementing a new disk image format that supports
> base images, snapshots (of the guest state), clones (of the disk
> content); that has a radically simpler design & code base; that is
> always consistent "on disk"; that is friendly to delta diffing (ie.
> space-efficient when used with ZFS snapshots or rsync); and that makes
> use of checksumming & replication to detect & fix corruption of
> critical data structures (ideally this should be implemented by the
> filesystem, unfortunately ZFS is not available everywhere :D).

You have just described a high quality modern filesystem or database
engine; both would certainly be far more complex than qcow2's code.
Especially with checksumming and replication :)

ZFS isn't everywhere, but it looks like everyone wants to clone ZFS's
best features everywhere (but not its worst feature: lots of memory
required).

I've had similar thoughts myself, by the way :-)

> I believe the key to achieve these (seemingly utopian) goals is to
> represent a disk "image" as a set of sparse files, 1 per
> snapshot/clone.

You can already do this, if your filesystem supports snapshotting.  On
Linux hosts, any filesystem can snapshot by using LVM underneath it
(although it's not pretty to do).  A few experimental Linux
filesystems let you snapshot at the filesystem level.
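
To make the "one sparse file per snapshot/clone" idea concrete, here is
a rough sketch of the lookup it implies.  Everything in it is made up
for illustration (the names Layer and snapshot, the CLUSTER size), and
each layer is only an in-memory stand-in for a sparse file: a real
version would have to ask the filesystem which ranges are allocated.

```python
CLUSTER = 4096  # hypothetical cluster size, chosen only for the example


class Layer:
    """One snapshot or clone; an in-memory stand-in for one sparse file."""

    def __init__(self, parent=None):
        self.parent = parent   # the layer this one was snapshotted from
        self.clusters = {}     # cluster index -> bytes; a missing key is a hole

    def write(self, index, data):
        # Writes only ever touch the top layer, so parents stay unchanged.
        if len(data) != CLUSTER:
            raise ValueError("writes must be exactly one cluster")
        self.clusters[index] = data

    def read(self, index):
        # Walk down the chain until some layer has the cluster allocated.
        layer = self
        while layer is not None:
            if index in layer.clusters:
                return layer.clusters[index]
            layer = layer.parent
        return bytes(CLUSTER)  # a hole in every layer reads as zeros


def snapshot(layer):
    """Start a new, empty layer on top of an existing one."""
    return Layer(parent=layer)
```

A read through a clone falls back to the base for anything the clone
has not written, which is the behaviour a host-level snapshot would
provide under the same files.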

A feature you missed in the utopian vision is sharing backing store
for equal parts of files between different snapshots _after_ they've
been written in separate branches (with the same data), and also among
different VMs.  It's becoming stylish to put similarity detection in
the filesystem somewhere too :-)
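
As a rough illustration of what I mean by similarity detection, the
sketch below stores each cluster once, keyed by a hash of its contents,
so identical clusters written by different snapshots or VMs share
backing store.  The names (DedupStore, store_image) and the cluster
size are invented for the example, and it ignores everything hard
about doing this on disk.

```python
import hashlib

CLUSTER = 4096  # hypothetical cluster size, chosen only for the example


class DedupStore:
    """Keeps one copy of each distinct cluster, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> cluster bytes
        self.refs = {}   # digest -> number of references to that cluster

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data          # first time we see this content
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest

    def get(self, digest):
        return self.blobs[digest]


def store_image(store, image):
    """Split an image into clusters and return its table of digests."""
    return [store.put(image[i:i + CLUSTER])
            for i in range(0, len(image), CLUSTER)]
```

Two images that happen to contain the same cluster end up with the same
digest in their tables, regardless of which branch or VM wrote it.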

-- Jamie