Date: Sun, 15 Feb 2009 02:37:18 +0000
From: Jamie Lokier
Subject: Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change
Message-ID: <20090215023718.GD9281@shareable.org>
References: <20090211070049.GA27821@shareable.org> <49955681.9070301@suse.de> <20090213162336.GI18471@shareable.org>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: Marc Bevand
Cc: qemu-devel@nongnu.org, Gleb Natapov, kvm@vger.kernel.org

Marc Bevand wrote:
> On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier wrote:
> >
> > Marc.. this is quite a serious bug you've reported. Is there a
> > reason you didn't report it earlier?
>
> Because I only started hitting that bug a couple weeks ago after
> having upgraded to a buggy kvm version.
>
> > Is there a way to restructure the code and/or how it works so it's
> > more clearly correct?
>
> I am seriously concerned about the general design of qcow2.
> The code base is more complex than it needs to be, the format itself is
> susceptible to race conditions causing cluster leaks when updating
> some internal datastructures, it gets easily fragmented, etc.

When I read it, I thought the code was remarkably compact for what it
does, although I agree that the leaks, fragmentation and inconsistency
on crashes are serious. From elsewhere it sounds like the refcount
update cost is significant too.

> I am considering implementing a new disk image format that supports
> base images, snapshots (of the guest state), clones (of the disk
> content); that has a radically simpler design & code base; that is
> always consistent "on disk"; that is friendly to delta diffing (ie.
> space-efficient when used with ZFS snapshots or rsync); and that makes
> use of checksumming & replication to detect & fix corruption of
> critical data structures (ideally this should be implemented by the
> filesystem, unfortunately ZFS is not available everywhere :D).

You have just described a high-quality modern filesystem or database
engine; both would certainly be far more complex than qcow2's code.
Especially with checksumming and replication :)

ZFS isn't everywhere, but it looks like everyone wants to clone ZFS's
best features everywhere (but not its worst feature: lots of memory
required).

I've had similar thoughts myself, by the way :-)

> I believe the key to achieve these (seemingly utopian) goals is to
> represent a disk "image" as a set of sparse files, 1 per
> snapshot/clone.

You can already do this, if your filesystem supports snapshotting. On
Linux hosts, any filesystem can snapshot by using LVM underneath it
(although it's not pretty to do). A few experimental Linux filesystems
let you snapshot at the filesystem level.
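
To make the "one sparse file per snapshot/clone" idea concrete, here is
a rough sketch of how reads and writes could resolve through a chain of
layers. It is only an illustration under my own assumptions, not the
proposed format and not how qcow2 or LVM work: the names Layer,
snapshot and BLOCK_SIZE are hypothetical, and a dict stands in for each
sparse file.

```python
from typing import Dict, Optional

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


class Layer:
    """One snapshot/clone layer; the dict stands in for a sparse file."""

    def __init__(self, parent: Optional["Layer"] = None) -> None:
        self.parent = parent
        # Holds only the blocks written in this layer; everything else
        # is a "hole" that falls through to the parent.
        self.blocks: Dict[int, bytes] = {}

    def write(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        # Copy-on-write: the new data lands here, parents are untouched.
        self.blocks[index] = data

    def read(self, index: int) -> bytes:
        layer: Optional[Layer] = self
        while layer is not None:
            if index in layer.blocks:
                return layer.blocks[index]
            layer = layer.parent
        # A hole in every layer reads back as zeros.
        return bytes(BLOCK_SIZE)


def snapshot(current: Layer) -> Layer:
    """Return a new writable child; `current` is read-only by convention."""
    return Layer(parent=current)
```

Taking two children of the same parent gives two clones that share all
unmodified blocks, which is the space saving being discussed.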

A feature you missed in the utopian vision is sharing backing store
for equal parts of files between different snapshots _after_ they've
been written in separate branches (with the same data), and also among
different VMs. It's becoming stylish to put similarity detection in
the filesystem somewhere too :-)

-- Jamie
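
A rough sketch of the block sharing mentioned above, assuming blocks
are identified by a SHA-256 hash of their content. This is one simple
way to detect equal data, not something any existing filesystem or
image format is claimed to do this way; DedupStore and Image are
hypothetical names, and everything lives in memory.

```python
import hashlib
from typing import Dict, Optional


class DedupStore:
    """Content-addressed block store shared by several images."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}
        self.refcount: Dict[str, int] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = data
            self.refcount[key] = 0
        self.refcount[key] += 1
        return key

    def release(self, key: str) -> None:
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            # The last reference is gone, so the storage can be reclaimed.
            del self.refcount[key]
            del self.blocks[key]


class Image:
    """A disk image (or snapshot branch) as a table of block keys."""

    def __init__(self, store: DedupStore) -> None:
        self.store = store
        self.table: Dict[int, str] = {}

    def write(self, index: int, data: bytes) -> None:
        old = self.table.get(index)
        # Take the new reference before dropping the old one, so that
        # rewriting identical data never frees the block in between.
        self.table[index] = self.store.put(data)
        if old is not None:
            self.store.release(old)

    def read(self, index: int) -> Optional[bytes]:
        key = self.table.get(index)
        return None if key is None else self.store.blocks[key]
```

Two images (or two VMs) that independently write the same data end up
pointing at a single stored block. Note that this brings back
reference counting, with the update cost that goes with it.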