From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=35361 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pz93q-0000Kv-ON
	for qemu-devel@nongnu.org; Mon, 14 Mar 2011 10:47:55 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Pz93o-0002RF-W5
	for qemu-devel@nongnu.org; Mon, 14 Mar 2011 10:47:54 -0400
Received: from mail-yx0-f173.google.com ([209.85.213.173]:55140)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Pz93o-0002Qx-Oq
	for qemu-devel@nongnu.org; Mon, 14 Mar 2011 10:47:52 -0400
Received: by yxk8 with SMTP id 8so2571543yxk.4
	for <qemu-devel@nongnu.org>; Mon, 14 Mar 2011 07:47:49 -0700 (PDT)
Message-ID: <4D7E2A91.3040300@codemonkey.ws>
Date: Mon, 14 Mar 2011 09:47:45 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
References: <OF3C9DAE9F.EC6B5878-ON85257826.00715C10-85257826.007A14FB@LocalDomain>	<OFF3B73D6C.D1225EB2-ON85257838.006A5FC5-85257838.006C65F3@us.ibm.com>	<4D5BC467.4070804@redhat.com>	<m3r5b53duy.fsf_-_@blackfin.pond.sub.org>	<4D5E4271.80501@redhat.com>	<20110220221357.GO4580@hall.aurel32.net>	<4D62295E.1030504@redhat.com>	<AANLkTi=ZAZ=yCZXHcfeT6715fwJzTXAUbHLbKd206pQA@mail.gmail.com>	<OFAEB4CD91.BE989F29-ON8525783F.007366B8-85257840.00130B47@LocalDomain>	<OF03D5ED77.46666A58-ON85257852.0013F083-85257852.001FF9A6@us.ibm.com>	<4D7D036B.4050706@codemonkey.ws>	<OF1B5598FB.C373F3D7-ON85257853.000BAE11-85257853.000D6017@us.ibm.com>	<4D7E167A.1020509@codemonkey.ws>
	<4D7E22FF.3090803@redhat.com>
In-Reply-To: <4D7E22FF.3090803@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Chunqiang Tang <ctang@us.ibm.com>, Stefan Hajnoczi <stefanha@gmail.com>, Markus Armbruster <armbru@redhat.com>, Aurelien Jarno <aurelien@aurel32.net>, qemu-devel@nongnu.org

On 03/14/2011 09:15 AM, Kevin Wolf wrote:
>> The file system can keep a lot of these things around pretty easily but
>> with your proposal, it seems like there can only be one.  If you support
>> many of them, I think you'll degenerate to something as complex as a
>> reference count table.
> IIUC, he already uses a refcount table.

Well, he needs a separate mechanism to make trim/discard work, but for 
the snapshot discussion, a reference count table is avoided.

The bitmap only covers whether the guest has accessed a block or not.  
Then there is a separate table that maps guest offsets to offsets within 
the file.
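
As a rough illustration of that split, here is a minimal sketch in
Python.  It is a toy in-memory model, not FVD's actual on-disk layout:
the class name, the field names, the 64 KiB block size and the
append-only allocator are all assumptions made up for the example.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size, for illustration only


class TwoTableImage:
    """Toy model: an access bitmap plus a guest-to-file offset table."""

    def __init__(self, num_blocks):
        # The bitmap: has the guest touched this block yet?
        self.accessed = [False] * num_blocks
        # The lookup table: guest block index -> offset within the file.
        self.mapping = [None] * num_blocks
        # Hypothetical allocator: hand out file space in append order.
        self.next_free = 0

    def write(self, guest_offset):
        """Record a guest write; return the file offset backing it."""
        block = guest_offset // BLOCK_SIZE
        if self.mapping[block] is None:
            self.mapping[block] = self.next_free
            self.next_free += BLOCK_SIZE
        self.accessed[block] = True
        return self.mapping[block] + guest_offset % BLOCK_SIZE

    def lookup(self, guest_offset):
        """Return the file offset for a read, or None if the block was
        never accessed and the read should go to the base image."""
        block = guest_offset // BLOCK_SIZE
        if not self.accessed[block]:
            return None
        return self.mapping[block] + guest_offset % BLOCK_SIZE
```

Note that write() touches both structures for a newly allocated block;
on disk, those would be two separate metadata updates.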

I haven't thought hard about it, but my guess is that there is an 
ordering constraint between these two pieces of metadata, which is why 
the journal is necessary.  The complexity of a journal worries me even 
more than that of a reference count table.

>   Actually, I think that a
> refcount table is a requirement to provide the interesting properties
> that internal snapshots have (see my other mail).

Well, the trick here AFAICT is that you're basically storing external 
snapshots internally.  So it's sort of like a bunch of FVD formats 
embedded into a single image.

> Refcount tables aren't a very complex thing either. In fact, it makes a
> format much simpler to have one concept like refcount tables instead of
> adding another different mechanism for each new feature that would be
> natural with refcount tables.

I think it's a reasonable design goal to minimize any metadata updates 
in the fast path.  If we can write 1 piece of metadata versus writing 2, 
then it's worth exploring IMHO.

> The only problem with them is that they are metadata that must be
> updated. However, I think we have discussed enough how to avoid the
> greatest part of that cost.

Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid 
the writes for the refcount table?

>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>> entirely and say that moving to a previous snapshot destroys any of its
>> children.  I think this ends up being a simplifying assumption that is
>> worth investigating further.
>>
>>   From the use-cases that I'm aware of (backup and RAS), I think these
>> semantics are okay.
> I don't think this semantics would be expected. And anyway, would this
> really allow simplification of the format?

I don't know, I'm really just trying to separate out the implementation 
of the format from the use-cases we're trying to address.

Even if we're talking about qcow3, if we only really care about 
read-only snapshots, perhaps we can add a feature bit for that and use 
it to make the WCE=0 case much faster.

But the fundamental question is, does this satisfy the use-cases we care 
about?

Regards,

Anthony Liguori

>   I'm afraid that you would go
> for complicated solutions with odd semantics just because of an
> arbitrary dislike of refcounts.
>
> Kevin
>