From: Anthony Liguori
Date: Wed, 23 Feb 2011 09:47:50 -0600
To: Avi Kivity
Cc: Kevin Wolf, Chunqiang Tang, qemu-devel@nongnu.org,
 Markus Armbruster, Stefan Hajnoczi
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Message-ID: <4D652C26.2010304@codemonkey.ws>
In-Reply-To: <4D652984.90401@redhat.com>
References: <4D5BC467.4070804@redhat.com> <4D5E4271.80501@redhat.com>
 <4D5E8031.5020402@codemonkey.ws> <4D637A20.9020307@redhat.com>
 <4D650F10.3060900@redhat.com> <4D651858.9040106@codemonkey.ws>
 <4D651BD2.3040500@redhat.com> <4D6527F4.2010101@codemonkey.ws>
 <4D652984.90401@redhat.com>
List-Id: qemu-devel.nongnu.org

On 02/23/2011 09:36 AM, Avi Kivity wrote:
> On 02/23/2011 05:29 PM, Anthony Liguori wrote:
>>
>>>> existed, what about snapshots? Are we okay having a feature in a
>>>> prominent format that isn't going to meet users' expectations?
>>>>
>>>> Is there any hope that an image with 1000 or 10000 snapshots is
>>>> going to have even reasonable performance in qcow2?
>>> Is there any hope for backing file chains of 1000 files or more? I
>>> haven't tried it out, but in theory I'd expect that internal snapshots
>>> could cope better with it than external ones because internal snapshots
>>> don't have to go through the whole chain all the time.
>>
>> I don't think there's a user expectation of backing file chains of
>> 1000 files performing well. However, I've talked to a number of
>> customers that have been interested in using internal snapshots for
>> checkpointing, which would involve a large number of snapshots.
>>
>> In fact, Fabrice originally added qcow2 because he was interested in
>> doing reverse debugging. The idea of internal snapshots was to store
>> a high number of checkpoints to allow reverse debugging to be optimized.
>
> I don't see how that works, since the memory image is duplicated for
> each snapshot. So thousands of snapshots = terabytes of storage, and
> hours of creating the snapshots.

Fabrice wanted to use CoW as a mechanism to deduplicate the memory
contents with the on-disk state, specifically to address this problem.
For the longest time, there was a comment in the savevm code along
these lines. It might still be there.

I think the lack of on-disk hashes was a critical missing bit to make
this feature really work well.

> Migrate-to-file with block live migration, or even better, something
> based on Kemari would be a lot faster.
>
>> I think the way snapshot metadata is stored makes this not realistic
>> since it's stored in more or less a linear array. I think to
>> really support a high number of snapshots, you'd want to store a hash
>> with each block that has a refcount > 1. I think you quickly
>> end up reinventing btrfs in the process, though.
>
> Can you elaborate? What's the problem with a linear array of
> snapshots (say up to 10,000 snapshots)?

Lots of things. The array will start to consume quite a bit of
contiguous space as it gets larger, which means it needs to be
relocated. Deleting a snapshot is a far more expensive operation than
it needs to be.
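
To sketch what I mean (simplified C, not the actual qcow2 on-disk
structures; the second struct and all of its field names are purely
hypothetical):

#include <stdint.h>

/* Today (roughly): snapshot metadata lives in one contiguous on-disk
 * table. Growing the table past its current allocation means copying
 * it somewhere else, and deleting a snapshot means touching the
 * refcount of every cluster the snapshot references, so the cost
 * scales with the size of the image rather than with the snapshot. */
struct snapshot_table_entry {
    uint64_t l1_table_offset;   /* snapshot's private copy of the L1 table */
    uint32_t l1_size;
    uint64_t vm_state_size;     /* saved memory image, if any */
    char     id_str[16];
    char     name[256];
};

/* Hypothetical alternative: keep a content hash next to any cluster
 * whose refcount is > 1, so shared data can be found and released
 * without walking the whole snapshot array. Names are made up. */
struct shared_cluster_entry {
    uint64_t cluster_offset;
    uint16_t refcount;
    uint8_t  content_hash[32];  /* e.g. SHA-256 of the cluster data */
};

Something like the second structure is also what you'd need to dedup
the memory image against the on-disk state.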

Regards,

Anthony Liguori