From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=53643 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Otl9s-0003Is-7c
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 13:43:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1Otl9p-0005nP-Lq
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 13:43:36 -0400
Received: from mail-iw0-f173.google.com ([209.85.214.173]:53398)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1Otl9p-0005nD-Ge
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 13:43:33 -0400
Received: by iwn38 with SMTP id 38so1342050iwn.4
	for <qemu-devel@nongnu.org>; Thu, 09 Sep 2010 10:43:32 -0700 (PDT)
Message-ID: <4C891CC0.1090108@codemonkey.ws>
Date: Thu, 09 Sep 2010 12:43:28 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>	<4C84E738.3020802@codemonkey.ws>	<4C865187.6090508@redhat.com>
	<AANLkTikv+3mfCgAAB7X6VF5cHYWJ1qRGFOK96mDQx3LT@mail.gmail.com>
	<4C8885BB.8020000@redhat.com>
In-Reply-To: <4C8885BB.8020000@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, qemu-devel@nongnu.org

On 09/09/2010 01:59 AM, Avi Kivity wrote:
>  On 09/08/2010 06:07 PM, Stefan Hajnoczi wrote:
>>>>      uint32_t table_size;          /* table size, in clusters */
>>> Presumably L1 table size?  Or any table size?
>>>
>>> Hm.  It would be nicer not to require contiguous sectors anywhere.  How
>>> about a variable- or fixed-height tree?
>> Both extents and fancier trees don't fit the philosophy, which is to
>> keep things straightforward and fast by doing less.  With extents and
>> trees you've got something that looks much more like a full-blown
>> filesystem.  Is there an essential feature or characteristic that QED
>> cannot provide in its current design?
>>
>
> Not using extents mean that random workloads on very large disks will 
> continuously need to page in L2s (which are quite large, 256KB is 
> large enough that you need to account for read time, not just seek 
> time).  Keeping it to two levels means that the image size is limited, 
> not very good for an image format designed in 2010.

Define "very large disks".

My target for VM images is 100GB-1TB.  Practically speaking, that at 
least covers us for the next 5 years.

Since QED has rich support for features, we can continue to evolve the 
format over time in a backwards-compatible way.  I'd rather defer 
support for massively huge disks until we better understand the true 
nature of the problem.
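
For a rough sense of scale, here is a back-of-the-envelope sketch of what 
two levels can address.  The 256KB table size comes from the discussion 
above; the 64KB cluster size and 8-byte table entries are assumptions for 
illustration, and the macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only: the 64 KB cluster size and the
 * 8-byte table entry are not stated in this thread; the 256 KB table is. */
#define TWO_LEVEL_CLUSTER_SIZE (64ULL * 1024)
#define TWO_LEVEL_TABLE_SIZE   (256ULL * 1024)
#define TWO_LEVEL_ENTRY_SIZE   8ULL

/* Number of offsets a single L1 or L2 table can hold. */
static uint64_t entries_per_table(void)
{
    return TWO_LEVEL_TABLE_SIZE / TWO_LEVEL_ENTRY_SIZE;
}

/* Guest data mapped by one fully populated L2 table. */
static uint64_t bytes_per_l2(void)
{
    return entries_per_table() * TWO_LEVEL_CLUSTER_SIZE;
}

/* Largest image that one L1 table plus its L2 tables can address. */
static uint64_t max_image_size(void)
{
    return entries_per_table() * bytes_per_l2();
}

/* Number of L2 tables needed to map an image of the given size. */
static uint64_t l2_tables_for(uint64_t image_size)
{
    /* The image must fit within what two levels can address. */
    assert(image_size <= max_image_size());
    return (image_size + bytes_per_l2() - 1) / bytes_per_l2();
}
```

Under those assumptions each L2 table maps 2GB, a 1TB image needs 512 L2 
tables (128MB of L2 metadata in total), and two levels top out at 64TB.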

>>> Is the physical image size always derived from the host file 
>>> metadata?  Is
>>> this always safe?
>> In my email summarizing crash scenarios and recovery we cover the
>> bases and I think it is safe to rely on file size as physical image
>> size.  The drawback is that you need a host filesystem and cannot
>> directly use a bare block device.  I think that is acceptable for a
>> sparse format, otherwise we'd be using raw.
>
> Hm, we do have a use case for qcow2-over-lvm.  I can't say it's 
> something I like, but a point to consider.

We are specifically not supporting that use case in QED today, and 
there's a good reason for it.  Cluster allocation performs well because 
L2 updates can avoid synchronous metadata updates; only L1 updates 
need them.

We achieve this by leveraging the underlying filesystem's metadata.  
Filesystems are much smarter about their metadata updates: they keep a 
journal to delay synchronous updates, among other fancy things.

If we tried to represent the disk size in the header, we would have to 
do an fsync() on every cluster allocation.
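
To illustrate the difference, here is a minimal sketch (not QED's actual 
code) of allocating a cluster when the physical image size is taken from 
the host file size.  open_image(), physical_size(), alloc_cluster() and 
the 64KB cluster size are hypothetical names and values chosen for the 
example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed cluster size; not stated in this thread. */
#define SKETCH_CLUSTER_SIZE 65536

/* Open (or create) an image file on a host filesystem. */
static int open_image(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Physical image size, read from the host filesystem's metadata. */
static int64_t physical_size(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    return (int64_t)st.st_size;
}

/*
 * Allocate a cluster by writing it at the cluster-aligned end of the file.
 * Returns the new cluster's offset, or -1 on error.
 *
 * There is no header write and no fsync() here: growing the file is what
 * records the allocation.  If the size were stored in the image header
 * instead, this is where the header update and fsync() would have to go.
 */
static int64_t alloc_cluster(int fd, const void *data)
{
    int64_t size = physical_size(fd);
    int64_t offset;

    assert(data != NULL);
    if (size < 0) {
        return -1;
    }
    /* Round the current file size up to the next cluster boundary. */
    offset = (size + SKETCH_CLUSTER_SIZE - 1)
             / SKETCH_CLUSTER_SIZE * SKETCH_CLUSTER_SIZE;
    if (pwrite(fd, data, SKETCH_CLUSTER_SIZE, (off_t)offset)
        != SKETCH_CLUSTER_SIZE) {
        return -1;
    }
    return offset;
}
```

Each call places the new cluster at the current end of the file, so the 
next allocation offset can always be recovered from the file size alone.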

I can only imagine the use case for qcow2-over-lvm is performance.  But 
the performance of QED on a file system is so much better than qcow2 
that you can safely just use a file system and avoid the complexity of 
qcow2 over lvm.

Regards,

Anthony Liguori