Message-ID: <4C891CC0.1090108@codemonkey.ws>
Date: Thu, 09 Sep 2010 12:43:28 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com> <4C84E738.3020802@codemonkey.ws> <4C865187.6090508@redhat.com> <4C8885BB.8020000@redhat.com>
In-Reply-To: <4C8885BB.8020000@redhat.com>
List-Id: qemu-devel.nongnu.org
To: Avi Kivity
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/09/2010 01:59 AM, Avi Kivity wrote:
> On 09/08/2010 06:07 PM, Stefan Hajnoczi wrote:
>>>> uint32_t table_size; /* table size, in clusters */
>>> Presumably L1 table size? Or any table size?
>>>
>>> Hm. It would be nicer not to require contiguous sectors anywhere. How
>>> about a variable- or fixed-height tree?
>> Both extents and fancier trees don't fit the philosophy, which is to
>> keep things straightforward and fast by doing less. With extents and
>> trees you've got something that looks much more like a full-blown
>> filesystem.
>> Is there an essential feature or characteristic that QED
>> cannot provide in its current design?
>>
>
> Not using extents mean that random workloads on very large disks will
> continuously need to page in L2s (which are quite large, 256KB is
> large enough that you need to account for read time, not just seek
> time). Keeping it to two levels means that the image size is limited,
> not very good for an image format designed in 2010.

Define "very large disks".

My target for VM images is 100GB-1TB. Practically speaking, that at least
covers us for the next 5 years.

Since QED has rich support for features, we can continue to evolve the
format over time in a backwards compatible way. I'd rather delay
supporting massively huge disks until we better understand the true
nature of the problem.

>>> Is the physical image size always derived from the host file
>>> metadata? Is this always safe?
>> In my email summarizing crash scenarios and recovery we cover the
>> bases and I think it is safe to rely on file size as physical image
>> size. The drawback is that you need a host filesystem and cannot
>> directly use a bare block device. I think that is acceptable for a
>> sparse format, otherwise we'd be using raw.
>
> Hm, we do have a use case for qcow2-over-lvm. I can't say it's
> something I like, but a point to consider.

We specifically are not supporting that use-case in QED today. There's a
good reason for it. For cluster allocation, we achieve good performance
because for L2 cluster updates, we can avoid synchronous metadata updates
(except for L1 updates).

We achieve synchronous metadata updates by leveraging the underlying
filesystem's metadata. The underlying filesystems are much smarter about
their metadata updates. They'll keep a journal to delay synchronous
updates and other fancy things.

If we tried to represent the disk size in the header, we would have to do
an fsync() on every cluster allocation.
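
To make the file-size argument concrete, here is a minimal sketch of allocating a cluster by growing the file, so that the host filesystem's own size metadata serves as the physical image size. It is not QED's actual code: the 64KB CLUSTER_SIZE and the function names image_physical_size() and alloc_cluster() are assumptions made for illustration, and only standard POSIX calls (fstat, pwrite) are used.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed cluster size for illustration; not taken from the thread. */
#define CLUSTER_SIZE (64 * 1024)

/*
 * Hypothetical helper: the physical image size is whatever the host
 * filesystem reports for the file, not a field stored in the header.
 * Returns -1 if fstat() fails.
 */
static int64_t image_physical_size(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0) {
        return -1;
    }
    return (int64_t)st.st_size;
}

/*
 * Hypothetical helper: allocate one cluster by writing it at the end of
 * the file, rounded up to a cluster boundary. Growing the file updates
 * the filesystem's size metadata, so no header write and no fsync() is
 * issued here. Returns the offset of the new cluster, or -1 on error.
 */
static int64_t alloc_cluster(int fd, const void *data)
{
    int64_t size = image_physical_size(fd);
    int64_t offset;

    if (size < 0) {
        return -1;
    }
    offset = (size + CLUSTER_SIZE - 1) / CLUSTER_SIZE * CLUSTER_SIZE;

    /* A short write is treated as a failure in this sketch. */
    if (pwrite(fd, data, CLUSTER_SIZE, (off_t)offset) != CLUSTER_SIZE) {
        return -1;
    }
    return offset;
}
```

By contrast, a format that stored the image size in its own header would have to rewrite the header and call fsync() after each such allocation to keep the recorded size consistent with the data. On a bare block device there is no file size to consult, which is why this approach needs a host filesystem.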

I can only imagine the use case for qcow2-over-lvm is performance. But the
performance of QED on a file system is so much better than qcow2 that you
can safely just use a file system and avoid the complexity of qcow2 over
lvm.

Regards,

Anthony Liguori